ctlab / GADMA

Genetic Algorithm for Demographic Model Analysis
Other
46 stars 14 forks source link

Lower bound of split in inference of model with structure #92

Closed Enricobazzi closed 4 months ago

Enricobazzi commented 6 months ago

Hello,

I've been trying to run GADMA2 to infer the demographic history of two closely related species.

I was looking for a way of setting a lower bound for the divergence time that is higher than the default. I see that there is an option for the upper bound (Upper bound of first split), but not one for the lower bound of this parameter, which I guess is the general one for time intervals (=1e-15?).

I have tried adding to the parameter file something like:

Lower bound: 10
Parameter identifiers: T

and to the extra parameter file something like:

min_T: 10

But I don't see any change in the inferred times, as I guess these options only work for custom model inference.

Is there a way I could work around this issue? How could I set a lower bound higher than the default for the split time?

Thank you in advance for your answer and please forgive me if I've missed a very obvious way of solving this issue.

All the best,

Enrico

noscode commented 6 months ago

Hi Enrico,

Thank you very much for your feedback. I think I know what is the problem and if so I have already solved it recently. However, I have not published the last release yet, so it is available in the current GitHub repository. Could you please install GADMA from the repository and check if the result will change?

Best regards, Ekaterina

Enricobazzi commented 6 months ago

Hello Ekaterina,

Thank you for your fast reply. I installed the version from the GitHub repository as you suggested and tried it with adding min_T: 10 to the extra parameters and it works! All of the time parameters are now following the set bound (yay!).

I was wondering if there would be a way to take this a step further and limiting only the split time in this way, similarly to how it's possible to set the Upper bound of first split, but for its lower bound. This would allow events within populations to not be bound by the same time intervals as the split time (like maybe a very recent change in population size). I don't know if this is something that can already be done or would need further implementation.

Again, thank you for your reply and for indicating the way to solve the issue.

All the best,

Enrico

noscode commented 6 months ago

Hi Enrico,

Unfortunately, Lower bound of first split is not implemented. I can check if it is something that can be easily implemented. There is a difference between options Upper bound of first split and T_max - the first one is in physical units (generations) and the second one is in genetic units. Which option suit your purpose better? Usually, physical units make more sense. There is one solution how you can set a lower bound in genetic units for a specific parameter. Unfortunately, it includes implementation of the model itself. It can be implemented using the engine you are using or using GADMA interface for model specification (which is not described in the documentation, but is available). The latter option allows inferring the population dynamics, which is a major feature of GADMA.

Please let me know what option is better for you, and we can discuss more about it.

Best regards, Ekaterina

Enricobazzi commented 5 months ago

Hi Ekaterina,

Thank you very much for your answer! I’ll try to answer the different elements of your reply separately.

Unfortunately, Lower bound of first split is not implemented. I can check if it is something that can be easily implemented.

I imagined, as I couldn't find anything in the docs. Would be nice for specific cases like mine, where you would expect within-population event time spans to be much different in scale from between-population event time spans (e.g. an ancestral expansion right before split, or a very recent bottleneck).

There is a difference between options Upper bound of first split and T_max - the first one is in physical units (generations) and the second one is in genetic units. Which option suit your purpose better? Usually, physical units make more sense.

Yes, I realised (thanks to the great documentation!). I think for this purpose physical units would fit better, as I am setting bounds based on previous knowledge of the species, which would be independent of inferred ancestral population size.

There is one solution how you can set a lower bound in genetic units for a specific parameter. Unfortunately, it includes implementation of the model itself. It can be implemented using the engine you are using or using GADMA interface for model specification (which is not described in the documentation, but is available)

Do you mean running GADMA with a pre-determined amount of different models specified either in the chosen engine or through GADMA directly? If this is the case, I would be giving up the most appealing thing about GADMA to me, which is the structured model. I like the idea of the emergence of the “better” model from the data itself, rather than from a predetermined set of models. Do you think that running the structured model a number of times to getting different solutions, then adjusting the bounds in the different solutions and comparing them would be an option?

The latter option allows inferring the population dynamics, which is a major feature of GADMA.

What do you mean by inferring population dynamics? Sorry if I’m missing something obvious here.

Again thanks a lot for your great reply and help!

All the best,

Enrico

noscode commented 5 months ago

Dear Enrico,

According to your answer, It looks like I need to check the possibility to implement Lower bound for population split time, as it is exactly what you need (physical time units etc.). I will check it and return to you with an answer in a couple of days.

Do you mean running GADMA with a pre-determined amount of different models specified either in the chosen engine or through GADMA directly? If this is the case, I would be giving up the most appealing thing about GADMA to me, which is the structured model. I like the idea of the emergence of the “better” model from the data itself, rather than from a predetermined set of models. Do you think that running the structured model a number of times to getting different solutions, then adjusting the bounds in the different solutions and comparing them would be an option?

Yes, in that case you have to specify the model by yourself manually. I totally understand that you do not want to do it and prefer to use structure instead.

The latter option allows inferring the population dynamics, which is a major feature of GADMA.

What do you mean by inferring population dynamics? Sorry if I’m missing something obvious here. Structure model in GADMA has "flexible dynamics": it explores constant size of population, linear and exponential growth for each time epoch and for each population. I call these three options as dynamics and say that this model has flexible dynamics as GADMA automatically choose the best option for each population. As you are not going to give up structure model, you should not worry about it.

Best regards, Ekaterina

noscode commented 5 months ago

Dear Enrico,

I have added a feature of the lower bounds for split events, it is not official release yet, but you can install it using pip:

$ pip install -i https://test.pypi.org/simple/ gadma

Keep in mind that the lower or upper bound of the split should be specified in generations. You can find more information in the documentation here.

Please give it a try and tell me if there are any problems and if something is not clear.

Best regards, Ekaterina

Enricobazzi commented 5 months ago

Dear Ekaterina,

That's amazing! I'll let you know how it goes as soon as I get the chance to try it out, although I might need some time as I have a very busy couple of weeks ahead. Thank you very much for this!

All the best,

Enrico

Enricobazzi commented 4 months ago

Dear Ekaterina,

I used your command to install the version of GADMA with the lower bound feature and it seems to be working as intended. I have been trying it with some simulated data to see how everything works, but now I look forward to running GADMA with my actual data! I'll let you know if I encounter any problems.

Thank you very much for your fast answers and implementation!

All the best,

Enrico

noscode commented 4 months ago

Thank you very much for your feedback!

Best regards, Ekaterina