LohseLab / gIMble

A genome-wide IM blockwise likelihood estimation toolkit

Defining kmax and problems with gIMble optimize #144

Closed · RSchley closed this 23 hours ago

RSchley commented 4 days ago

First of all, thanks so much for developing this great method.

I am working on an extremely rapid tropical tree radiation, and am aiming to infer barriers to gene flow between several co-occurring species pairs using whole-genome resequencing data at 20x coverage. These trees are expected to display rampant incomplete lineage sorting because of their large Nc and the fact that they outcross, although they do have relatively short generation times for trees.

I have gone through the gIMble pipeline, but the best model that gIMble optimize returns differs depending on which parameter ranges I specify. Specifically, with my initial prior ranges the parameter estimates often hit the upper or lower prior bounds (at values that make no biological sense), and when I relax or change these priors, the best selected model changes.

I was wondering if this is because I am using the 2,2,2,2 kmax mutation configuration in gIMble tally. The gIMble paper says: "... the choice of --kmax involves a trade-off between the information gained by distinguishing the probabilities of rare bSFS configurations and the increase in computational complexity."

If I increased this to, for example, 4,4,4,4, would I capture more information in my sequence data, meaning that I have more power to distinguish different models and make more reliable parameter estimates?

Or is there anything else you might suggest to help me make the most robust inferences I can?

Thanks a lot! Rowan

A-J-F-Mackintosh commented 2 days ago

Hi Rowan,

Your question was likely directed at @DRL and @GertjanBisschop, but I will post an answer given that I have been in a similar situation.

It is true that increasing the kmax value would generate a more informative bSFS. However, there should be more than enough information with kmax=2,2,2,2 to fit the IM model (as well as the simpler nested models). Setting higher values is unlikely to lead to very different results (most blocks should be unaffected by the kmax value), but it would increase the computation time substantially. For example, I would expect kmax=4,4,4,4 to be prohibitively slow and to introduce more error (blocks with >10 SNPs are likely to be mapping artefacts) than useful information. So I don't think you need to change this at the moment.
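To put rough numbers on the cost, here is a back-of-envelope sketch (not gIMble's internal accounting; the assumption that each mutation type with cutoff k takes k + 2 observable values, i.e. counts 0..k plus one lumped >k class, is mine):

```python
from math import prod

def n_bsfs_configs(kmax):
    """Rough count of distinct bSFS entries, assuming each mutation
    type with cutoff k can take the values 0..k plus one lumped
    '>k' class, i.e. k + 2 possibilities per type."""
    return prod(k + 2 for k in kmax)

print(n_bsfs_configs([2, 2, 2, 2]))  # 4**4 = 256 entries
print(n_bsfs_configs([4, 4, 4, 4]))  # 6**4 = 1296 entries, ~5x as many
```

Since a probability has to be computed for each entry, runtime grows at least with the size of this table.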

The optimisation behaviour you describe suggests that your bSFS cannot be well described by an IM model with sensible parameter values. There could be a few different reasons for this:

- The split time may simply be too recent. I encountered this problem when analysing a pair of subspecies with T ~ 0.1 Ne. There is no real solution other than to sync the three Ne values, or alternatively to leave them free but interpret the values with caution.
- Hidden recombination within blocks may be distorting the fit.

My suggestion would be to start by reducing the block size (unless it is already set to a small value). There is no standard way to determine the right block size, but when rates of mutation and recombination are similar, it is best to set the block size so that the number of segregating sites per block is something like 1, 1.5 or 2 (see the sketch below). If you reduce the block size and your parameter estimates change dramatically, this suggests that the problem is (at least partially) due to hidden recombination.
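As a quick check of this heuristic, the expected number of pairwise differences per block is roughly the block length times the per-site divergence (a mean-based sketch, not part of gIMble; it ignores variance across blocks, and the dxy value is just an illustrative number):

```python
def expected_segregating_sites(block_length, dxy):
    """Mean-based approximation: E[S] ~ block_length * per-site dxy
    (ignores the variance in coalescence times across blocks)."""
    return block_length * dxy

dxy = 0.022  # illustrative per-site divergence for a population pair
for l in (32, 64, 128):
    print(l, round(expected_segregating_sites(l, dxy), 2))
# 32 -> 0.7, 64 -> 1.41, 128 -> 2.82 expected segregating sites per block
```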

Cheers,

Alex

RSchley commented 1 day ago

Thank you so much, Alex. That makes a lot of sense and your response is very much appreciated. I will start by reducing the block size.

My current block size is 64 bases; is this considered relatively small?

A-J-F-Mackintosh commented 1 day ago

It depends on the genetic diversity within (and between) the populations that you are analysing. However, diversity would have to be very high for 64 bases to be particularly problematic.

64 bases was used for the Heliconius butterflies in the gIMble paper (where per-site pi is around 0.016 in both species and dxy is 0.022), and I think it worked well there. How does diversity in your populations compare to those Heliconius species (you can get this information from gIMble info)? If it is similar, then perhaps the block size is not the issue. But if it is much higher or lower, then it could be worth decreasing or increasing the block size, respectively.

If you have multiple population pairs then you should consider setting the block length independently for each. In this paper we set the block length so that blocks contained an average of 1.5 segregating sites, and we did this separately for each dataset.

KLohse commented 1 day ago

Thanks for your interest in gIMble and your question @RSchley, and thanks for the quick and excellent answer @A-J-F-Mackintosh. As @A-J-F-Mackintosh explains, it is best to think of the block length as a relative choice. A useful (but still arbitrary) starting point is l = 2/d_xy, i.e. choose your block length such that a pair-block contains on average two differences between haplotypes sampled from different species.
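To make the rule concrete, a minimal sketch (the function name and rounding choice are illustrative; the dxy value is the Heliconius figure quoted above):

```python
def block_length_from_dxy(dxy, target=2.0):
    """Invert the rule of thumb: l = target / dxy, so that a pair-block
    contains `target` differences on average (l = 2/dxy for target = 2)."""
    return round(target / dxy)

print(block_length_from_dxy(0.022))       # 91 bases for the Heliconius dxy above
print(block_length_from_dxy(0.022, 1.5))  # 68 bases, close to the 64 used in the paper
```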