esrud / GONE

GONE: Scripts, programs and an example data set
Working with small sample sizes #22

Open pepurquin opened 1 year ago

pepurquin commented 1 year ago

Thanks for developing this program, Armando. It's really great! I'm wondering if you could provide some more detail on how effective GONE would be for very small samples. If my understanding of the MBE paper and tutorial is correct, GONE can run on fewer than 10 individuals, but the chance of getting back biased results is high, especially for recent generations and when there is population structure or recent admixture (i.e. Figure 2F). However, this can be mitigated to some extent by dropping the hc parameter to 0.05 and possibly by averaging the output from multiple GONE runs. With that in mind, after running GONE a bunch of times on a bunch of species, I have some questions that I hope you can help me with.

1) Are there any conditions where I can be reasonably confident in the Ne estimates from species where I only have between 2 and 9 samples (I have high coverage unphased genomes with lots of SNPs but no recombination maps)? I don't need the Ne results to be super precise, but being able to say confidently that species X started a major Ne decline roughly 10, 50, or 100 generations ago would be really helpful for my research.

2) Are there other artifacts I should be on the lookout for than the recent rapid increase/collapse from Figure 2F?

3) Does the artifact in figure 2F occur strictly in very recent generations or can I see it elsewhere?

4) If I see a recent Ne collapse (sudden or drawn out), but there is NOT a sudden increase preceding it, should I trust that the collapse is real? For example, what about cases where the population is more or less stable for 100 generations then suddenly drops, or other situations when the Ne decline is spread out over 5, 10, or 50 generations?

5) If I reduce the hc value to 0.05 (or lower?), should I consider discarding some number of the most recent generations in the Ne plot? For example, in the tutorial example, (if I understand the INPUT FOR GONE section correctly) an hc value of 0.05 corresponds to the first 14 generations in the Ne plot. Does that mean that the Ne values for the first 14 generations in that example are not trustworthy, or does GONE address this somehow?

6) To what extent can small sample size be compensated for by doing multiple full runs of GONE with different SNP sets? Will averaging the Ne values from 3, 5, or 10 separate runs with 50k random SNPs make a real difference?

7) You mentioned elsewhere in the forum that GONE is fine with scaffold based assemblies. In the former case, I am only using autosomal scaffolds larger than 1Mb, but for Ne estimation accuracy would it be better to spread ~1 million SNPs across the top 20, 50 or 100 scaffolds (i.e. 50k, 20k, or 10k SNPs per scaffold)?

Sorry for such a massive list of questions! I've read through the tutorial, paper, and forum a few times, and really love the program! If I can be confident in my results it would be a huge win. Thanks again!

armando-caballero commented 1 year ago

I INSERT COMMENTS:

However, this can be mitigated to some extent by dropping the hc parameter to 0.05 and possibly by averaging the output from multiple GONE runs.

Yes, it can be mitigated by using hc = 0.05, but averaging the output of multiple GONE runs will not solve the problem. This is pseudoreplication: the runs show some variation, but not a lot, because the pedigree of the particular sample used is unique.

  1. Are there any conditions where I can be reasonably confident in the Ne estimates from species where I only have between 2 and 9 samples (I have high coverage unphased genomes with lots of SNPs but no recombination maps)? I don't need the Ne results to be super precise, but being able to say confidently that species X started a major Ne decline roughly 10, 50, or 100 generations ago would be really helpful for my research.

Sample sizes of 2 to 9 are very small. You may have quite a lot of noise, and you are considering only a few particular individuals of the population, so you must be very cautious in interpreting the results. The estimated time of the decline may not be very precise, but it can give you a rough approximation.

  2. Are there other artifacts I should be on the lookout for than the recent rapid increase/collapse from Figure 2F?

The rapid collapse after an increase is, from our experience, the most common artefact.

  3. Does the artifact in figure 2F occur strictly in very recent generations or can I see it elsewhere?

It appears in the most recent generations only, as far as we have seen.

  4. If I see a recent Ne collapse (sudden or drawn out), but there is NOT a sudden increase preceding it, should I trust that the collapse is real? For example, what about cases where the population is more or less stable for 100 generations then suddenly drops, or other situations when the Ne decline is spread out over 5, 10, or 50 generations?

The artefact should occur in the most recent generations (5). If you see a collapse before that, it is possibly real.

  5. If I reduce the hc value to 0.05 (or lower?), should I consider discarding some number of the most recent generations in the Ne plot? For example, in the tutorial example, (if I understand the INPUT FOR GONE section correctly) an hc value of 0.05 corresponds to the first 14 generations in the Ne plot. Does that mean that the Ne values for the first 14 generations in that example are not trustworthy, or does GONE address this somehow?

Even if you disregard the windows with the highest c (c > 0.05) for analysis, GONE still provides inferences on the most recent generations. You do not need to discard them.
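As an illustration only, the hc cutoff can be pictured as a filter on recombination-rate windows. The sketch below is not GONE's code: the function name, the `(c, stat)` tuple layout, and the numbers are all hypothetical, chosen just to show which windows a cutoff of 0.05 would keep.

```python
def filter_windows_by_hc(windows, hc=0.05):
    """Keep only windows whose recombination rate c does not exceed hc.

    windows: list of (c, stat) tuples; this layout is a hypothetical
    stand-in for whatever per-window statistic is being analysed.
    """
    return [(c, stat) for c, stat in windows if c <= hc]


# Made-up example windows: (recombination rate c, some LD statistic)
windows = [(0.005, 0.031), (0.02, 0.012), (0.05, 0.006), (0.1, 0.004), (0.5, 0.001)]
kept = filter_windows_by_hc(windows, hc=0.05)
```

Here the windows with c = 0.1 and c = 0.5 are dropped from the analysis, while the remaining ones are still used for inference.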

  6. To what extent can small sample size be compensated for by doing multiple full runs of GONE with different SNP sets? Will averaging the Ne values from 3, 5, or 10 separate runs with 50k random SNPs make a real difference?

No, runs with different sets of SNPs are pseudoreplicates, usually very close to one another. The sample size is much more critical than the sampling of SNPs. The use of a proper genetic map is also critical.
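To make the pseudoreplication point concrete, here is a toy sketch (not a model of GONE itself). It assumes, purely for illustration, that every run returns a value centred on whatever the particular sample of individuals reflects, plus a small amount of SNP-subsampling noise. All names and numbers are hypothetical.

```python
import random
import statistics


def average_of_runs(sample_value, snp_noise_sd, n_runs, rng):
    """Average n_runs estimates that all share one sample of individuals.

    Each run adds only SNP-subsampling noise around sample_value, because
    every run reuses the same individuals (and hence the same pedigree).
    """
    runs = [rng.gauss(sample_value, snp_noise_sd) for _ in range(n_runs)]
    return statistics.mean(runs)


rng = random.Random(1)
true_value = 100.0    # toy population quantity (arbitrary units)
sample_value = 130.0  # what this particular small sample happens to reflect
for n_runs in (1, 10, 1000):
    avg = average_of_runs(sample_value, 5.0, n_runs, rng)
    print(n_runs, round(avg, 1), "offset from true value:", round(avg - true_value, 1))
```

Under these assumptions, averaging more runs only tightens the estimate around `sample_value`; the gap to `true_value` stays, which is why more individuals matter more than more SNP subsets.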

  7. You mentioned elsewhere in the forum that GONE is fine with scaffold based assemblies. In the former case, I am only using autosomal scaffolds larger than 1Mb, but for Ne estimation accuracy would it be better to spread ~1 million SNPs across the top 20, 50 or 100 scaffolds (i.e. 50k, 20k, or 10k SNPs per scaffold)?

The maximum number of chromosomes (or scaffolds) to be analysed is 200. You may consider all of them if you wish. The software combines the information from all of them, weighting each by its number of SNP pairs in each window. If the scaffolds are very short, they will obviously only provide information for windows with low c values.
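The weighting described above can be sketched as a pair-count-weighted mean per window. This is a minimal illustration under stated assumptions, not GONE's implementation: the function name, the dictionary layout, and the example values are hypothetical.

```python
from collections import defaultdict


def combine_windows(per_scaffold):
    """Pair-count-weighted mean of a per-window statistic across scaffolds.

    per_scaffold: list of dicts, one per scaffold, mapping a window label
    (here the c value of the window) to a tuple (n_pairs, stat).
    This layout is hypothetical and only illustrates the weighting idea.
    """
    total_pairs = defaultdict(int)
    weighted_sum = defaultdict(float)
    for windows in per_scaffold:
        for c_bin, (n_pairs, stat) in windows.items():
            total_pairs[c_bin] += n_pairs
            weighted_sum[c_bin] += n_pairs * stat
    # Skip windows with no SNP pairs to avoid dividing by zero
    return {c: weighted_sum[c] / total_pairs[c] for c in total_pairs if total_pairs[c] > 0}


# A long scaffold covers both windows; a short one only the low-c window.
long_scaffold = {0.01: (1000, 0.020), 0.05: (3000, 0.010)}
short_scaffold = {0.01: (1000, 0.040)}
combined = combine_windows([long_scaffold, short_scaffold])
```

In this made-up example the short scaffold contributes only to the c = 0.01 window, so the c = 0.05 window is informed by the long scaffold alone.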