Questions about input data/results interpretation

quinn-ca commented 1 year ago

Hello,

I'm interested in using Gone for my data and had a few questions about input data/results interpretation to determine if the program is appropriate for my data.

My data: I have snp data (generated using ddRADseq), sampled from four sites (sample sizes range n = 20 to n = 40). I do not have a genetic map. My final dataset has ~29K snps.

First, should any data filtering be conducted prior to running this analysis? It makes sense to me to filter for missing data, minimum mean read depth, and minimum allele count (I've noted your caution about minimum allele frequency filtering). Is it appropriate/desirable to retain all snps in a locus (for assessing genetic structure, I retained only a single snp per locus). Should I conduct analyses separately for distinct genetic clusters, given that population structure can affect results?

I ran Gone on one of my sampled sites that appeared to be its own genetic cluster. I used the default parameters (except changed hc = 0.01 per the recommendations in the user guide). The genetic locations column contained '0's. First, I noticed a variable number of snps per chromosome (1200-5500) and only chromosomes 1-13 were considered (14-25 were not included). The results suggest an Ne of 24,000 (generation 1), roughly similar to our most recent census (17,000). Additionally, the results show a drastic decrease in Ne 30 generations ago. A population expansion would make sense with what we know of our study system, although given the amount of data, I understand the timing of this expansion may not be reliable. My focal species is long-lived (30+ years) with overlapping generations. I sampled breeding adults, but cannot be sure of the age of adults sampled. From the paper, I understand that overlapping generations can be a challenge with this analysis.

Overall,

Do I have data sufficient to use Gone?
If so, should I be including/omitting filtering steps prior to running the analyses?
Given my data, are my results for this single sampling site encouraging? Are there any specific interpretations/cautions I need to keep in mind given the potentially overlapping generations sampled?

I sincerely appreciate your help! Quinn

armando-caballero commented 1 year ago

Dear Quinn, I will try to reply to your questions.

My data: I have snp data (generated using ddRADseq), sampled from four sites (sample sizes range n = 20 to n = 40). I do not have a genetic map. My final dataset has ~29K snps.

We have previously analysed RADseq data and it works fine. 29,000 SNPs may be enough. A genetic map would be better, but if one is not available the analysis is still fine, provided the average recombination rate you assume is correct.

First, should any data filtering be conducted prior to running this analysis? It makes sense to me to filter for missing data,

No. If there are missing genotypes, the software accounts for them, and this will appear in the output.

minimum mean read depth, and minimum allele count (I've noted your caution about minimum allele frequency filtering).

The read depth is something you have to assess in order to have a good genotyping procedure.

Is it appropriate/desirable to retain all snps in a locus (for assessing genetic structure, I retained only a single snp per locus).

You may use all your SNPs instead of only one per locus. SNPs that are close together will give you information on tightly linked markers.

Should I conduct analyses separately for distinct genetic clusters, given that population structure can affect results?

Yes. The population is assumed to be a closed one, without admixture. This is a typical issue in many cases.

I ran Gone on one of my sampled sites that appeared to be its own genetic cluster. I used the default parameters (except changed hc = 0.01 per the recommendations in the user guide).

Why not hc=0.05? Using hc=0.01 you may be losing information on the most recent generations. On the other hand, if there is admixture, using a low hc would be wise.

The genetic locations column contained '0's.

Yes. If you do not have a genetic map, that column contains zeroes, but the program will use the cMMb value set in the parameters file.
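
As a rough illustration of what a single average rate implies, the sketch below assumes that genetic position is simply physical position multiplied by a uniform rate in cM/Mb. The function name and the example rate of 1 cM/Mb are hypothetical, and this is not GONE's internal code.

```python
def approx_genetic_position_cm(bp_position: int, cm_per_mb: float) -> float:
    """Approximate genetic position (cM) under an assumed uniform recombination rate."""
    return bp_position / 1_000_000 * cm_per_mb


# Two SNPs at 1.5 Mb and 4.0 Mb, with an assumed average rate of 1 cM/Mb
# (the rate is an illustrative value, not a recommendation).
distance_cm = (
    approx_genetic_position_cm(4_000_000, 1.0)
    - approx_genetic_position_cm(1_500_000, 1.0)
)
print(distance_cm)
```

If the assumed rate is wrong, every inferred distance is scaled by the same factor, which is why the correctness of the average rate matters when no map is available.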

First, I noticed a variable number of snps per chromosome (1200-5500) and only chromosomes 1-13 were considered (14-25 were not included).

The number of chromosomes to be analysed is set in the parameters file. If it is -99, all chromosomes will be analysed. Note that they should be numbered 1, 2, 3, … in the map file.
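
A minimal sketch of the renumbering step, assuming a PLINK-style map file whose first whitespace-separated field is the chromosome label. The function name and the scaffold labels in the example are hypothetical.

```python
def renumber_chromosomes(map_lines):
    """Relabel the chromosome field as 1, 2, 3, ... in order of first appearance."""
    new_label = {}
    renumbered = []
    for line in map_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom = fields[0]
        if chrom not in new_label:
            new_label[chrom] = str(len(new_label) + 1)
        renumbered.append("\t".join([new_label[chrom]] + fields[1:]))
    return renumbered


# Hypothetical map lines: chromosome, SNP id, genetic position, bp position
example = [
    "scaffold_7 snp1 0 1200",
    "scaffold_7 snp2 0 5300",
    "scaffold_12 snp3 0 800",
]
renumbered = renumber_chromosomes(example)
for line in renumbered:
    print(line)
```

The sketch keeps SNP order unchanged, so it assumes the SNPs of each chromosome are already grouped together in the file.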

The results suggest an Ne of 24,000 (generation 1), roughly similar to our most recent census (17,000). Additionally, the results show a drastic decrease in Ne 30 generations ago. A population expansion would make sense with what we know of our study system, although given the amount of data, I understand the timing of this expansion may not be reliable.

Do you mean a decrease from a large number to 24,000 in the last 30 generations, whereas you would expect an expansion instead? … who knows, perhaps the admixture is giving some artefacts.

My focal species is long-lived (30+ years) with overlapping generations. I sampled breeding adults, but cannot be sure of the age of adults sampled. From the paper, I understand that overlapping generations can be a challenge with this analysis.

Overlapping generations are a factor, but we believe they are not as important as others such as population admixture, a wrong genetic map, particular sampling of individuals, etc.

Overall,

Do I have data sufficient to use Gone?

Yes

If so, should I be including/omitting filtering steps prior to running the analyses?

Yes. Try with all your SNPs.

Given my data, are my results for this single sampling site encouraging? Are there any specific interpretations/cautions I need to keep in mind given the potentially overlapping generations sampled?

Those I mentioned above.

With best wishes, Armando.

quinn-ca commented 1 year ago

Hi Armando,

Thank you for your helpful reply! I'll look more into a genetic map and see what is possible with my data, but it's good to know my data could be sufficient.

From your recommendations, I'll create a new set of snps for this analysis with rigorous genotyping protocols, but foregoing filtering for missing data. I will retain all snps per locus and create separate datasets based on my genetic structure results, to avoid issues caused by population structure.

I'm using hc = 0.01 because, based on my genetic structure analysis, my sampling sites do appear to be admixed. Thank you for pointing out the maxNCHROM input parameter; I'm not sure how I missed that. I have renamed my chromosomes to consecutive numbers, starting with 1, but I'll also update the number of chromosomes.

Apologies that my description of my results was confusing. The output suggests a population expansion beginning 30 generations ago, and a population expansion is consistent with what we might expect based on anecdotal historic counts. So, I took this as a positive step toward validating the utility of Gone for my study system and data. To put an approximate date to '30 generations ago', I've been reviewing several definitions/equations for calculating a species' generation time, which may give very different answers. Does this program calculate 'generations' with a specific definition/equation, or is the interpretation of generation time determined based solely on my understanding of my species' biology?

Thanks again! Quinn

armando-caballero commented 1 year ago

There is no specific definition of generation except the usual one: parent-progeny in a Wright-Fisher model.
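
Since the program works in generations only, converting to calendar time is left to the user. A minimal sketch, assuming a single fixed generation time in years: the value of 15 years below is a hypothetical placeholder, not an estimate for any species, and with overlapping generations the result is only approximate.

```python
def generations_to_years(generations: float, generation_time_years: float) -> float:
    """Convert a number of generations to years, given an assumed generation time."""
    return generations * generation_time_years


# 30 generations ago, with a hypothetical generation time of 15 years
years_ago = generations_to_years(30, 15)
print(years_ago)
```

Because different definitions of generation time can give quite different values, it may be worth reporting the date as a range across the plausible generation times.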

armando-caballero commented 1 year ago

Sex chromosomes cannot be analysed directly by GONE at the moment. Do not include them in your input data.
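
One way to exclude them is sketched below, assuming PLINK-style ped/map input in which each ped line has six leading columns followed by two allele columns per SNP, in map order. The function name and the chromosome label "X" are hypothetical.

```python
def drop_chromosomes(map_lines, ped_lines, exclude):
    """Remove SNPs on the excluded chromosomes from map lines and ped genotypes."""
    # Indices of SNPs to keep, based on the chromosome field of each map line
    keep = [i for i, line in enumerate(map_lines) if line.split()[0] not in exclude]
    new_map = [map_lines[i] for i in keep]

    new_ped = []
    for line in ped_lines:
        fields = line.split()
        meta, alleles = fields[:6], fields[6:]
        kept = []
        for i in keep:
            kept.extend(alleles[2 * i : 2 * i + 2])  # two alleles per SNP
        new_ped.append(" ".join(meta + kept))
    return new_map, new_ped


# Hypothetical data: three SNPs, the second on a sex chromosome labelled "X"
map_lines = ["1 snp1 0 100", "X snp2 0 200", "2 snp3 0 300"]
ped_lines = ["F1 I1 0 0 1 -9 A A C T G G"]
new_map, new_ped = drop_chromosomes(map_lines, ped_lines, {"X"})
print(new_map)
print(new_ped)
```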

esrud / GONE

Questions about input data/results interpretation #23