fgvieira / ngsF-HMM

Estimation of per-individual inbreeding tracts under a probabilistic framework
GNU General Public License v3.0

Impacts of population structure in input data? #10

Closed zjnolen closed 1 year ago

zjnolen commented 1 year ago

Hello!

I was curious if you might be able to shed some light on some best practices for input data in datasets with population substructure. In the paper for ngsF-HMM, it is mentioned that since the level of population structure in the rice dataset is unclear, the accessions are analyzed separately, so my understanding from this is that the input file should contain samples without strong population structure.

In my dataset, I have several sampled populations that all form distinct clusters in NGSadmix and PCAngsd results and have relatively high pairwise Fst between populations for the organism and geographic range we are working with. I have been using the beagle file from those analyses, which was linkage pruned with all individuals included, subsetting it into per-population beagle files, and then running these through ngsF-HMM. Since it seems the data in the paper all had SNP calling and pruning done per population, I wondered whether it might be more suitable to do SNP calling and pruning at the population level rather than the whole-dataset level when structure is present? Or, conversely, is subsetting per population even necessary in cases like this?

Thanks for any insights into this you might have and for your work on this tool, it's amazing to be able to ask these questions with low-coverage data!

Best, Zach

fgvieira commented 1 year ago

> In my dataset, I have several sampled populations that all form distinct clusters in NGSadmix and PCAngsd results and have relatively high pairwise Fst between populations for the organism and geographic range we are working with.

ngsF-HMM estimates inbreeding by measuring deviations from Hardy-Weinberg equilibrium (HWE). As such, all samples analyzed together should belong to the same population and follow HWE assumptions (except for inbreeding). If your populations have high pairwise Fst values, I'd definitely analyze them separately.
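To illustrate why pooling structured populations is a problem, here is a minimal sketch of the standard HWE-with-inbreeding genotype frequencies and of the heterozygote deficit that arises when differentiated populations are lumped together (the Wahlund effect). It is not ngsF-HMM's actual model, which is an HMM over inbreeding tracts; the function names and the allele frequencies (0.1 and 0.9) are made up for illustration, and equal population sizes are assumed.

```python
def genotype_freqs(p, F=0.0):
    """Expected (AA, Aa, aa) frequencies for allele frequency p and inbreeding coefficient F."""
    q = 1.0 - p
    return (p * p + F * p * q, 2.0 * p * q * (1.0 - F), q * q + F * p * q)


def apparent_inbreeding(pop_freqs):
    """Apparent F when equal-sized populations, each in HWE with F = 0, are pooled.

    Assumption: every population contributes the same number of individuals.
    """
    n = len(pop_freqs)
    # Heterozygosity actually present in the pooled sample.
    obs_het = sum(genotype_freqs(p)[1] for p in pop_freqs) / n
    # Heterozygosity expected under HWE from the pooled allele frequency.
    p_pooled = sum(pop_freqs) / n
    exp_het = 2.0 * p_pooled * (1.0 - p_pooled)
    return 1.0 - obs_het / exp_het


# Two hypothetical, strongly differentiated populations, neither of them inbred.
apparent_f = apparent_inbreeding([0.1, 0.9])
print(round(apparent_f, 2))  # 0.64
```

In this toy case no individual is inbred, yet the pooled sample shows a large heterozygote deficit, which a method based on deviations from HWE would read as inbreeding.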

> Since it seems the data in the paper all had SNP calling and pruning done per population, I wondered whether it might be more suitable to do SNP calling and pruning at the population level rather than the whole-dataset level when structure is present?

If possible (i.e. you have a reasonable number of samples per population), I'd recommend doing both SNP calling and LD pruning per population. If you do it jointly (and subset per population afterwards), you will end up with a lot of SNPs that are monomorphic within a given population (they carry no information and slow down the analysis) and/or linked (sites that are in LD in one population but not in the others, so joint pruning does not remove them). While the former should not have a big impact on the results (apart from slowing down the analysis), the latter can bias your results, since ngsF-HMM assumes independence of sites.
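The monomorphic-site issue can be seen in a small sketch. It assumes you already have per-population allele frequencies for a jointly called SNP set; the frequencies, the population labels, the function name `informative_sites` and the 0.05 minor-allele-frequency cutoff are all hypothetical values chosen for illustration, not ngsF-HMM defaults.

```python
def informative_sites(site_freqs, min_maf=0.05):
    """Return indices of sites whose minor allele frequency in this population is >= min_maf.

    min_maf is an assumed cutoff for illustration only.
    """
    keep = []
    for i, p in enumerate(site_freqs):
        maf = min(p, 1.0 - p)
        if maf >= min_maf:
            keep.append(i)
    return keep


# Toy allele frequencies for five SNPs called jointly across two populations.
joint_calls = {
    "popA": [0.30, 0.00, 0.50, 1.00, 0.02],
    "popB": [0.00, 0.40, 0.45, 0.10, 0.60],
}

kept = {pop: informative_sites(freqs) for pop, freqs in joint_calls.items()}
print(kept)  # {'popA': [0, 2], 'popB': [1, 2, 3, 4]}
```

Every site here is polymorphic in the joint data, but popA retains only two informative sites after subsetting. Note that a frequency filter like this does nothing about sites that are linked within one population only; that still requires LD pruning within each population.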

zjnolen commented 1 year ago

Thank you for your explanation, @fgvieira, that clears things up for me. I had realized I'd end up with monomorphic sites, but had not considered how sites linked within discrete populations might bias the results. I'll change my workflow to call SNPs and linkage prune per distinct population. Thank you!

Best, Zach