gbradburd / conStruct

method for modeling continuous and discrete population genetic structure

Predictive accuracy increasing over all values of K from X-Val #45

Closed · alex-sandercock · closed 3 years ago

alex-sandercock commented 3 years ago

Hi,

I performed the cross-validation for a dataset of 71 individuals and ~88k SNPs. The predictive accuracy steadily increases with K, approaching 0 at the highest value of K used in the analysis.

I had previously checked this dataset for population structure in ADMIXTURE and DAPC, and both suggest K=1; the layer contributions in conStruct also support K=1. So what could be the reason that I am not seeing the cross-validation figure plateau at K=1?
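For reference, my cross-validation call looked roughly like the sketch below (the argument values are illustrative, and `freqs`, `geo.dist`, and `coords` are placeholders for my allele frequency matrix, pairwise geographic distances, and sampling coordinates):

```r
library(conStruct)

# Illustrative cross-validation run; adjust n.reps, K, and n.iter
# to match your own analysis.
xval <- x.validation(train.prop = 0.9,  # 90/10 training/testing split
                     n.reps = 8,        # number of replicate partitions
                     K = 1:7,           # values of K to compare
                     freqs = freqs,
                     geoDist = geo.dist,
                     coords = coords,
                     prefix = "xval",
                     n.iter = 1e4)
```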

Thank you for your help!

Alex

[Screenshots: cross-validation predictive accuracy plotted against K]
gbradburd commented 3 years ago

Hi Alex,

The most likely reason is linkage between the SNPs in your training and testing data partitions. If the testing SNPs are not independent of the training SNPs (which they won't be, if they're linked), the cross-validation procedure will be biased toward more parameter-rich models. Ideally, if training and testing data are independent, the cross-validation procedure works by increasing model complexity until the extra model parameters are describing "noise" in the training partition that isn't shared with the testing partition. At that point, the likelihood of the testing partition (calculated using the model parameterized with the training partition) will either plateau or start to go down. However, if the training and testing data aren't independent, that "noise" can be shared, so the likelihood of the testing partition will just keep going up.
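One way to guard against this is to thin your SNPs before handing them to x.validation, so that no two retained sites sit close enough to be in strong LD. A rough sketch (this step is not part of conStruct; `snp.info`, `freqs`, and the 100 kb window are hypothetical and should be adapted to the LD scale in your system):

```r
# Hypothetical pre-thinning step: keep one randomly chosen SNP per
# 100 kb window per chromosome, so SNPs randomly assigned to the
# training/testing partitions are approximately unlinked.
# Assumes a data frame `snp.info` with columns "chrom" and "pos",
# ordered to match the columns of the allele frequency matrix `freqs`.
win <- 1e5
bin <- paste(snp.info$chrom, snp.info$pos %/% win)
keep <- vapply(split(seq_len(nrow(snp.info)), bin),
               function(i) i[sample.int(length(i), 1)],
               integer(1))
freqs.thinned <- freqs[, sort(unname(keep))]
```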

Another possibility is that you have such a large number of observations that more complex models are statistically, but not biologically, significant in their improved fit. E.g., if you have a pair of individuals that are close-ish cousins and a sufficiently large number of SNPs, it's possible that the model will want to put those two individuals in their own layer to accommodate their anomalously high relatedness. The additional layer offers significantly better fit (in the cross-validation procedure), but doesn't add much to overall model adequacy, because the number of individuals in the new layer is small and their relatedness isn't that unlikely given their geographic separation.

Either way, I'd encourage you to stick with the layer contributions. Although they're slightly less formal, I think using them for model comparison is much more robust.
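For reference, the layer-contribution calculation follows the model-comparison vignette and looks roughly like this (the file names are placeholders for your own run output):

```r
library(conStruct)

# Load the saved output of a single conStruct run (placeholder names).
load("myrun_K3_conStruct.results.Robj")  # loads `conStruct.results`
load("myrun_K3_data.block.Robj")         # loads `data.block`

# Contribution of each layer to total covariance; layers contributing
# ~0 aren't describing real structure and can be dropped.
layer.contributions <- calculate.layer.contribution(
    conStruct.results = conStruct.results[[1]],  # first chain
    data.block = data.block)
layer.contributions
```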

alex-sandercock commented 3 years ago

Thank you for such a quick and detailed reply!

I had performed fairly aggressive LD pruning in plink (50-SNP window, 10-SNP step, r² = 0.1), but maybe this wasn't enough.

I'll stick with the layer contributions, though. Can I still correctly infer from the cross-validation plot that IBD is a feature of the dataset? Or should the cross-validation plot as a whole not be given much weight in this situation?

Thanks again,

Alex

gbradburd commented 3 years ago

Yeah, I guess I'd say that if you're going with the layer contributions because you're not sure whether to trust the cross-validation procedure, you can't really use the cross-validation results to support your arguments elsewhere. If you want to make a more quantitative statement about IBD in your data (other than just looking at pairwise relatedness plotted against pairwise geographic distance), you could look at the shape of the IBD curve inferred for the layer (or the distribution of the alphaD parameter).
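Something along these lines, roughly (the file names are placeholders, and the exact list paths into the results object below are assumptions from memory, so verify them against str(conStruct.results) for your conStruct version):

```r
load("myrun_K1_conStruct.results.Robj")  # loads `conStruct.results`
load("myrun_K1_data.block.Robj")         # loads `data.block`

# Posterior of alphaD (rate at which covariance decays with distance).
# NB: this list path is an assumption -- check str(conStruct.results).
aD <- conStruct.results[[1]]$posterior$layer.params[[1]]$alphaD
hist(aD, xlab = "alphaD", main = "posterior of alphaD (layer 1)")

# Implied IBD curve at the MAP estimate, using the decay form
# alpha0 * exp(-(alphaD * d)^alpha2):
p <- conStruct.results[[1]]$MAP$layer.params[[1]]
d <- seq(0, max(data.block$geoDist), length.out = 200)
plot(d, p$alpha0 * exp(-(p$alphaD * d)^p$alpha2), type = "l",
     xlab = "pairwise geographic distance",
     ylab = "within-layer covariance")
```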

alex-sandercock commented 3 years ago

Sounds good, I'll check that out.

Alex