gbradburd / conStruct

method for modeling continuous and discrete population genetic structure
35 stars 14 forks source link

Lack of convergence (or prohibitely long running times) for analyses for a case of "complex" genetic structure #70

Closed lcespedesarias closed 1 day ago

lcespedesarias commented 1 week ago

Hello!

I am running ConStructor a dataset of approx. 212 individuals, and 4,906 SNPs. When I talk about "complex" genetic structure I mean that this is an complex of populations in which there is one stable hybrid zone in one area, deep genetic divergence associated with a geographic break in another, other areas of very low genetic structure, and so on. Is a widely distributed subspecies/species complex.

I have been able to run the analyses, but I am getting results that don't look very sensible, and for most K values this is (I think) due to the lack of convergence. I am running 4 chains of 1'000.000 steps, using average allele frequencies per locality. But that length of chains does not seem to be enough for convergence. I increased the number of steps to 2'000.000 but it has been running for a week and it seems like it will take a lot longer. If you have any advice on how to make runs more efficient, they are very much appreciated!

Something that I am thinking is that it maybe make sense to do analyses for different portions of the species complex separately. For example, there is a south and a north group that are quite distinct based on other analyses, and I could run independent construct analyses for each group. Would this make sense and potentially help to "simplify" the analysis so convergence can be achieved in a reasonable time?

I used the strategy of averaging allele frequencies per locality to help the running time issue. However, I also have doubts that this is a good idea given the existence of a hybrid zone within the complex. Which implies that individuals in the same place might have very different genetic backgrounds. Do you think in my case I should avoid working with this average and working with the individual values instead?

I am attaching the R script for the K=5 run, and the resulting plots for the spatial model. Any advice on any of this are very much appreciated!

All the best,

Laura

construct_complex_run_K5.R.zip complex_s_k5_pie.map.chain_1.pdf complex_s_k5_structure.plot.chain_1.pdf complex_s_k5_trace.plots.chain_1.pdf

gbradburd commented 1 week ago

Hi Laura,

Hope that helps! -Gideon

gbradburd commented 1 week ago

Sorry - forgot to add that it's possible that the model is a poor fit to your system because the value of K is too high. Do you get sensical results for smaller values of K?

lcespedesarias commented 5 days ago

Hi Gideon, Thank you so much for your thorough answer! It was all very helpful.

Thank you so much for responding so thoroughly and kindly! I will try the things that I mentioned above for now, but if you have any additional comments/suggestions are the moment they are super welcome!

Thanks!

All the best,

Laura

gbradburd commented 5 days ago

I'm surprised that, even after collapsing individuals into populations, you still are losing ~75% of your SNPs! Although maybe if these are RADseq data and there's a lot of allelic dropout, that's not as surprising? I would definitely recommend dropping individuals with more missing data before scrapping the whole analysis. A thousand SNPs is still a lot of data! And, if you see geographic signal in a PCA or in pairwise Fst, I would guess that there's sufficient information in the data to parameterize the model. Does the inferred pattern of IBD look like it provides a reasonable fit to the data at K=1 for the spatial model?

lcespedesarias commented 5 days ago

Hi Gideon,

Thank you for your reply! Because of your comment, I realized that I am dropping a lot of SNPs before collapsing individuals into populations because I am using very strict missing data filters (in vcftools, before importing data to R). I think is a relict of previous runs when I was running analyses at the individual level. I will try calling SNPs again with less strict filters and will expect that I lose a lot less then since the no missing data filter would apply per population and not individual.

To answer your other question: the inferred pattern of IBD does looks like it provides a somewhat reasonable fit for K=1, but there is at least one marked discrete break in genetic structure (based on other analyses) so at a glance K=2 seems like it fits quite reasonably too. That is a very good point though, I have not tested K=1 in my runs yet, which I should do.

I will definitely keep you updated when the runs with (hopefully) more SNPs run!

Thanks so much!

Best,

Laura

gbradburd commented 1 day ago

Sounds good! Do you want to keep this issue open, or is it resolved?

lcespedesarias commented 1 day ago

Hi Gideon,

Yeah, I think it is resolved. Thanks so much for all your input!

Best,

Laura