Lack of convergence (or prohibitively long running times) for analyses of a case of "complex" genetic structure

lcespedesarias commented 1 week ago

Hello!

I am running conStruct on a dataset of approx. 212 individuals and 4,906 SNPs. By "complex" genetic structure I mean that this is a complex of populations with a stable hybrid zone in one area, deep genetic divergence associated with a geographic break in another, other areas of very low genetic structure, and so on. It is a widely distributed subspecies/species complex.

I have been able to run the analyses, but I am getting results that don't look very sensible, and for most K values this is (I think) due to a lack of convergence. I am running 4 chains of 1,000,000 steps, using average allele frequencies per locality, but that chain length does not seem to be enough for convergence. I increased the number of steps to 2,000,000, but it has been running for a week and it seems like it will take a lot longer. If you have any advice on how to make the runs more efficient, it would be very much appreciated!

One thing I am considering is whether it may make sense to analyze different portions of the species complex separately. For example, there is a south and a north group that are quite distinct based on other analyses, and I could run independent conStruct analyses for each group. Would this make sense and potentially help to "simplify" the analysis so convergence can be achieved in a reasonable time?

I averaged allele frequencies per locality to reduce the running time. However, I have doubts that this is a good idea given the existence of a hybrid zone within the complex, which implies that individuals in the same place might have very different genetic backgrounds. Do you think that in my case I should avoid this averaging and work with the individual values instead?

I am attaching the R script for the K=5 run, and the resulting plots for the spatial model. Any advice on any of this is very much appreciated!

All the best,

Laura

construct_complex_run_K5.R.zip complex_s_k5_pie.map.chain_1.pdf complex_s_k5_structure.plot.chain_1.pdf complex_s_k5_trace.plots.chain_1.pdf

gbradburd commented 1 week ago

Hi Laura,

1e6 or 2e6 steps is a huge number of iterations. Even for large datasets (several hundred samples), the HMC that Stan runs should be able to converge in thousands (rather than millions) of iterations. I wouldn't recommend going above 1e4 steps.
The analysis time is a function mostly of the number of samples, so you could pare that down by either dropping individuals, or collapsing individuals into locality-level samples. It seems like you're maybe already doing that (given that you say you're using "average allele frequencies per locality")? What is the final dimension of your dataset (i.e., how many samples are you analyzing)? Is it the full 212, or a smaller number?
If you're getting results that don't make sense to you, there are a few possibilities. One, as you noted, is a lack of convergence. That would likely show up as both trace plots within a chain that look like they're not mixing well or are still "going somewhere" and as lack of concordance between independent runs. You mentioned you're running 4 chains - do you see concordance between the independent chains?
An alternative possibility is that there's insufficient signal in the data to effectively parameterize the model. You mentioned you're using ~5e3 SNPs - have you ensured that they're all being used in the analysis? conStruct's default behavior is to drop loci for which any samples have missing data.
A final possibility is that the model is simply a poor fit to your system. For example, if genetic variation is not geographically partitioned, or if geographic distance between samples is not indicative of how organisms in your system perceive distance, you may get nonsensical results.
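
To make the missing-data point concrete, here is a minimal sketch of a "keep only loci with complete data" rule, written in plain Python rather than as conStruct's own R code. The toy matrix and the `complete_loci` helper are hypothetical and only illustrate how quickly loci can disappear when a single missing value is enough to drop a column.

```python
# Toy allele-frequency matrix: rows are samples, columns are loci.
# None marks missing data. All values are made up for illustration.
freqs = [
    [0.0, 0.5, None, 1.0],
    [0.5, None, 0.0, 1.0],
    [1.0, 0.5, 0.5, 0.0],
]


def complete_loci(matrix):
    """Return the column indices that have no missing value in any row."""
    n_loci = len(matrix[0])
    return [
        j for j in range(n_loci)
        if all(row[j] is not None for row in matrix)
    ]


kept = complete_loci(freqs)
print(f"{len(kept)} of {len(freqs[0])} loci retained: {kept}")
```

Running the same count on the real frequency matrix before starting a run shows how many loci would actually enter the analysis.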

Hope that helps! -Gideon

gbradburd commented 1 week ago

Sorry - forgot to add that it's possible that the model is a poor fit to your system because the value of K is too high. Do you get sensical results for smaller values of K?

lcespedesarias commented 5 days ago

Hi Gideon, Thank you so much for your thorough answer! It was all very helpful.

Based on what you have suggested, I will definitely shorten the chain lengths.
Yes, I did collapse individuals into locality-level samples. This led to 64 samples (localities). I am considering redefining the localities, as some are very close to each other and can be merged. I will also look into dropping individuals instead, which would make sense given that my sampling is quite uneven.
I see overall concordance between the chains, although it is hard to tell as the results are a bit nonsensical. But they do not look markedly different.
The possibility of not having enough SNPs sounds very plausible to me: there is definitely a good number of SNPs with missing data for at least one individual. I double-checked, and in the end only 1,428 SNPs are retained. I can definitely see how that would be insufficient given the geographic extent and the "complexity" of the system. I think I can try dividing the dataset into 2 or 3 geographic chunks. This way the chances of a given locus having one individual with missing data should be lower (I think, especially since I believe I have some level of allele dropout). Another option would be to drop individuals with high levels of missing data. Does this make any sense, or do you think the available data are just not enough to run conStruct?
I think, overall, that geographic distance explains genetic variation in this system quite well. I say this based on, for example, PC1 being strongly correlated with latitude, and also on some preliminary evaluations of IBD correlating pairwise FST with geographic distance between localities. I would think that the main issue might be the number of SNPs available with no missing data.
As for your additional point: the results for all K values (including K=2) don't make a lot of sense (which is why the not-enough-SNPs hypothesis sounds plausible too).
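
The preliminary IBD check mentioned above (pairwise FST against geographic distance) can be sketched roughly as follows. This is plain Python, not part of conStruct; the locality names, coordinates, and FST values are invented for illustration, and `haversine_km` and `pearson` are hypothetical helpers. A plain Pearson correlation is used only to show the idea: pairwise values are not independent, so a permutation-based approach such as a Mantel test is commonly used when a significance test is wanted.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth (r = 6371 km)."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical localities (latitude, longitude) and made-up pairwise FST values.
coords = {"A": (4.6, -74.1), "B": (6.2, -75.6), "C": (10.4, -75.5)}
fst = {("A", "B"): 0.02, ("A", "C"): 0.08, ("B", "C"): 0.05}

pairs = sorted(fst)
dist = [haversine_km(*coords[a], *coords[b]) for a, b in pairs]
r = pearson(dist, [fst[p] for p in pairs])
print(f"correlation between distance and FST: {r:.3f}")
```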

Thank you so much for responding so thoroughly and kindly! I will try the things that I mentioned above for now, but if you have any additional comments/suggestions at the moment, they are super welcome!

Thanks!

All the best,

Laura

gbradburd commented 5 days ago

I'm surprised that, even after collapsing individuals into populations, you still are losing ~75% of your SNPs! Although maybe if these are RADseq data and there's a lot of allelic dropout, that's not as surprising? I would definitely recommend dropping individuals with more missing data before scrapping the whole analysis. A thousand SNPs is still a lot of data! And, if you see geographic signal in a PCA or in pairwise Fst, I would guess that there's sufficient information in the data to parameterize the model. Does the inferred pattern of IBD look like it provides a reasonable fit to the data at K=1 for the spatial model?
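
As a rough illustration of the suggestion to drop individuals with more missing data, the sketch below (plain Python, not conStruct code) removes individuals above a missingness threshold and recounts the loci that are complete across everyone who remains. The genotype table, the `MAX_MISSING` threshold, and the helper names are all hypothetical.

```python
# Genotype calls per individual (copies of the counted allele: 0, 1, or 2).
# None marks a missing call. All values are made up for illustration.
genos = {
    "ind1": [0, 1, 2, 1, 0],
    "ind2": [1, None, 2, 0, 0],
    "ind3": [None, None, None, 1, None],
    "ind4": [2, 1, 1, None, 1],
}


def missing_fraction(calls):
    """Fraction of loci with no call for one individual."""
    return sum(c is None for c in calls) / len(calls)


def n_complete_loci(table):
    """Number of loci called in every individual of the table."""
    rows = list(table.values())
    return sum(
        all(row[j] is not None for row in rows)
        for j in range(len(rows[0]))
    )


MAX_MISSING = 0.5  # hypothetical threshold; choose one that suits the data
kept = {name: calls for name, calls in genos.items()
        if missing_fraction(calls) <= MAX_MISSING}

print("complete loci before filtering:", n_complete_loci(genos))
print("complete loci after filtering: ", n_complete_loci(kept))
```

In this toy table a single poorly genotyped individual is enough to leave no complete loci, so removing it recovers most of them.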

lcespedesarias commented 5 days ago

Hi Gideon,

Thank you for your reply! Your comment made me realize that I am dropping a lot of SNPs before collapsing individuals into populations, because I am using very strict missing-data filters (in vcftools, before importing the data into R). I think this is a relic of previous runs, when I was running analyses at the individual level. I will try calling SNPs again with less strict filters, and I expect to lose far fewer SNPs, since the no-missing-data filter would then apply per population rather than per individual.
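
The difference between filtering per individual and per population can be sketched as follows. This is a plain-Python illustration under assumed inputs (diploid genotype calls coded as allele counts, with `None` for missing), not the script attached to this issue; `locality_freqs` is a hypothetical helper. A locus is missing for a locality only if no individual there was called.

```python
# (locality, genotype calls) per individual; calls are copies of the counted
# allele (0, 1, or 2) and None marks a missing call. Values are made up.
samples = [
    ("loc1", [0, None, 2]),
    ("loc1", [1, 2, None]),
    ("loc2", [2, 2, None]),
    ("loc2", [None, 1, None]),
]


def locality_freqs(samples, ploidy=2):
    """Average allele frequency per locality, skipping missing calls."""
    grouped = {}
    for loc, calls in samples:
        grouped.setdefault(loc, []).append(calls)
    freqs = {}
    for loc, rows in grouped.items():
        per_locus = []
        for j in range(len(rows[0])):
            called = [row[j] for row in rows if row[j] is not None]
            # Frequency = allele copies observed / chromosomes sampled.
            per_locus.append(sum(called) / (ploidy * len(called)) if called else None)
        freqs[loc] = per_locus
    return freqs


freqs = locality_freqs(samples)
n_loci = len(samples[0][1])
complete_ind = sum(all(calls[j] is not None for _, calls in samples) for j in range(n_loci))
complete_loc = sum(all(f[j] is not None for f in freqs.values()) for j in range(n_loci))
print("loci complete across individuals:", complete_ind)
print("loci complete across localities: ", complete_loc)
```

One caveat: a locality frequency estimated from only one or two called individuals is noisy, so a minimum number of calls per locality may be worth enforcing.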

To answer your other question: the inferred pattern of IBD does look like it provides a somewhat reasonable fit for K=1, but there is at least one marked discrete break in genetic structure (based on other analyses), so at a glance K=2 seems to fit quite reasonably too. That is a very good point though: I have not tested K=1 in my runs yet, which I should do.

I will definitely keep you updated once the runs with (hopefully) more SNPs are done!

Thanks so much!

Best,

Laura

gbradburd commented 1 day ago

Sounds good! Do you want to keep this issue open, or is it resolved?

lcespedesarias commented 1 day ago

Hi Gideon,

Yeah, I think it is resolved. Thanks so much for all your input!

Best,

Laura

gbradburd / conStruct #70