gbradburd / conStruct

method for modeling continuous and discrete population genetic structure

Divergent transition warning #29

Closed dorseyb closed 3 years ago

dorseyb commented 4 years ago

Hello, I'm hoping you might have some advice for diagnosing this problem. I have a dataset with 9 populations and ~1,600 SNPs, and DAPC analysis finds 5-7 clusters. When I run conStruct with K > 1, I always get the warning about divergent transitions. I've increased the adapt_delta parameter without success. Using the parcoord() function suggested in the Stan manual, I can't identify any single parameter that is clearly associated with the divergences. That said, in most analyses I've run, the ancestry proportions tend to bounce between 0 and 1, with very few samples at intermediate proportions. I can send more details (output, Stan diagnostic plots, etc.) if needed.
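(For reference, this is roughly how I've been checking the divergences; a minimal sketch, assuming conStruct with save.files = TRUE wrote out the rstan fit object. The file name "spK2_model.fit.Robj" and the object name `model.fit` are guesses on my part, so check what your run actually saved.)

```r
library(rstan)
library(bayesplot)

# Assumption: the file below contains the rstan fit saved by conStruct;
# adjust the file/object names to whatever your run produced.
load("spK2_model.fit.Robj")

rstan::check_divergences(model.fit)   # reports post-warmup divergent transitions
rstan::get_num_divergent(model.fit)

# Parallel-coordinates plot with divergent iterations highlighted, as suggested
# in the Stan warnings guide; you may want to subset `pars` if the model has
# many parameters.
np <- nuts_params(model.fit)
mcmc_parcoord(as.array(model.fit), np = np)
```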

Thanks very much for any help, Brian D.

petrelharp commented 4 years ago

One cause of bad sampling behavior is that the model doesn't fit the data, so sanity-checking the SNP frequencies might give some clues. For instance, are most SNPs fixed in some populations and absent in others? How much missing data is there?
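(A rough sketch of that kind of sanity check. It assumes `freqs` is the samples-by-loci allele-frequency matrix passed to conStruct, with NA for missing data, and `pop` is a vector assigning rows to populations; both names are placeholders.)

```r
# Rough sanity check on the allele-frequency matrix passed to conStruct.
# Assumptions: `freqs` is samples x loci with values in [0, 1] and NA for
# missing data; `pop` assigns each row to a population (placeholder names).
missing.by.locus  <- colMeans(is.na(freqs))
missing.by.sample <- rowMeans(is.na(freqs))
summary(missing.by.locus)
summary(missing.by.sample)

# Per-population allele frequencies: how often is a SNP fixed (freq = 1)
# in some populations and absent (freq = 0) in others?
pop.freqs <- apply(freqs, 2, function(x) tapply(x, pop, mean, na.rm = TRUE))
fixed.and.absent <- apply(pop.freqs, 2, function(p) {
  any(p == 0, na.rm = TRUE) && any(p == 1, na.rm = TRUE)
})
mean(fixed.and.absent)   # fraction of loci fixed somewhere and absent elsewhere
```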

gbradburd commented 4 years ago

Hi Brian,

In addition to @petrelharp's comments, I'm curious about the mixing. You say that the ancestry proportions tend to bounce from 0 to 1 with very few samples at intermediate admixture proportions. Do they bounce between 0 and 1 within a single run? Or across different runs? Or do you mean that most samples are not inferred to be admixed within a run? Some traceplots could be helpful in diagnosing this.
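(In case it helps, a sketch of the kind of traceplot I mean, from a saved results object. The exact structure of the posterior slot is from memory, so check str(conStruct.results) before relying on the indexing below.)

```r
# Sketch of admixture-proportion traceplots from a saved conStruct run.
# Assumptions: "spK2_conStruct.results.Robj" loads `conStruct.results`, a list
# with one element per chain, and each chain's posterior stores admixture
# proportions as an (iterations x samples x K) array -- verify with str().
load("spK2_conStruct.results.Robj")

plot.admix.traces <- function(results, sample.index = 1) {
  n.chains <- length(results)
  par(mfrow = c(n.chains, 1), mar = c(3, 4, 2, 1))
  for (i in seq_len(n.chains)) {
    w <- results[[i]]$posterior$admix.proportions[, sample.index, ]
    matplot(w, type = "l", lty = 1, ylim = c(0, 1),
            ylab = "admixture prop.", main = paste("chain", i))
  }
}

plot.admix.traces(conStruct.results, sample.index = 1)
```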

dorseyb commented 4 years ago

Thanks much for the replies.

I have filtered my data so that all loci are present in all populations, no locus has >30% missing data, and no individual has >40% missing data, resulting in ~6.2% missing data overall.

For background, along with the DAPC clustering analysis mentioned above, I find significant isolation by distance (IBD) as measured in adegenet, which suggests a spatial model would be appropriate for my data.

What I see from the traceplots is that the ancestry proportions for different chains find different optima, with some infrequent switching between optima within a single chain. This happens with both K=2 and K=5. Also, the state of the alpha0 parameter is correlated with the state of the w parameter: both switch to different optima at the same time for a given chain.

Thanks very much for any advice! Please let me know if more info would be helpful.

Best, Brian D.

Attachments: spk2.wplot.pdf, spk2.lp_plot.pdf, w.traces.example.k5.pdf, spk5.alpha_w_plot.pdf, spk5.lp_plot.pdf

gbradburd commented 4 years ago

So at K=2, I think what you're seeing is label-switching between independent runs. Basically, the model doesn't care which group is which, so long as all the right individuals draw ancestry from the same groups. As a simple example: if you have 10 individuals, and individuals 1-5 draw 100% of their ancestry from one group while individuals 6-10 draw 100% of theirs from another, then the likelihood of a model in which individuals 1-5 are 100% in Layer 1 and individuals 6-10 are 100% in Layer 2 is the same as the likelihood of a model in which individuals 1-5 are 100% in Layer 2 and individuals 6-10 are 100% in Layer 1. For a more in-depth discussion, see Jakobsson & Rosenberg (2007) in Bioinformatics.

To deal with this when visualizing output, you can use the match.layers.x.runs() function in the conStruct package, which will try to keep layer labels consistent between independent runs with the same (or different) values of K.
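(Roughly like this, using the MAP admixture proportions from two independent runs; the $MAP$admix.proportions slot and the plotting arguments are from memory, so double-check against the package's visualization vignette.)

```r
# Sketch of matching layer labels between two independent runs before plotting.
# Assumptions: `run1.results` and `run2.results` are conStruct.results objects
# and the MAP admixture proportions live in $MAP$admix.proportions -- check
# the conStruct visualization vignette for the exact slots.
library(conStruct)

admix1 <- run1.results[[1]]$MAP$admix.proportions
admix2 <- run2.results[[1]]$MAP$admix.proportions

# order of layers in run 2 that best matches the layers in run 1
layer.order <- match.layers.x.runs(admix1, admix2)

make.structure.plot(admix.proportions = admix1)
make.structure.plot(admix.proportions = admix2, layer.order = layer.order)
```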

For the K=5 plots, it's possible that you're seeing label-switching within a single run (the purple one), which can definitely happen, although it's a little unusual. It's also possible that run just didn't mix well for other reasons. But the other runs look pretty normal, so I'd recommend dropping the purple run and looking at the results from the others (which seem pretty consistent?).

dorseyb commented 4 years ago

Thanks Gideon,

This is actually what I was thinking after reading Oscarred's thread, and it does seem necessary to match the layers. However, after reading the Stan manual's discussion of the divergent-transitions warning (and trying to diagnose it myself without much luck), I'm not sure this will solve that problem. If I understand correctly, the warning indicates that a chain (or chains) is not sampling the posterior effectively in certain regions, so the results are biased and unreliable. That suggests that even if we match the layers across chains, the posterior sample is still not representative of the actual distribution. I've tried adjusting the adapt_delta parameter (up to 0.999) with no effect. Do you have any experience with this situation?

Thanks! Brian

gbradburd commented 4 years ago

Getting divergent transitions is a somewhat common issue with mixture models implemented in Stan. In a mixture model, the posterior can be multi-modal (because of label-switching; see the longer explanation above). This multi-modality can generate divergent transitions, which happen when the gradient-based trajectory Stan simulates fails to track the posterior surface accurately. So all of that is just to say that there may not be anything wrong with your runs beyond the problems inherent to running a mixture model. If, across multiple independent runs at the same value of K, you see convergence in the parameter estimates and the log posterior probability after polarizing the admixture proportions to be comparable across runs, I would say you can trust those results and not worry too much about the divergent transitions.
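(One way to do that kind of cross-run check; the $posterior$lpd and $MAP$admix.proportions slots are from memory, so verify them with str() on your own results objects.)

```r
# Sketch of a cross-run convergence check after polarizing the layers.
# Assumptions about the results-object structure ($posterior$lpd,
# $MAP$admix.proportions) should be verified against your own runs.
library(conStruct)

# 1) Do the log posterior densities of independent runs overlap?
lpd1 <- run1.results[[1]]$posterior$lpd
lpd2 <- run2.results[[1]]$posterior$lpd
plot(lpd1, type = "l", ylim = range(c(lpd1, lpd2)), ylab = "log posterior")
lines(lpd2, col = "red")

# 2) After matching layer labels, are the admixture proportions consistent?
admix1 <- run1.results[[1]]$MAP$admix.proportions
admix2 <- run2.results[[1]]$MAP$admix.proportions
layer.order <- match.layers.x.runs(admix1, admix2)
plot(as.vector(admix1), as.vector(admix2[, layer.order]),
     xlab = "run 1 admixture", ylab = "run 2 admixture (relabeled)")
abline(0, 1, col = "gray")

# The package also has compare.two.runs(), which produces a similar set of
# comparison figures directly from two results objects and their data blocks.
```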

gbradburd commented 4 years ago

Hi Brian - is this issue still unresolved for you? If it's resolved, I'll go ahead and close it out.