gbradburd / conStruct

method for modeling continuous and discrete population genetic structure

Acceptable proportion of unacceptable trace plots in xval?? #71

Closed ClaudHGE closed 2 weeks ago

ClaudHGE commented 3 weeks ago

Dear Gideon,

I hope you are well,

I'm researching population structure in two (?) Eucalypt species, and there are good reasons to believe that isolation by distance contributes to the genetic structure. However, discrete structure is also likely. I've been working with the conStruct package for a while now, but I can't reach any conclusions because I'm uncertain whether the models are running well. After reading most of the issues raised here, I realized I don't need to worry too much about high R-hats, divergent transitions, and tree depth warnings, but I understood that it is key to check the PDF trace plots that the function creates. Regarding the cross-validation analysis, which can incorporate multiple repetitions: how many repetitions need to show an acceptable trace plot for the analysis to be valid? And does that apply across all values of K as well (?). I've been running 3 to 8 repetitions, from 1,000 to 10,000 iterations, with adapt_delta from 0.8 to 0.9, but I believe that many of the trace plots don't look good.

The latest run I did:

```r
control_params <- list(adapt_delta = 0.9)

xvals <- x.validation(train.prop = 0.9,
                      n.reps = 3,        # I reduced the number of repetitions to reduce the computation time
                      K = 1:8,
                      freqs = allele.frequencies,  # zero missing values; also pruned by LD, HWE, SNP quality, secondaries, and more
                      geoDist = geoDist,
                      coords = coords,
                      prefix = "xval_3x1to8x1e4_d90",
                      n.iter = 10000,
                      data.partitions = NULL,
                      parallel = TRUE,
                      n.nodes = NULL,
                      init = "random",
                      control = control_params,
                      cores = 28)
```

The cross-validation graph looked alright to me (in some runs the points are much more unstable). Apparently, the differences are not significant (t = 1.9772, df = 2, p-value = 0.09332). I've attached some of the trace plots from the xval analysis. Do you think they are acceptable? Or could you please give me suggestions regarding the arguments of the function, or whether I may accept these results?

[cross-validation plot image]

Attached trace plots:
- xval_3x1to8x1e4_d90_sp_rep1K8_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_sp_rep2K1_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_sp_rep2K5_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_sp_rep2K8_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_sp_rep1K5_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_nsp_rep1K5_trace.plots.chain_1.pdf
- xval_3x1to8x1e4_d90_nsp_rep2K1_trace.plots.chain_1.pdf

Example of a previous (unstable?) xval plot: 9,000 iterations, but one of the workers failed. [image]

Another example with very different behaviour at K = 5 (5,000 iterations). [image]

Last one: only 1,000 iterations. [image]

I really need some advice, as I've been stuck on this for too long.

Thank you for all the previous responses and thanks in advance for this one!

Kind regards,
Claudia H. Giraldo E.
PhD Candidate, Unimelb

gbradburd commented 3 weeks ago

Hi Claudia,

So, it seems like you have a few different questions. I've summarized what I think they are and responded below. Please let me know if I missed any of your points.

1) Is the mixing in those PDFs you attached acceptable?

Yes, they look good to me! The trace plots for the posterior probability and all the model parameters look like "fuzzy caterpillars" and don't seem to still be "going somewhere" (i.e., trending in a direction). So, I'd say it's reasonable to assume you've converged on the stationary distribution. What about them looked bad to you?
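In case it's useful, here's a minimal sketch of how you could pull the log-posterior trace back out of a saved run and eyeball stationarity yourself. The filename below is hypothetical (built from your prefix), and the `$chain_1$posterior$lpd` slot names are an assumption about how the saved results object is organized in your version of the package; check `str(conStruct.results)` on your own output first.

```r
library(conStruct)

# Hypothetical filename based on your prefix; adjust to your own output.
load("xval_3x1to8x1e4_d90_sp_rep1K8_conStruct.results.Robj")

# Assumed slot names: per-iteration log posterior for chain 1.
lpd <- conStruct.results$chain_1$posterior$lpd

# A converged chain should look like a "fuzzy caterpillar":
# stationary noise around a stable value, with no sustained trend.
plot(lpd, type = "l",
     xlab = "sampled iteration", ylab = "log posterior",
     main = "posterior trace, chain 1")

# Rough check: compare the means of the first and second halves of the trace;
# a large difference suggests the chain is still trending.
half <- floor(length(lpd) / 2)
mean(lpd[1:half]) - mean(lpd[(half + 1):length(lpd)])
```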

2) Is it better to do more replicates vs. more MCMC iterations per replicate in a cross-validation analysis?

With K-fold cross-validation (which is what is implemented in conStruct's x.validation function), you generally get less noise in your estimates of predictive accuracy as you increase the number of replicates. So, I'd recommend doing somewhere between 5 and 15 replicates and picking a number of MCMC iterations that gives you good mixing within each replicate. My guess would be that you could get away with fewer than 1e4 iterations, but I haven't seen the MCMC output for other runs with fewer iterations, so I can't say that with certainty.
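Concretely, that just means rebalancing the call you posted above toward more replicates and fewer iterations per replicate. A sketch, reusing your own arguments; the specific numbers (10 replicates, 4,000 iterations) and the prefix are illustrative, not a prescription:

```r
control_params <- list(adapt_delta = 0.9)

xvals <- x.validation(train.prop = 0.9,
                      n.reps = 10,       # more replicates -> less noise in predictive accuracy
                      K = 1:8,
                      freqs = allele.frequencies,
                      geoDist = geoDist,
                      coords = coords,
                      prefix = "xval_10x1to8x4e3_d90",  # illustrative label
                      n.iter = 4000,     # fewer iterations; confirm the trace plots still look stationary
                      data.partitions = NULL,
                      parallel = TRUE,
                      n.nodes = NULL,
                      init = "random",
                      control = control_params,
                      cores = 28)
```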

3) Why is there variation between cross-validation analyses?

In general, the more replicates you do, the less noise there will be, although there will always be some stochasticity in your results. The 2nd and 3rd cross-validation plots you show look pretty similar to me. The first one looks different, but that's maybe not surprising given that you ran it with only 3 replicates. I would guess that your different analyses are generating different results because, with a small number of replicates, each analysis is generating a noisy estimate of the model's ability to predict out-of-sample variation in the data. If you increase the number of replicates, I'd expect parallel analyses to give you more similar results.
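One way to see how much replicate-to-replicate noise you have is to summarize the predictive accuracies across replicates, along the lines of the model comparison vignette. A rough sketch, assuming x.validation wrote a file named "<prefix>_sp_xval_results.txt" with one row per value of K and one column per replicate (check the files your run actually produced):

```r
# Read the spatial-model cross-validation results (assumed layout: rows = K, columns = replicates).
sp.results <- as.matrix(read.table("xval_3x1to8x1e4_d90_sp_xval_results.txt",
                                   header = TRUE, stringsAsFactors = FALSE))

# Mean predictive accuracy per K, and its standard error across replicates.
sp.mean <- rowMeans(sp.results)
sp.se   <- apply(sp.results, 1, sd) / sqrt(ncol(sp.results))

# Plot means with approximate +/- 2 SE bars; wide bars relative to the
# differences between K values mean you need more replicates.
plot(1:nrow(sp.results), sp.mean, pch = 19,
     ylim = range(c(sp.mean - 2 * sp.se, sp.mean + 2 * sp.se)),
     xlab = "K", ylab = "predictive accuracy",
     main = "spatial model: mean +/- ~2 SE across replicates")
segments(1:nrow(sp.results), sp.mean - 2 * sp.se,
         1:nrow(sp.results), sp.mean + 2 * sp.se)
```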

4) More generally, how should you go about doing model comparison using conStruct?

As I emphasize in the conStruct model comparison vignette, there's a difference between statistical significance and biological significance. You might see strong statistical support for a model with a higher value of K, but that model might not make any biological sense. In general, I recommend treating the output of cross-validation analyses in conStruct with a grain of salt, and complementing them with a look at the layer contributions to see how much the additional layers are really contributing to the model's attempt to describe the data.
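For a single saved run, checking layer contributions looks roughly like the sketch below (following the pattern in the model comparison vignette). The filenames are hypothetical and depend on the prefix you used; conStruct saves a "<prefix>_conStruct.results.Robj" and "<prefix>_data.block.Robj" when save.files = TRUE.

```r
library(conStruct)

# Hypothetical filenames; substitute the prefix from your own conStruct run.
load("my_spK5_conStruct.results.Robj")
load("my_spK5_data.block.Robj")

# Contribution of each of the K layers to the total covariance.
# Layers contributing ~0 are doing little to describe the data,
# whatever the cross-validation says.
layer.contributions <- calculate.layer.contribution(
    conStruct.results = conStruct.results[[1]],
    data.block = data.block)

barplot(layer.contributions,
        xlab = "layer", ylab = "layer contribution")
```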

Hope that helps!

ClaudHGE commented 2 weeks ago

Hi Gideon, I am immensely grateful for your prompt and thorough response. I will rerun the cross-validation with more replicates. One of the runs got stuck running for days, so I decreased the number of replicates, but I believe that's atypical and I'll try again.

Regarding the trace plots: I was concerned that some of them drop at some point, but they actually recover and remain stationary, so I shouldn't worry about this any more.

I didn't mean to close the issue, but it may be closed for now.

Many thanks, Claudia