gbradburd / conStruct

method for modeling continuous and discrete population genetic structure

Chain Does Not Update from Iteration 1 #44

Closed. alexkrohn closed this issue 3 years ago

alexkrohn commented 3 years ago

Hey Gideon,

I'm excited to try out conStruct! I've got a 3RAD dataset of 3297 "unlinked" SNPs (one SNP per RAD locus) from 230 snakes. As usual, I want to determine for a range of Ks whether a spatial model or non-spatial model will best describe the data. (I suspect spatial will be better, as IBD is a strong feature in this dataset.)

My plan is to run conStruct(n.chains = 1, K = 5) (K = 5 was the most likely K from a Structure analysis) to figure out the number of iterations needed to get the MCMC to converge, then run x.validation for a range of Ks, with 20 repetitions per K, using that number of iterations (sketched below).
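For reference, the cross-validation call I have in mind looks roughly like this (a sketch; the training proportion, iteration count, node count, and prefix are placeholder values I'd tune once I know how long a chain needs to converge):

library(conStruct)

# planned cross-validation sketch; train.prop, n.iter, and n.nodes are
# placeholder values, and the prefix is just for this example
my.xvals <- x.validation(train.prop = 0.9,
                         n.reps = 20,
                         K = 1:5,
                         freqs = construct.data,
                         geoDist = geoDist,
                         coords = coords,
                         prefix = "xval-1snplocus",
                         n.iter = 1000,
                         make.figs = FALSE,
                         save.files = FALSE,
                         parallel = TRUE,
                         n.nodes = 4)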

Unfortunately, when running the initial conStruct command, it seems to hang at the first iteration. Here's the output from running conStruct:

k5 <- conStruct(spatial = TRUE,
                K = 5,
                freqs = construct.data,
                geoDist = geoDist,
                coords = coords,
                prefix = "spK5-1snplocus",
                n.chains = 1,
                n.iter = 1000,
                make.figs = TRUE,
                save.files = TRUE)

checking data.block

    reading 230 samples
    reading 2883 loci

checking specified model

user has specified a spatial model with 5 layer(s)

SAMPLING FOR MODEL 'space_multiK' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.156347 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1563.47 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)

1563 seconds is ~26 minutes, but even after an hour nothing has happened: the chain is still at iteration 1/1000 and the percentage hasn't increased. This happens with different subsets of the dataset (up to 12k SNPs), with different individuals, and regardless of whether I set 1k or 25k iterations. With 25k iterations, I left it running for 8 hours and the iteration counter still did not increase. I'm running conStruct on R version 4.1.0 on a Mac.

Any idea why this might be? I'm happy to send data/scripts along if necessary.

I hope you and the family are well, Gideon!

-Alex

alexkrohn commented 3 years ago

Well, I apologize. It does seem to be running correctly -- the 1000-iteration chain took 3 hours to advance to iteration 100. I'm still not sure why it's taking so much longer than the prediction, or why the 25k chain didn't advance at all in 8 hours, but at least it's going.

I assume the only way to reduce the computation time is to reduce the number of individuals. The 2883 loci that conStruct is using already seems like a small enough number. Any other advice?

Thanks,

Alex

alexkrohn commented 3 years ago

@gbradburd Final tally was 65017 seconds (~18 hours) for a 1000-iteration chain, compared to the predicted ~26 minutes!

gbradburd commented 3 years ago

Hey @alexkrohn!

It's surprising that the prediction was off, but not super surprising that it was slow with that many individuals. The limiting computational step is the inversion of the parametric covariance matrix, which grows worse than linearly with the number of samples. So, you could definitely get a lot of time savings by dropping individuals (note, conStruct's analysis time isn't determined by the number of loci). If it's important to include all 230 individuals, my recommendation would be to set up multiple independent analyses for each value of K in parallel on a cluster. I'd probably up the number of iterations (somewhere in the 3e3-5e3 vicinity) per run, and just set it and forget it. Then, you can discard any replicate analyses that obviously failed (if there are any), and/or choose the best run per value of K based on the posterior probability.

Once you have that, you can do model comparison by using layer contributions. It's not quite as formal a model selection mechanism as cross-validation, but it's much faster (you can do it with the normal analyses you've run), and, I've found, more robust in the face of linkage between your loci (which you might not know much about). Happy to chat more if you have follow-up questions!
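In case it's useful, the layer-contribution calculation on a finished run looks roughly like this (a sketch; the file names assume the save.files = TRUE output from your spK5-1snplocus run, and calculate.layer.contribution() is the helper described in the package vignette):

# load the results and data block saved by the K = 5 run
load("spK5-1snplocus_conStruct.results.Robj")
load("spK5-1snplocus_data.block.Robj")

# contribution of each of the 5 layers to total covariance, for chain 1
layer.contributions <- calculate.layer.contribution(
                           conStruct.results = conStruct.results[[1]],
                           data.block = data.block)
layer.contributions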

alexkrohn commented 3 years ago

Thanks for the advice, @gbradburd! I'll try dropping some individuals and will see if that improves things. If not, I'll try your parallelization and posterior probability ideas. Much appreciated!