gbradburd / conStruct

method for modeling continuous and discrete population genetic structure

Questions about results interpretation #54

Closed. quinn-ca closed this issue 1 year ago.

quinn-ca commented 1 year ago

Hello again, thank you/apologies in advance for the number of questions. I greatly appreciate any guidance you can provide!

  1. The paper says to use unlinked loci; I have ddRAD SNP data and retained only a single SNP per locus. Should I also consider pruning for LD? The ADMIXTURE manual suggests pruning for LD using plink, removing SNPs where r^2 > 0.50.
     a. Number of SNPs before pruning: 29,556
     b. Number of SNPs after pruning: 22,966
  2. Layer contributions
     a. When K = 3, the third layer has an extremely small contribution (~0.1%). However, at K = 4, the fourth layer contributes moderately. Is this worth considering, or should I 'stop' at the first value of K at which layers begin to contribute marginally? (see Picture 1)
     b. I'm seeing increasing subdivision among layers at increasing values of K in my spatial model, but less so in the non-spatial model. How might I interpret this? The paper addresses spurious clusters in non-spatial models, which arise because the model tries to partition clinal differences into discrete groups, but I'm less sure how to interpret the opposite case, which I seem to have (see Pictures 1 and 2).
     c. Based on an earlier question (https://github.com/gbradburd/conStruct/issues/48), I reviewed the phi values for the spatial models at K = 3 and K = 4, using conStruct.results$chain_1$MAP$layer.params$layer_k (see the sketch after this list):
        Phi values for the K = 3 spatial model: layer 1: 2.41e-04, layer 2: 7.53e-05, layer 3: 5.07e-04
        Phi values for the K = 4 spatial model: layer 1: 1.64e-05, layer 2: 4.09e-05, layer 3: 6.58e-05, layer 4: 9.36e-05

     d. Are individual layers consistent across values of K for a given model? For example, the contribution of layer 3 (at both K = 3 and K = 4) is ~0.1%, while layer 4 contributes 27% at K = 4. I used the function match.layers.x.runs prior to plotting the layer contributions; by doing so, I take it that, yes, layer 3 is the same across all values of K for which it was assessed.

  3. Layer covariance curves
     a. Is this plot mostly useful for visualizing isolation by distance among my layers? Are there other important patterns I should be identifying?
     b. Looking at the graph of allelic covariance against distance, it looks like the spurious third layer at K = 3 is driven by a single data point (Picture 3); spurious in that it has such a marginal contribution. Combined with the small contribution of this layer, it seems that K = 3 is not biologically meaningful for describing my data; following point 2a above, should I give credence to K = 4 then? What matrix is used to build this graph, and can I look up which data point is driving the third layer? I see a point at approximately the same value in each of your figures as well; is this the allelic covariance of each sample with itself? I looked up one potentially applicable matrix (conStruct.results$chain_1$MAP$par.cov), and each sample, when compared with itself, had approximately this value; I'm not sure whether this is what is being graphed, however.
     c. If my layer covariance curves are mostly overlapping, how should I interpret this (Picture 3; layers 1 and 2)?
     d. Additionally, although the CV predictive accuracy of the spatial model outperformed the non-spatial model (indicative of IBD; Picture 4), the difference was marginal at K = 2, and I don't see a decay in covariance with distance in the layer covariance curves (Picture 3). Is this indicative of IBD in my system, then?
     e. Can I edit the output of the graphs? In the paper, for example, the data points on the layer covariance curves are color-coded.
  4. Compare two runs
     a. A naive R question, I think: how can I load two results datasets? I ran into trouble trying to load results and assign the object a name, so I've only been able to load a single object at a time.
  5. How is variability among replicates for each value of K summarized?
     a. For example, pie charts for a model at the same value of K look different among replicates. The code to graph pie charts (from the vignette) uses the 'results' object and MAP to plot the pie chart; if I understand correctly, this graphs the results with the maximum posterior probability across the replicates?
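
For reference, here is roughly how I pulled the phi values in 2c (a sketch assuming the results-object structure from issue #48, where each layer_k entry in layer.params carries a phi estimate; the field name may differ across conStruct versions):

    load("K3_sp_conStruct.results.Robj")
    # one phi estimate per layer in the MAP parameter set
    sapply(conStruct.results$chain_1$MAP$layer.params,
           function(lyr) lyr$phi)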

Please let me know if you need any clarification on what I've asked above, and thanks again. I've also attached these questions in a Word document in case the formatting doesn't transfer well.

Questions_conStruct_Results.docx

Picture1.pdf Picture2.pdf Picture3.pdf Picture4.pdf

gbradburd commented 1 year ago

Hi Quinn,

  1. Pruning for LD here is a little tricky. Ideally, you want to prune for statistical LD due to physical linkage (as including closely physically linked loci will lead to pseudoreplication in your data). But, without a genome, it can be tough to tell how much statistical LD between loci is due to physical proximity and how much is due to population structure (which is a signal you care about). So you definitely want to prune down to one SNP per RAD locus, but beyond that, I don't have a strong opinion about whether you should additionally prune for LD. In your case, it seems like you'd still have plenty of data after LD-pruning, so I'd probably try my analyses with and without pruning for LD and see how it affects things (hopefully not much!).

2a. I would probably stop at K = 2 if the third layer at K = 3 contributes such a small amount.
2b. For Pictures 1 and 2, which is spatial and which is non-spatial?
2c. Was this just supporting information for 2a and 2b, or is it its own question?
2d. If you matched layers across runs before plotting layer contributions, then layer contributions will be consistent across values of K.
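
If it's useful, here is a rough sketch of that matching step (assuming you've loaded the two runs into k3out and k4out, per the loading recipe below):

    # order the K = 4 layers to best match the K = 3 layers before
    # comparing their contributions
    ap3 <- k3out$chain_1$MAP$admix.proportions
    ap4 <- k4out$chain_1$MAP$admix.proportions
    layer.order <- match.layers.x.runs(ap3, ap4)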

3a. Yes, this is mostly useful for visualization. Sometimes it's neat to check whether the shape of the curves in different layers matches your expectations (like we saw in the conStruct paper comparing results for Populus trichocarpa and P. balsamifera).
3b. I believe you're misreading the plot. The "outlier point" you're referring to is actually the diagonal of the sample allelic covariance matrix, which describes each sample's relatedness to itself and which is automatically fixed at 0.25. Remember that each of the layer-specific covariance curves is weighted by its admixture proportions in order to describe each observation of pairwise relatedness. That's why the "model.fit" plots (hopefully) look good even though the layer covariance curves might not model the data particularly well on their own.
3c. If your layer covariance curves are overlapping, that indicates that each layer displays a similar rate of isolation by distance.
3d. As we note in the paper, there's a difference between statistical and biological significance. Your cross-validation results show strong and unambiguous support for the spatial K = 3 model over all other models. But maybe that statistical support comes at the expense of the biology, and the spatial K = 2 model makes a lot more sense given what you know about your system. One note: the IBD curves within each layer may be hard to see at the scale of the x- and y-axes at which you're viewing them. That is, there may be meaningful IBD within the red and blue layers, but you just can't see it on that particular plot.
3e. There's no way to edit the output of these graphs using the built-in functions in the R package, but you can certainly use the code as a starting point to make whatever plots you wish! If you want to see inside the code of a particular function in the conStruct package, you can either type its name into your console without parentheses (e.g., make.all.the.plots) or, if it's not an exported function, access it using the ::: syntax (e.g., conStruct:::plot.layer.covariances).
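
You can check this directly; a quick sketch, assuming par.cov is the MAP parametric covariance you mention above:

    # the diagonal (each sample against itself) should sit near the
    # fixed value of 0.25
    round(diag(conStruct.results$chain_1$MAP$par.cov), 3)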

  4. You can load multiple R objects by loading one, assigning it to a new variable, then loading the next, e.g.:

    load("K1_sp_conStruct.results.Robj")
    k1out <- conStruct.results
    load("K2_sp_conStruct.results.Robj")
    k2out <- conStruct.results
    load("K2_sp_conStruct.results.Robj")
    # etc.
  5. Yes, the make.admix.pie.plot function (as called from inside make.all.the.plots) visualizes the MAP admixture proportions for each chain. If you want to look at heterogeneity in the MAP estimates across chains, you can make admixture pie plots (or structure-style plots) for each independent chain, as in the sketch below.
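
A sketch of per-chain pie plots (assuming coords is the coordinate matrix you used in the run):

    # one pie plot per independent chain, to eyeball heterogeneity
    # in the MAP estimates
    for (chain in names(conStruct.results)) {
        make.admix.pie.plot(admix.proportions = conStruct.results[[chain]]$MAP$admix.proportions,
                            coords = coords)
    }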

Hope that helps!

quinn-ca commented 1 year ago

Thank you for your helpful answers and clarifications!

  1. I compared results between the full and pruned datasets. As seen in the CV results, predictive accuracy improved somewhat in the pruned dataset (Picture 5), and the value of K with the greatest predictive accuracy increased from K = 3 (full dataset) to K = 4 (pruned dataset). The spatial model has greater predictive accuracy in both datasets, and this difference is smallest at K = 2. The contribution of layer 3 in the spatial model (when K = 3) is greater in the pruned dataset, compared to its marginal contribution in the full dataset (Picture 6). In both datasets, I see increasing subdivision into layers in the spatial model (Picture 6) compared to the non-spatial model (Picture 7). The results from the two datasets (full vs. pruned) are more similar for the non-spatial models than for the spatial models. I do see different geographic distributions of the layers in the pie charts between the full and pruned datasets in the spatial model (less noticeable at K = 2, but more prominent at K = 3; Pictures 8-9).

The non-spatial results are most similar to what I have found with PCA and pairwise Fst analyses. The southwest site clustered by itself in the PCA, with the other sites overlapping. Pairwise Fst showed the biggest differences between the southwest site and all others; pairwise Fst between the two eastern sites was not significant. Notably, the two northwest sites were combined for the pairwise Fst estimates because their combined sample size is only 19 individuals.

  2. Apologies: Picture 1 shows results for the spatial model and Picture 2 for the non-spatial model. The phi values were provided for additional context.

3b. Your answer about the "outlier" being the diagonal of the sample allelic covariance matrix makes sense and is what I had started to uncover when I looked at the matrix itself; having it confirmed is very helpful, thanks! The model fit plots seem to look good for the spatial model at K = 2 (Picture 10), although there may be some deviations at K = 3 (Picture 11). The model fit for the pruned dataset may be poor at K = 2 (Picture 12), and particularly so at K = 3 (Picture 13). For comparison, I've also included model fit graphs for the full-dataset, non-spatial model at K = 2 (Picture 14). To confirm, I'm assessing these graphs to ensure there aren't patterns across the MCMC chains, correct? Are there other deviations in the model fit plots I should be assessing?

  4. Perfect, thank you! I didn't think to load the object and then assign 'conStruct.results' to a new variable.

I really appreciate your help! I’m meeting with my committee member soon to discuss my results, but they aren’t familiar with this program, so I want to make sure I have a solid interpretation to build our discussion on.

Quinn

Picture5 Picture6 Picture7 Picture8 Picture9 Picture10 Picture11 Picture12 Picture13 Picture14

gbradburd commented 1 year ago

To confirm, I’m assessing these graphs to ensure there aren’t patterns across the MCMC chains, correct? Are there other deviations in the model fit plots I should be assessing?

The "model fit" plots are specifically the "Cov/geoDist" plots (in the lower RH corner of several of the uploaded figures). The other plots output for MCMC diagnosis show parameter trace plots (estimates of the parameter value over the duration of the MCMC run). Ideally, the "model fit" plots show that the model is able to generate a parametric covariance matrix that is similar to (and captures the major trends of) the empirical data. And the traceplots should look like "fuzzy caterpillars" that show consistent results across independent runs of the same model (i.e., their converging on the same stationary distribution).

gbradburd commented 1 year ago

Hi Quinn,

Just doing some book-keeping here. Are you ok to close this issue?

gbradburd commented 1 year ago

Hi Quinn,

I'm going to close the issue, but feel free to re-open if you have further questions/concerns on this point.

Dahn-YoungDong commented 1 year ago

@gbradburd Hello, are independent runs of the same model called "chains" or "replicates"?

In the main conStruct function, there is an option for setting the number of chains, but not the number of replicates, whereas in the x.validation function, there is only an option for setting the number of replicates, not the number of chains.

Are chains and replicates the same thing? If not, how are they different? And which should I use to assess what you described in the vignette ("Ideally, multiple independent runs converge on the same stationary distribution, with similar parameter estimates and posterior probabilities")?

Best, Dahn-young

gbradburd commented 1 year ago

Hi Dahn-young,

Chains and replicates are different things in this context. A "chain" here is used as it's defined in rstan: an independent MCMC analysis. A "replicate" is a cross-validation replicate. The cross-validation procedure is discussed in depth in the "Cross-validation Procedure" section of the conStruct paper appendix.

If that's the section of the vignette you're reading, it sounds like you want to run multiple independent chains for a particular conStruct analysis.
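
As a sketch, that recommendation translates to something like the following (argument values are placeholders):

    # run several independent MCMC chains for one spatial model at K = 2,
    # then compare their trace plots and MAP estimates for convergence
    my.run <- conStruct(spatial = TRUE, K = 2,
                        freqs = freqs, geoDist = geoDist, coords = coords,
                        prefix = "spK2", n.chains = 3, n.iter = 5000)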

Hope that helps! -gideon