CenterForStatistics-UGent / RCM

A model-based visualization method for microbiome data

issues with convergence and limited number of samples shown in constrained analyses #13

Closed FMKerckhof closed 3 years ago

FMKerckhof commented 4 years ago

Dear, I do not know if this is related to #7, but on a real-life dataset (analysed using DADA2, 551 ASVs over 96 samples) I am getting some unexpected output from the constrained RCM ordination starting from a phyloseq object, whereas the unconstrained ordination works without any issues. The resulting object of the constrained ordination also lacks the $rMat and $cMat slots, making further plotting difficult.

With default options, the following error is returned when running the constrained RCM: Error in nleqslv(initParam[, i], reg = design, fn = dNBllcol_constr, theta = thetas[i], : non-finite value for x[1] supplied to function

Since this appears to be an nleqslv issue, I tried adjusting the nleqslv.control options. When calling the constrained RCM with the adjusted parameters, a solution appears to be found (that is, no errors or warnings are returned):

RCM_2_lin <- RCM(phyloseqtable_2, covariates = c("Factor1", "Factor2", "Factor3"),
                 responseFun = "linear", verbose = TRUE, allowMissingness = TRUE,
                 nleqslv.control = list(maxit = 1e3L, cndtol = 1e-18))
(...) output truncated (...)
Outer Iteration 29 

Old psi-estimate:  35.48689 
New psi-estimate:  35.48875 

 Estimating psis (k = 2) 

 Estimating response function 

 Estimating environmental gradient 

However, when plotting this object, only 9 out of the 96 samples are shown, exactly the number of distinct combinations of the three factors, and the object does not have a cMat or rMat slot for me to try to recreate the plot myself. Any pointers on 1) how to select the nleqslv options, and 2) what could be going on in the plot?

JoFAM commented 4 years ago

@sthawinke I've looked at it but feel this is a bit out of my league. Do you have time to take a look at what might go wrong?

sthawinke commented 4 years ago

Sorry for the late response, please let me know if this answers your questions.

FMKerckhof commented 4 years ago

Dear @sthawinke, it probably is indeed a problem-specific issue; I have had multiple colleagues run into the same problem of the constrained ordination not working "as expected" on a real-life dataset. extractCoord does appear to work. I guess we are all used to CCA triplots, which generally do show all the samples, since CCA works in a different way than RCM. To summarise, if I understand correctly: if I want to see how my samples are ordinated in a constrained RCM framework, I need to supply constraints with as many combinatorial levels as I have samples? I think I remember reading in Numerical Ecology with R (Borcard, Gillet & Legendre) that adding too many constraints to a constrained ordination is no different from having an unconstrained ordination (not unlike "overfitting"). What is your take on this?

sthawinke commented 4 years ago

The problem here is that all your variables are factors, leading to a limited number of combinations. Adding a single continuous variable would already solve this, as such a variable usually has as many unique values as there are samples. I agree that you need to supply constraints with as many combinatorial levels as you have samples if you want to separate all samples, but that is by no means obligatory.

The link between unconstrained and constrained analyses is that a constrained analysis with a single factor taking unique values for each sample equals an unconstrained analysis. But for your dataset with 96 samples, even an ordination with e.g. three factor variables and one continuous one would be meaningful, as it shows which of these contributes most to the sample ordination. You are having the opposite problem: by providing too little room for variability in the variables supplied, the ordination is too constrained. I think (but am not sure) that the numerical problems are caused by this too. Adding variables actually makes the problem less constrained.
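For illustration, a minimal sketch of such a call, reusing the object and factor names from the call earlier in this thread; "Time" stands in for any continuous covariate and is a hypothetical name:

# Constrained RCM with three factors plus one continuous covariate (sketch);
# the continuous variable typically takes unique values per sample, giving the
# environmental gradient enough room to separate the samples
RCM_mixed <- RCM(phyloseqtable_2,
                 covariates = c("Factor1", "Factor2", "Factor3", "Time"),
                 responseFun = "linear", verbose = TRUE)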

For CCA (and other constrained methods) I would expect the same problem for your case actually, have you tried it?

kdpaepe commented 4 years ago

Many thanks for the swift response!

For CCA (and other constrained methods) I would expect the same problem for your case actually, have you tried it?

We have not tried CCA, but RDA gives a nice separation of all samples (see figure below) without providing a continuous variable. (In this specific case the continuous variable would be time, which we are not really interested in; supplying it generates as many combinatorial levels as there are samples.)

[Image: RDA triplot separating all samples]

Another issue presents itself when adding the continuous variable, as one of the other design factors we are interested in encodes the effect of a spike-in performed at a certain timepoint. Hence, there is some overlap in information, and this seems to result in the following error message: 'Error in checkAlias(datFrame, covariatesNames) : Sample variables are aliased with other variables. Drop some sample-variables and try again.'

Removing the spike-in variable resolves the issue, but then I run into another one: 'Error in nleqslv(initParam[, i], reg = design, fn = dNBllcol_constr, theta = thetas[i], : non-finite value for x[1] supplied to function'. I am not sure what the underlying problem is this time; it only shows up after iteration round 6.

sthawinke commented 4 years ago

It took me a while to realize, but I guess the sample scores you plotted are weighted species scores rather than linear combinations of sample variables (see also ?vegan::scores.cca in R, the latter choice would also lead to overlap). As a result, they do not directly relate to the ordination of the sample variables you plotted and hence it is not clear to me how this plot should be read. The intuitive "biplot" principle is lost. This trick is also sometimes used to add species scores to PCoA plots, where it is equally difficult to interpret.

Some variables are aliased, i.e. some are a linear combination of others, a perfect overlap in information as you say. Is the time variable correctly interpreted as continuous? And how many unique values does it take? If it is seen as a factor and its levels only correspond to pre or post spike, this can indeed occur, and I agree with removing the less specific spike-in variable.

This is a numerical issue, the algorithm struggles with the high number of samples/low number of constraints problem. I will try to fix this tomorrow, by allowing you to modify the starting values a and b as for the unconstrained case, but no guarantee that this will work.

kdpaepe commented 4 years ago

It took me a while to realize, but I guess the sample scores you plotted are weighted species scores rather than linear combinations of sample variables (see also ?vegan::scores.cca in R, the latter choice would also lead to overlap). As a result, they do not directly relate to the ordination of the sample variables you plotted and hence it is not clear to me how this plot should be read. The intuitive "biplot" principle is lost. This trick is also sometimes used to add species scores to PCoA plots, where it is equally difficult to interpret.

Sorry, I should have provided you with more information. Some more details: I used the 'wa' (weighted averages) option and type II scaling to generate the above plot.

From Borcard 2011, Numerical Ecology with R: The two RDA triplots use the site scores that are weighted sums of species (abbreviated "wa" in vegan). The choice between these and the fitted site scores (abbreviated "lc") for the triplots is still controversial. On the one hand, the fitted site scores are strictly orthogonal linear combinations of the explanatory variables; but they nevertheless represent clearly and exclusively what can be modelled using the explanatory variables at hand. On the other hand, the site scores that are weighted sums of species appear more robust to noise in the environmental variables: McCune (1997) showed that if the latter contain much random error, the resulting lc plots may be completely scrambled. However, weighted sums of species (wa) scores are "contaminated" scores, halfway between the model fitted by the RDA procedure and the original data, and as such it is not entirely clear how they should be interpreted. When RDA is used as a form of analysis of variance (Sect. 6.3.2.8), all replicate sites with the same combination of factor levels are represented on top of one another in the fitted site scores (lc) triplot; the weighted sums of species (wa) triplot is preferable in that case because the sites are separated in the plot and their labels can be read.

Some variables are aliased, i.e. some are a linear combination of others, a perfect overlap in information as you say. Is the time variable correctly interpreted as continuous? And how many unique values does it take? If it is seen as a factor and its levels only correspond to pre or post spike, this can indeed occur, and I agree with removing the less specific spike-in variable.

The spike-in variable can be seen as a grouping of timepoints (pre and post spike-in), and it is this variable I am mainly interested in (together with some other variables), not so much the continuous time variable (which is numeric and can take 15 different values). I would prefer to keep the spike-in variable in the model so as to quantify its effect and significance on the community (e.g. as in RDA, with permutation tests and R2adj), both of which are clear from the PCoA/unconstrained RCM model.

This is a numerical issue, the algorithm struggles with the high number of samples/low number of constraints problem. I will try to fix this tomorrow, by allowing you to modify the starting values a and b as for the unconstrained case, but no guarantee that this will work.

Thank you very much for your help!

sthawinke commented 4 years ago

Thanks for the clarification. I am glad to be on the same page with Borcard that the wa-scores are hard to interpret. I pondered whether to include this option (i.e. wa-scores) in the package, but decided against it because

  1. The biplot principle does not hold, and these plots have no mathematical interpretation
  2. Matters are further complicated by the fact that features in the unconstrained RC(M) model with linear response functions are arrows with unique origins.

You could do this yourself though, by extracting the species arrow coordinates with extractCoord() and then calculating the weighted average of their starting and end points for every sample, to obtain wa-scores (or rather wa-arrows) unique to each sample. Still, as before, it is not clear what these plots really mean.
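A minimal sketch of that recipe (the $species component name and its column names are assumptions based on the coordinates shown later in this thread; counts stands for the samples-by-taxa count matrix, with taxa in the same order as the coordinates):

coords <- extractCoord(RCM_2_lin)                    # fitted constrained RCM object
ends <- as.matrix(coords$species[, c("end1", "end2")])
weights <- counts / rowSums(counts)                  # relative abundances as weights
waEnds <- weights %*% ends                           # one weighted-average end point per sample
# repeat with c("origin1", "origin2") for the arrow starting points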

I understand the demand for unique sample scores, but it is a bit contradictory to the idea underlying constrained analysis, whereby samples are only discriminated through the sample variables provided. You may want to use the unconstrained plot and colour the samples by the most important variables perhaps?

The spike-in variable can be seen as a grouping of timepoints (pre and post spike-in), and it is this variable I am mainly interested in (together with some other variables), not so much the continuous time variable (which is numeric and can take 15 different values). I would prefer to keep the spike-in variable in the model so as to quantify its effect and significance on the community (e.g. as in RDA, with permutation tests and R2adj), both of which are clear from the PCoA/unconstrained RCM model.

If the variable is continuous, it should be possible to include it in the model without the alias problem. This would also alleviate the problem of overlapping sample scores. Also, did you share the full error message? I would expect it to say which variable is aliased. We should be able to solve this.
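In the meantime, one way to check for aliasing yourself with base R (a sketch; meta and the covariate names are placeholders for your sample data frame and variables):

# rank deficiency of the design matrix signals aliased variables
mm <- model.matrix(~ Time_h + Spike + Nutrient_condition, data = meta)
qr(mm)$rank < ncol(mm)   # TRUE means some columns are linear combinations of others
# alias() on a throwaway lm() fit names the offending columns
alias(lm(rnorm(nrow(meta)) ~ Time_h + Spike + Nutrient_condition, data = meta))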

I will push the update later today.

sthawinke commented 4 years ago

@kdpaepe I pushed the new version, you can try playing around with the rowExp (and colExp) parameters to RCM() to see if this solves matters. I would try values between 0.5 (the default) and 1. Can you let me know if this worked?
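For instance (a sketch reusing the call from the start of this thread):

RCM_2_lin <- RCM(phyloseqtable_2,
                 covariates = c("Factor1", "Factor2", "Factor3"),
                 responseFun = "linear", verbose = TRUE,
                 rowExp = 0.75)   # default is 0.5; colExp defaults to rowExp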

kdpaepe commented 4 years ago

Many thanks for taking the time to consider these issues. Sorry for my late reply.

Thanks for the clarification. I am glad to be on the same page with Borcard that the wa-scores are hard to interpret. I pondered whether to include this option (i.e. wa-scores) in the package, but decided against it because

  1. The biplot principle does not hold, and these plots have no mathematical interpretation
  2. Matters are further complicated by the fact that features in the unconstrained RC(M) model with linear response functions are arrows with unique origins.

The above is definitely true for continuous environmental variables. On the other hand, I am interested in the effect of the design variables on the community (I did not measure the explanatory variables, but rather imposed them by design). Hence, I know perfectly well which samples correspond to which combination of explanatory variables (by design), and plotting the sample scores (site scores) as a function of the explanatory variables does not give me a lot of additional information. With the wa scores, in contrast (correct me if I'm wrong, I'm not an expert in this), I can get some idea of which species respond most to the imposed design conditions, causing the samples to diverge and separate in the triplot? What I want to get out of this analysis is basically to which extent each design variable impacts the microbial community (e.g. proportion of explained variance) and whether the effect is significant. Is there a way to derive these measures (quantification of impact and p-value) from the RCM output?

You could do this yourself though, by extracting the species arrow coordinates with extractCoord() and then calculating the weighted average of their starting and end points for every sample, to obtain wa-scores (or rather wa-arrows) unique to each sample. Still, as before, it is not clear what these plots really mean.

I understand the demand for unique sample scores, but it is a bit contradictory to the idea underlying constrained analysis, whereby samples are only discriminated through the sample variables provided. You may want to use the unconstrained plot and colour the samples by the most important variables perhaps?

The spike-in variable can be seen as a grouping of timepoints (pre and post spike-in), and it is this variable I am mainly interested in (together with some other variables), not so much the continuous time variable (which is numeric and can take 15 different values). I would prefer to keep the spike-in variable in the model so as to quantify its effect and significance on the community (e.g. as in RDA, with permutation tests and R2adj), both of which are clear from the PCoA/unconstrained RCM model.

If the variable is continuous, it should be possible to include it in the model without the alias problem. This would also alleviate the problem of overlapping sample scores. Also, did you share the full error message? I would expect it to say which variable is aliased. We should be able to solve this.

The complete message was as follows: Error in checkAlias(datFrame, covariatesNames) : Sample variables 'Time_h352' are aliased with other variables. Drop some sample-variables and try again.

The variables causing the issue are:

ASV_2_metadata$Time_h
  [1]   0  16  16  16  16  16  16  40  40  40  40  40  40  64  64  64  64  64  64  88  88  88  88  88  88 112 112 112 112 112 112 136 136 136 136 136
 [37] 136 160 160 160 160 160 160 167 167 167 167 167 167 184 184 184 184 184 184 208 208 208 208 208 208 232 232 232 232 232 232 256 256 256 256 256
 [73] 256 280 280 280 280 280 280 304 304 304 304 304 304 328 328 328 328 328 328 352 352 352 352 352 352 359

ASV_2_metadata$Spike
 [1] FS   pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre
[25] pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  pre  post post post post post
[49] post post post post post post post post post post post post post post post post post post post post post post post post
[73] post post post post post post post post post post post post post post post post post post post post post post post post
[97] post Spike
Levels: FS pre Spike post

I will push the update later today.

kdpaepe commented 4 years ago

@kdpaepe I pushed the new version, you can try playing around with the rowExp (and colExp) parameters to RCM() to see if this solves matters. I would try values between 0.5 (the default) and 1. Can you let me know if this worked?

I have played around with the rowExp setting a bit, and here are my findings:

* Value 1 : Failure after 1 iteration => Error in nleqslv(fn = dNBpsis, x = psis[KK], theta = thetasMat, X = X,  :   initial value of fn function contains non-finite values (starting at index=1)
  Check initial x and/or correctness of function

* Value 0.9 and 0.8:  Failure after 32 iterations => Error in nleqslv(initParam[, i], reg = design, fn = dNBllcol_constr, theta = thetas[i],  :   non-finite value for `x[1]` supplied to function

* Value 0.7: Failure after 25 iterations

* Value 0.65: Failure after 21 iterations

* Value 0.61: Failure after 53 iterations

* Value 0.6: Failure after 141 iterations

* Value 0.59: Failure after 125 iterations

* Value 0.55: Failure after 22 iterations

* Value 0.5: Failure after 6 iterations

Could you give me any directions on how to optimize this value? Or is it just trial and error until it succeeds? I noticed that by default colExp=rowExp and I therefore started only providing the rowExp value. Would it be advisable to also set colExp?

sthawinke commented 4 years ago

Hence, I know perfectly well which samples correspond to which combination of explanatory variables (by design), and plotting the sample scores (site scores) as a function of the explanatory variables does not give me a lot of additional information.

I do think there is some information in the position of the samples, even if some overlap. E.g. if factor A has little impact on microbiome composition but factor B does, then samples with a different level of factor A tend to be close together, whereas samples with a different level of factor B are far apart. The additional information lies in the loadings of the environmental gradient, attaching more or less importance to the variables. Whether a sample variable is part of the design or measured is of no importance here.

With the wa scores, in contrast (correct me if I'm wrong, I'm not an expert in this), I can get some idea of which species respond most to the imposed design conditions, causing the samples to diverge and separate in the triplot?

The relationship between environmental sample variables and species is the same for RDA and RCM: both indeed show how sensitive the species are to the environmental variables through the length of their projections (the biplot principle, see also the vignette). This holds independent of the type of sample scores; they do not even need to be plotted. The "LC" scores (the ones used by RCM) then reflect the separation between the samples because of the effects of the environmental variables on the species (i.e. the biplot principle between samples and species, as in the unconstrained model).

"WA" scores are averages of species coordinates, and as such all shrunk towards the center. I've been trying to wrap my head around them, but I cannot see how e.g. distances between samples or relations with the two other components should be interpreted

What I want to get out of this analysis is basically to which extent each design variable impacts the microbial community (e.g. proportion of explained variance) and whether the effect is significant. Is there a way to derive these measures (quantification of impact and p-value) from the RCM output?

The loadings of the environmental variables give a good idea of how important each of them is; e.g. in your plot above this appears to be spike-in and position rather than nutrients. Significance could be assessed via permutation (which would be computation-intensive), but this is currently not implemented; RCM is really meant for exploration.

The complete message was as follows: Error in checkAlias(datFrame, covariatesNames) : Sample variables 'Time_h352' are aliased with other variables. Drop some sample-variables and try again.

Your output suggests that time is interpreted as a discrete variable. Can you try converting it explicitly with as.numeric()?
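For instance (a sketch on the phyloseq object from earlier in the thread; going through as.character() guards against factor-to-level-index coercion if Time_h is currently a factor):

library(phyloseq)
# force Time_h to numeric in the sample data before calling RCM
sample_data(phyloseqtable_2)$Time_h <-
  as.numeric(as.character(sample_data(phyloseqtable_2)$Time_h))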

sthawinke commented 4 years ago

@kdpaepe I pushed the new version, you can try playing around with the rowExp (and colExp) parameters to RCM() to see if this solves matters. I would try values between 0.5 (the default) and 1. Can you let me know if this worked?

I have played around with the rowExp setting a bit, and here are my findings:

* Value 1 : Failure after 1 iteration => Error in nleqslv(fn = dNBpsis, x = psis[KK], theta = thetasMat, X = X,  :   initial value of fn function contains non-finite values (starting at index=1)
  Check initial x and/or correctness of function

* Value 0.9 and 0.8:  Failure after 32 iterations => Error in nleqslv(initParam[, i], reg = design, fn = dNBllcol_constr, theta = thetas[i],  :   non-finite value for `x[1]` supplied to function

* Value 0.7: Failure after 25 iterations

* Value 0.65: Failure after 21 iterations

* Value 0.61: Failure after 53 iterations

* Value 0.6: Failure after 141 iterations

* Value 0.59: Failure after 125 iterations

* Value 0.55: Failure after 22 iterations

* Value 0.5: Failure after 6 iterations

Could you give me any directions on how to optimize this value? Or is it just trial and error until it succeeds? I noticed that by default colExp=rowExp and I therefore started only providing the rowExp value. Would it be advisable to also set colExp?

Hmm, this just seems like a difficult dataset; I do not have a solution, but do try with the numeric time variable as above.

sthawinke commented 4 years ago

Or try more stringent filtering of OTUs with many zeroes (the prevCutOff argument).
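For instance (a sketch reusing the earlier call; the exact meaning and default of prevCutOff are documented in ?RCM, so treat the value below as a placeholder):

ASV_RCM_2_lin <- RCM(phyloseqtable_2,
                     covariates = c("PD_colon", "Nutrient_condition", "Time_h"),
                     responseFun = "linear", rowExp = 0.1,
                     prevCutOff = 0.25)   # stricter trimming of taxa with many zeroes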

kdpaepe commented 4 years ago

@kdpaepe I pushed the new version, you can try playing around with the rowExp (and colExp) parameters to RCM() to see if this solves matters. I would try values between 0.5 (the default) and 1. Can you let me know if this worked?

I have played around with the rowExp setting a bit, and here are my findings:

* Value 1 : Failure after 1 iteration => Error in nleqslv(fn = dNBpsis, x = psis[KK], theta = thetasMat, X = X,  :   initial value of fn function contains non-finite values (starting at index=1)
  Check initial x and/or correctness of function

* Value 0.9 and 0.8:  Failure after 32 iterations => Error in nleqslv(initParam[, i], reg = design, fn = dNBllcol_constr, theta = thetas[i],  :   non-finite value for `x[1]` supplied to function

* Value 0.7: Failure after 25 iterations

* Value 0.65: Failure after 21 iterations

* Value 0.61: Failure after 53 iterations

* Value 0.6: Failure after 141 iterations

* Value 0.59: Failure after 125 iterations

* Value 0.55: Failure after 22 iterations

* Value 0.5: Failure after 6 iterations

Could you give me any directions on how to optimize this value? Or is it just trial and error until it succeeds? I noticed that by default colExp=rowExp and I therefore started only providing the rowExp value. Would it be advisable to also set colExp?

Hmm, this just seems like a difficult dataset; I do not have a solution, but do try with the numeric time variable as above.

With the rowExp = 0.65 setting and the numeric time variable I was able to produce this plot:

[Image: constrained RCM triplot with the numeric time variable]

Leaving out the time variable produced an error message after 64 iterations, but this worked at a rowExp value of 0.1: [Image: constrained RCM triplot without the time variable]

Leaving out the Spike variable but retaining the time effect, I get a result at a rowExp value of 0.1: [Image: constrained RCM triplot without the Spike variable]

So it seems that adding the rowExp argument is a valuable change, as I am now able to look at different combinations of variables using different settings of rowExp. Many thanks for this function update!

sthawinke commented 4 years ago

Great! I will leave the issue open as it may contain useful information for others.

kdpaepe commented 4 years ago

I have a follow-up question on this:

[Image: automatically generated plot.RCM triplot]

When I extract the coordinates to create a customized ggplot2 version of the plot, I get the following results for the variables:

$variables
                                            Dim1         Dim2
PD_colonDistal                       0.350770628 -0.601956766
PD_colonProximal                    -0.350770628  0.601956766
Nutrient_conditionNutrient deprived -0.038320818  0.127784386
Nutrient_conditionNutrient limited   0.003701289 -0.123903637
Nutrient_conditionNutrient rich      0.034619530 -0.003880748
Time_h

Looking at the plot, this does not seem to correspond. Do I need to perform some kind of transformation to obtain the same values as in the automatically generated plot? Because now I get: [Image: customized ggplot2 plot with differently scaled arrows]

Likewise, the taxon scores seem to differ a bit.

sthawinke commented 4 years ago

Yes, some transformation occurs inside the plot.RCM function to get the variable coordinates on the same approximate scale as the samples. You can scale these vectors, starting from the origin, at will without changing the interpretation, but do it equally for both dimensions. Also, make sure to use equal scales for both axes so as not to deform the plot (see coord_fixed()).

The species arrow lengths are scaled in the same way with respect to the samples.
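A sketch of what that can look like; the $samples component name is an assumption, $variables follows the output shown above, and the scaling factor is freely chosen:

library(ggplot2)
coords <- extractCoord(ASV_RCM_2_lin)
scalingFactor <- 10                       # same factor for both dimensions
ggplot(coords$samples, aes(x = Dim1, y = Dim2)) +
  geom_point() +
  geom_segment(data = coords$variables,
               aes(x = 0, y = 0,
                   xend = Dim1 * scalingFactor, yend = Dim2 * scalingFactor),
               arrow = arrow(length = unit(0.1, "cm"))) +
  coord_fixed()                           # equal axis scales, so the plot is not deformed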

kdpaepe commented 4 years ago

Yes, some transformation occurs inside the plot.RCM function to get the variable coordinates on the same approximate scale as the samples. You can scale these vectors, starting from the origin, at will without changing the interpretation, but do it equally for both dimensions. Also, make sure to use equal scales for both axes so as not to deform the plot (see coord_fixed()).

The species arrow lengths are scaled in the same way with respect to the samples.

So I can just multiply species and variable scores by e.g. 10?

sthawinke commented 4 years ago

dataTax[, c("end1", "end2")] = dataTax[, c("origin1", "origin2")] + dataTax[, c("slope1", "slope2")] * scalingFactor

and then choose "scalingfactor" freely

kdpaepe commented 4 years ago
  • Variable scores: yes
  • Species arrows: no, the origin remains the same, but you can extend the arrows from there ($\neq$ just multiplying the coordinates of the end points). See the code of plot.RCM:

dataTax[, c("end1", "end2")] = dataTax[, c("origin1", "origin2")] + dataTax[, c("slope1", "slope2")] * scalingFactor

and then choose "scalingfactor" freely

Works like a charm! [Image: customized triplot with rescaled species arrows]

I just multiplied the ends with a scaling factor:

geom_segment(data = data.frame(ASV_RCM_2_lin_taxaforplot),
             aes(y = origin1, x = origin2, xend = end1 * scale_taxa, yend = end2 * scale_taxa,
                 label = rownames(ASV_RCM_2_lin_taxaforplot)),
             size = 0.5, arrow = arrow(length = unit(0.1, "cm")))

sthawinke commented 4 years ago

Hi, I am glad it worked. Yet my LaTeX code does not seem to have compiled: please do not multiply the ends only; it changes the direction of the arrows! The code I provided is simple and should do.

kdpaepe commented 4 years ago

Hello, sorry to disturb again, but I ran into some more issues when verifying the model assumptions.

ASV_RCM_2_lin <- RCM(phyloseqtable_2,
                     covariates = c("PD_colon", "Nutrient_condition", "Time_h"),
                     responseFun = "linear", verbose = TRUE, allowMissingness = TRUE,
                     nleqslv.control = list(maxit = 1e3L, cndtol = 1e-18), rowExp = 0.1)

residualPlot(ASV_RCM_2_lin, whichTaxa = "response", resid = "Deviance", numTaxa = 15)
residualPlot(ASV_RCM_2_lin, whichTaxa = "response", resid = "Pearson", numTaxa = 15)
residualPlot(ASV_RCM_2_lin, whichTaxa = "runs", resid = "Deviance", numTaxa = 15)
residualPlot(ASV_RCM_2_lin, whichTaxa = "runs", resid = "Pearson", numTaxa = 15)

Irrespective of which parameters I specify in the residualPlot function, I get this error message: Error in par(parTmp) : invalid value specified for graphical parameter "pin"

Despite this error, plots do appear (see example below). [Image: example residualPlot output]

As some patterns seem to be present, I would like to colour the samples by factor levels (as in the publication). But including a samColour or samShape argument does not have any effect: residualPlot(ASV_RCM_2_lin, whichTaxa = "response", resid = "Deviance", numTaxa = 15, samColour = "Nutrient_condition", samShape = "PD_colon")

Additionally, when I try to colour samples by influence, RCM throws the following error message: plot(ASV_RCM_2_lin, plotType = "samples", Influence = TRUE, samSize = 2.5, samColour = "PD_colon") => Error in NBalphaInfl(x, inflDim)[, , samColour] : subscript out of bounds

Meanwhile, plot(ASV_RCM_2_lin, plotType = "species", Influence = TRUE, samSize = 2.5, samColour = "Nutrient_condition") does work, but I am not sure how to interpret the resulting plot: [Image: species influence plot]

In the unconstrained case, plot(ASV_RCM_2, plotType = "samples", Influence = TRUE, samSize = 2.5) works too, but it is not very informative. Do I need to specify a design factor with samColour in the unconstrained case too? [Image: sample influence plot, unconstrained fit]

Furthermore, I was wondering what range of deviance values is acceptable. The values I obtained in the unconstrained case seem quite high compared to the ones in the example datasets. [Image: deviance values, unconstrained fit]

Finally, I was a bit confused as to what the deviance in the plots refers to (the sum, or the mean, of the squared deviance residuals) when comparing the supplementary file of the paper with the vignette, where I found the following two paragraphs:

2.4.3.1 Deviance residuals The deviance residuals d_ij of the negative binomial distribution are defined as (Zwilling 2013) [formula not rendered here]. Their sum of squares equals the total deviance per sample or taxon. We can visually represent the mean deviance per sample or taxon by colour codes.

whereas in the vignette I came across the following:

Also for constrained ordination it can be interesting to use the deviance residuals. We can use them in the unconstrained case by summing over taxa or samples, or plot them versus the environmental gradient to detect lack of fit for the shape of the response function.
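For reference, my reading of that definition as the standard negative binomial deviance residual, in the size (theta) parameterization; this is a sketch of the textbook formula, not code from the RCM package, and mu, theta and the guard at y = 0 are my assumptions:

# Negative binomial deviance residuals with mean mu and size theta (sketch);
# the y * log(y / mu) term is taken as 0 when y == 0 (its limit)
nbDevianceResiduals <- function(y, mu, theta) {
  term1 <- ifelse(y == 0, 0, y * log(y / mu))
  term2 <- (y + theta) * log((y + theta) / (mu + theta))
  sign(y - mu) * sqrt(pmax(2 * (term1 - term2), 0))
}
# Squaring and summing over samples then gives the total deviance per taxon,
# matching "their sum of squares equals the total deviance" above.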

sthawinke commented 4 years ago

I opened a new issue #15 for this question