CenterForStatistics-UGent / RCM

A model-based visualization method for microbiome data
16 stars 2 forks source link

Diagnostics #15

Closed sthawinke closed 3 years ago

sthawinke commented 4 years ago

Hello, sorry to disturb again. But I ran into some more issues when verifying assumptions.

_ASV_RCM_2_lin <- RCM(phyloseqtable_2,covariates=c("PD_colon","Nutrient_condition","Timeh"),responseFun = "linear", verbose=TRUE, allowMissingness=TRUE,nleqslv.control = list(maxit = 1e3L, cndtol = 1-18),rowExp=0.1)

_residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Pearson", numTaxa =15) residualPlot(ASV_RCM_2_lin, whichTaxa = "runs",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2lin, whichTaxa = "runs",resid = "Pearson", numTaxa =15)

Irrespective of which parameters I specify in the resisualPlot function, I get this error message: Error in par(parTmp) : invalid value specified for graphical parameter "pin"

Despite this error, plots do appear (see example below). image

As some patterns seem to be present, I would like to colour samples by factor levels (as in the publication). But including a samColour or samShape argument does not have any effect: _residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15,samColour = "Nutrient_condition",samShape="PDcolon")

Adittionally, when I try to colour samples by Influence, RCM throws the following error message _plot(ASV_RCM_2_lin, plotType = "samples", Influence=TRUE, samSize = 2.5, samColour="PDcolon") Error in NBalphaInfl(x, inflDim)[, , samColour] : subscript out of bounds

While, _plot(ASV_RCM_2_lin, plotType = "species", Influence=TRUE, samSize = 2.5, samColour="Nutrientcondition") does work. But I am not sure how to interpret the resulting plot: image

In the unconstrained case: _plot(ASV_RCM2, plotType = "samples", Influence=TRUE, samSize = 2.5) works too. But it is not very informative. Do I need to specify a design factor with samColour in the unconstrained case too? image

Furthermore, I was wondering what range of deviance values is acceptable. The values I obtained in the unconstrained case seem quite high compared to the ones in the example datasets. image

Finally, I was a bit confused as to what the deviance in the plots refers to (sum or mean of the sum of squares of the deviance residuals) when comparing the supplementary file of the paper versus the vignette, where I found the following 2 paragraphs:

2.4.3.1 Deviance residuals The deviance residuals dij of the negative binomial distribution are defined as (Zwilling 2013): Their sum of squares equals the total deviance per sample or taxon. We can visually represent the mean deviance per sample or taxon by colour codes.

whereas in the vignette I came across the following:

Also for constrained ordination it can be interesting to use the deviance residuals. We can use them in the unconstrained case by summing over taxa or samples, or plot them versus the environmental gradient to detect lack of fit for the shape of the response function.

Originally posted by @kdpaepe in https://github.com/CenterForStatistics-UGent/RCM/issues/13#issuecomment-623114661

sthawinke commented 4 years ago

First of all congrats for checking the assumptions so carefully! Sorry for the many bugs, that was not the best tested part of the package. Let me know when you encounter further problems.

Hello, sorry to disturb again. But I ran into some more issues when verifying assumptions.

_ASV_RCM_2_lin <- RCM(phyloseqtable_2,covariates=c("PD_colon","Nutrient_condition","Timeh"),responseFun = "linear", verbose=TRUE, allowMissingness=TRUE,nleqslv.control = list(maxit = 1e3L, cndtol = 1-18),rowExp=0.1)

_residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Pearson", numTaxa =15) residualPlot(ASV_RCM_2_lin, whichTaxa = "runs",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2lin, whichTaxa = "runs",resid = "Pearson", numTaxa =15)

Irrespective of which parameters I specify in the resisualPlot function, I get this error message: Error in par(parTmp) : invalid value specified for graphical parameter "pin"

This command just restores the old par() settings after making the plot. I find it weird that it throws an error, but this is more of an R issue, perhaps check ?par. Can you run parTmp = par(no.readonly = TRUE); par(parTmp)

Despite this error, plots do appear (see example below). image

As some patterns seem to be present, I would like to colour samples by factor levels (as in the publication). But including a samColour or samShape argument does not have any effect: _residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15,samColour = "Nutrient_condition",samShape="PDcolon")

I guess this was due to the par() error above, it should be fixed now.

Adittionally, when I try to colour samples by Influence, RCM throws the following error message _plot(ASV_RCM_2_lin, plotType = "samples", Influence=TRUE, samSize = 2.5, samColour="PDcolon") Error in NBalphaInfl(x, inflDim)[, , samColour] : subscript out of bounds

While, _plot(ASV_RCM_2_lin, plotType = "species", Influence=TRUE, samSize = 2.5, samColour="Nutrientcondition") does work.

I consider this a bug, which should be patched now. You can still use plotType = "samples", but the name of the variable should refer to a dummy, something like "PD_colondistal" in your case. A more informative error should be thrown now anyway. Of course, for a binary variable it does not matter which dummy you pick, they are each other's mirror image. The syntax has been modified (you know use the 'inflVar' parameter to avoid confusion), so check the man pages again

But I am not sure how to interpret the resulting plot: image

If you try again plotting the samples it will show you the influence (in colour) of each of the samples on the variable which you will need to pick.

In the unconstrained case: _plot(ASV_RCM2, plotType = "samples", Influence=TRUE, samSize = 2.5) works too. But it is not very informative. Do I need to specify a design factor with samColour in the unconstrained case too? image

This was another bug, it should work now. It will colour the samples by total absolute influence on the psi parameter (put samColour = "Influence"). In the unconstrained case there is no clear other candidate for a parameter on which the influence needs to be shown.

Furthermore, I was wondering what range of deviance values is acceptable. The values I obtained in the unconstrained case seem quite high compared to the ones in the example datasets. image

I would not worry about this, the point is to find trends within your dataset, these values are not really comparable. A larger deviance may point to a lack of fit, but also to a noisier (or a larger) dataset, so it is really hard. What matters are patterns, and that is what you see here: the samples on the left have much poorer fits, which points to a defect in the model. Also from the other residual plots above, this suggests that the linear assumption for the response of the species to the gradient (made by e.g. RDA as well) is questionable. You may want to compare with the nonparametric response function.

Finally, I was a bit confused as to what the deviance in the plots refers to (sum or mean of the sum of squares of the deviance residuals) when comparing the supplementary file of the paper versus the vignette, where I found the following 2 paragraphs:

2.4.3.1 Deviance residuals The deviance residuals dij of the negative binomial distribution are defined as (Zwilling 2013): Their sum of squares equals the total deviance per sample or taxon. We can visually represent the mean deviance per sample or taxon by colour codes.

whereas in the vignette I came across the following:

Also for constrained ordination it can be interesting to use the deviance residuals. We can use them in the unconstrained case by summing over taxa or samples, or plot them versus the environmental gradient to detect lack of fit for the shape of the response function.

The package shows the sum of the squared deviance residuals, but as the total number is equal in all cases it does not really matter. Only relative measures are of importance.

kdpaepe commented 4 years ago

Thanks a lot for the help! A follow-up on these issues:

Hello, sorry to disturb again. But I ran into some more issues when verifying assumptions. _ASV_RCM_2_lin <- RCM(phyloseqtable_2,covariates=c("PD_colon","Nutrient_condition","Timeh"),responseFun = "linear", verbose=TRUE, allowMissingness=TRUE,nleqslv.control = list(maxit = 1e3L, cndtol = 1-18),rowExp=0.1) _residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Pearson", numTaxa =15) residualPlot(ASV_RCM_2_lin, whichTaxa = "runs",resid = "Deviance", numTaxa = 15) residualPlot(ASV_RCM_2lin, whichTaxa = "runs",resid = "Pearson", numTaxa =15) Irrespective of which parameters I specify in the resisualPlot function, I get this error message: Error in par(parTmp) : invalid value specified for graphical parameter "pin"

This command just restores the old par() settings after making the plot. I find it weird that it throws an error, but this is more of an R issue, perhaps check ?par. Can you run parTmp = par(no.readonly = TRUE); par(parTmp)

Despite this error, plots do appear (see example below). image As some patterns seem to be present, I would like to colour samples by factor levels (as in the publication). But including a samColour or samShape argument does not have any effect: _residualPlot(ASV_RCM_2_lin, whichTaxa = "response",resid = "Deviance", numTaxa = 15,samColour = "Nutrient_condition",samShape="PDcolon")

I guess this was due to the par() error above, it should be fixed now.

The par error is resolved. However, the colour coding by design factors still does not work.

Adittionally, when I try to colour samples by Influence, RCM throws the following error message _plot(ASV_RCM_2_lin, plotType = "samples", Influence=TRUE, samSize = 2.5, samColour="PDcolon") Error in NBalphaInfl(x, inflDim)[, , samColour] : subscript out of bounds While, _plot(ASV_RCM_2_lin, plotType = "species", Influence=TRUE, samSize = 2.5, samColour="Nutrientcondition") does work.

I consider this a bug, which should be patched now. You can still use plotType = "samples", but the name of the variable should refer to a dummy, something like "PD_colondistal" in your case. A more informative error should be thrown now anyway. Of course, for a binary variable it does not matter which dummy you pick, they are each other's mirror image. The syntax has been modified (you know use the 'inflVar' parameter to avoid confusion), so check the man pages again

I now tried: _plot(ASV_RCM_2_lin, plotType = "samples", Influence=TRUE, samSize = 2.5, inflVar="PD_colonDistal Colon",samShape="PDcolon")

However, the plot has no scale bar: image

and all points have a uniform colour? Either the influence is the same for every sample or something went wrong. Btw, I checked and it is not related to the space in the factor level names (I know its bad practice but it did not matter for any of my other applications and its easier for plotting) - see below without space.

image

But I am not sure how to interpret the resulting plot: image

If you try again plotting the samples it will show you the influence (in colour) of each of the samples on the variable which you will need to pick.

In the unconstrained case: _plot(ASV_RCM2, plotType = "samples", Influence=TRUE, samSize = 2.5) works too. But it is not very informative. Do I need to specify a design factor with samColour in the unconstrained case too? image

This was another bug, it should work now. It will colour the samples by total absolute influence on the psi parameter (put samColour = "Influence"). In the unconstrained case there is no clear other candidate for a parameter on which the influence needs to be shown.

I have now tried with: _plot(ASV_RCM2, plotType = "samples", samColour="Influence", samSize = 2.5) But I get the following error message:
_Error in get_variable(x$physeq, samColour) : Your phyloseq data object does not have a sample-data component Try ?sampledata for more details.

It is, however, unconstrained, which is the reason why I have not specified a sample_data component. With 'deviance' as SamColour the above works fine.

Furthermore, I was wondering what range of deviance values is acceptable. The values I obtained in the unconstrained case seem quite high compared to the ones in the example datasets. image

I would not worry about this, the point is to find trends within your dataset, these values are not really comparable. A larger deviance may point to a lack of fit, but also to a noisier (or a larger) dataset, so it is really hard. What matters are patterns, and that is what you see here: the samples on the left have much poorer fits, which points to a defect in the model.

So it means that the negative binomial model is not suitable or the departure is not well modelled in a reduced dimension. But would this be resolved by using a non-parametric response in the unconstrained case as the negative binomial is still used for the independence model? Can I then still produce interpretable triplots? Because that was the aim of this unconstrained analysis. I tried with the non-parametric option, but it took for ages and I interrupted the process since I first want to make sure the non-parametric option is useful in the unconstrained case? The analysis results do make sense (correspond with what I see in the PcoA or just in the raw data and what I biologically expect) despite the trend in deviance.

Also from the other residual plots above, this suggests that the linear assumption for the response of the species to the gradient (made by e.g. RDA as well) is questionable. You may want to compare with the nonparametric response function.

With a non-parametric response function I obtain this pattern (for the props data a similar pattern was observed in the supplementary file of the paper and you conclude there that a linear function might be preferable?). image

I performed distance based redundancy analysis (using pcoa and not pca scores as input and on a distance metric). Hence the input data is different from the raw counts used by RCM. I verified the homogeneity of variance assumption with betadisper for the RDA. It was met for Colon region and Nutrient condition, but not for Time. I therefore left the time compound out of the model for the anova (as I was not so much interested in this compound anyway). Which might be the answer why the deviance is lower to the right of the plot: Because: image image As you can see the earlier timepoints have smaller dispersions (biologically makes sense because the reactor is stabilizing in the beginning starting from one and the same inoculum used to inoculate all the colon vessels, and after a few HRTs the community adapts to the different conditions - eg develops into a proximal versus distal community and adapts to the differences in nutrient conditions. This evolution is nicely shown in the pcoa and (un)constrained RCM plots). The timepoint 1 and 15 are labeled and you see the communities diverging.

image

I think I can based on the unconstrained analysis show that it makes sense to skip the first 7 timepoints and work with the stabilized communities to assess the effects of colon region and nutrient condition in the constrained cases. This might allow me to use the linear response functions in the RCM model.

Finally, I was a bit confused as to what the deviance in the plots refers to (sum or mean of the sum of squares of the deviance residuals) when comparing the supplementary file of the paper versus the vignette, where I found the following 2 paragraphs: 2.4.3.1 Deviance residuals The deviance residuals dij of the negative binomial distribution are defined as (Zwilling 2013): Their sum of squares equals the total deviance per sample or taxon. We can visually represent the mean deviance per sample or taxon by colour codes. whereas in the vignette I came across the following: Also for constrained ordination it can be interesting to use the deviance residuals. We can use them in the unconstrained case by summing over taxa or samples, or plot them versus the environmental gradient to detect lack of fit for the shape of the response function.

The package shows the sum of the squared deviance residuals, but as the total number is equal in all cases it does not really matter. Only relative measures are of importance.

What do you mean by all cases?

Finally, I have one more question. Is it possible to look at interactions in the RCM model. And how do you decide whether or not to include interaction terms and how to interpret main effects when interactions might be present?

sthawinke commented 4 years ago

The par error is resolved. However, the colour coding by design factors still does not work.

I added colour to this type of plot as well, but without legend, as these default R-plots are hard to program. So either you add the legend manually (or make the plots yourself, I exported getDevianceRes for this purpose), or make the plots one taxon at the time, in which case you get ggplots.

I now tried: _plot(ASV_RCM_2_lin, plotType = "samples", Influence=TRUE, samSize = 2.5, inflVar="PD_colonDistal Colon",samShape="PDcolon")

However, the plot has no scale bar: image

and all points have a uniform colour? Either the influence is the same for every sample or something went wrong. Btw, I checked and it is not related to the space in the factor level names (I know its bad practice but it did not matter for any of my other applications and its easier for plotting) - see below without space.

I have now tried with: _plot(ASV_RCM2, plotType = "samples", samColour="Influence", samSize = 2.5) But I get the following error message: _Error in get_variable(x$physeq, samColour) : Your phyloseq data object does not have a sample-data component Try ?sampledata for more details.

It is, however, unconstrained, which is the reason why I have not specified a sample_data component. With 'deviance' as SamColour the above works fine.

I am sorry about that, see the updated vignette. Note that the influence is always calculated with respect to some variable. In the unconstrained case this choice is not so clear, so I suggest you set inflVar = "psi".

So it means that the negative binomial model is not suitable or the departure is not well modelled in a reduced dimension. But would this be resolved by using a non-parametric response in the unconstrained case as the negative binomial is still used for the independence model? Can I then still produce interpretable triplots? Because that was the aim of this unconstrained analysis. I tried with the non-parametric option, but it took for ages and I interrupted the process since I first want to make sure the non-parametric option is useful in the unconstrained case? The analysis results do make sense (correspond with what I see in the PcoA or just in the raw data and what I biologically expect) despite the trend in deviance.

Below you provide a good explanation, so the deviance plot has served its purpose, to help you better understand the data and its patterns. Such clearly linear trend indeed points to some problem with a variable rather than lack of fit.

The NB outcome distribution is used in all forms of the RCM model. "Non-parametric" refers to the species response function, as an alternative to "linear". It can be used to provide a better fit, but does not allow to make triplots, so if you can use the linear model as you did that is preferable

With a non-parametric response function I obtain this pattern (for the props data a similar pattern was observed in the supplementary file of the paper and you conclude there that a linear function might be preferable?).

This indeed suggests that the linear response function is acceptable. So the problem may come from elsewhere (confounders, lack of fit to the NB), but I would not worry too much about it

I performed distance based redundancy analysis (using pcoa and not pca scores as input and on a distance metric). Hence the input data is different from the raw counts used by RCM. I verified the homogeneity of variance assumption with betadisper for the RDA. It was met for Colon region and Nutrient condition, but not for Time. I therefore left the time compound out of the model for the anova (as I was not so much interested in this compound anyway). Which might be the answer why the deviance is lower to the right of the plot: Because: image image As you can see the earlier timepoints have smaller dispersions (biologically makes sense because the reactor is stabilizing in the beginning starting from one and the same inoculum used to inoculate all the colon vessels, and after a few HRTs the community adapts to the different conditions - eg develops into a proximal versus distal community and adapts to the differences in nutrient conditions. This evolution is nicely shown in the pcoa and (un)constrained RCM plots). The timepoint 1 and 15 are labeled and you see the communities diverging.

image

I think I can based on the unconstrained analysis show that it makes sense to skip the first 7 timepoints and work with the stabilized communities to assess the effects of colon region and nutrient condition in the constrained cases. This might allow me to use the linear response functions in the RCM model.

Your explanation sounds sensible, but I cannot add much to this. Notice also how RCM better separates solid from dotted lines in the second dimension, whereas PCoA separates the colours

The package shows the sum of the squared deviance residuals, but as the total number is equal in all cases it does not really matter. Only relative measures are of importance.

What do you mean by all cases?

For a sample, the total number of terms summed is the number of taxa, which is equal for each sample. For that reason, using the sum or the mean does not matter when comparing samples.

Finally, I have one more question. Is it possible to look at interactions in the RCM model. And how do you decide whether or not to include interaction terms and how to interpret main effects when interactions might be present?

You can add interactions in the constrained models, although you need to create the (dummy) variables manually and add them to the sample data object (yet this may lead to aliasing). There is no significance testing to decide this, just explore the different models. The interpretation of interactions is the same as e.g. in linear regression.

kdpaepe commented 4 years ago

Thanks a lot for the detailed answers. One more issue remains: If I try: _plot(RCM_ASV_2_constrlin, plotType = "samples", Influence=TRUE, inflVar="psi") I get: Error in rcm$alpha[, Dim] %*% rcm$covariates : non-conformable arguments

In the unconstrained case this works (although there is no colour gradient, which probably just indicates that there are no outlier samples): plot(RCM_genus_2_lin, plotType = "samples", Inflvar = "psi", samSize = 2.5) image

However, now in the unconstrained case colouring by deviance does not work anymore: _plot(RCM_genus_2lin, plotType = "samples", samColour = "Deviance", samSize = 2.5) And produces the following error: Error in access(object, "sam_data", errorIfNULL) : sam_data slot is empty.

sthawinke commented 4 years ago

Thanks a lot for the detailed answers. One more issue remains: If I try: _plot(RCM_ASV_2_constrlin, plotType = "samples", Influence=TRUE, inflVar="psi") I get: Error in rcm$alpha[, Dim] %*% rcm$covariates : non-conformable arguments

This has been fixed

In the unconstrained case this works (although there is no colour gradient, which probably just indicates that there are no outlier samples): plot(RCM_genus_2_lin, plotType = "samples", Inflvar = "psi", samSize = 2.5)

This works for me, but try with inflVar rather than Inflvar?

However, now in the unconstrained case colouring by deviance does not work anymore: _plot(RCM_genus_2lin, plotType = "samples", samColour = "Deviance", samSize = 2.5) And produces the following error: Error in access(object, "sam_data", errorIfNULL) : sam_data slot is empty.

I think this should be fixed too. Note that there is no harm in providing a phyloseq object with sam_data slot for an unconstrained analysis, as long as you do not specify covariates =...

kdpaepe commented 4 years ago

Everything works fine now. Many thanks for the support!

Thanks a lot for the detailed answers. One more issue remains: If I try: _plot(RCM_ASV_2_constrlin, plotType = "samples", Influence=TRUE, inflVar="psi") I get: Error in rcm$alpha[, Dim] %*% rcm$covariates : non-conformable arguments

This has been fixed

In the unconstrained case this works (although there is no colour gradient, which probably just indicates that there are no outlier samples): plot(RCM_genus_2_lin, plotType = "samples", Inflvar = "psi", samSize = 2.5)

This works for me, but try with inflVar rather than Inflvar?

However, now in the unconstrained case colouring by deviance does not work anymore: _plot(RCM_genus_2lin, plotType = "samples", samColour = "Deviance", samSize = 2.5) And produces the following error: Error in access(object, "sam_data", errorIfNULL) : sam_data slot is empty.

I think this should be fixed too. Note that there is no harm in providing a phyloseq object with sam_data slot for an unconstrained analysis, as long as you do not specify covariates =...