interpretation of multivariate residuals

jakeberv commented 8 months ago

Hi Julien,

I was wondering if you could help me understand something about the residuals from the output of mvgls. If I do, for example, Y~1, where Y is a matrix of traits, and fit mvBM, I get out a matrix of residuals (one column per trait) and the option to generate normalized residuals. Is the interpretation of normalized residuals similar to that for 'phylogenetic' residuals? In a model like Y~1, these residuals represent values that are "corrected for" the inferred multivariate phylogenetic structure, under the specified model? Is this a correct interpretation?

Thanks, Jacob Berv

JClavel commented 8 months ago

Hi Jacob,

If you use residuals() on an object fitted by the mvgls function, you obtain the raw residuals by default, that is, the data minus the fitted values. If you request the "normalized" residuals (residuals(fit, type="normalized")), you obtain residuals that have been standardized by the inverse square root of the variance-covariance matrix corresponding to the phylogenetic model. Assuming the fitted process is correct, the residuals should all be independent across species after this transformation (but the residual vectors, i.e. the columns, can still be correlated with each other).
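A minimal sketch of what this standardization does, written in plain Python rather than R and not taken from the mvMORPH source. It assumes a known between-species covariance matrix C (here a made-up 3-species example), an intercept-only model (Y ~ 1), and the Cholesky factor of C as one possible choice of matrix square root; the square root actually used by mvgls may differ. All data values and helper names are hypothetical.

```python
import math

def cholesky(C):
    """Lower-triangular L with L L^T = C (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L, i.e. compute L^{-1} b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def whiten_columns(L, M):
    """Apply L^{-1} to every column of the n x p matrix M."""
    n, p = len(M), len(M[0])
    cols = [forward_solve(L, [M[i][j] for i in range(n)]) for j in range(p)]
    return [[cols[j][i] for j in range(p)] for i in range(n)]

# Hypothetical covariance implied by a tree: species 1 and 2 share 60% of
# their history, species 3 is an outgroup.
C = [[1.0, 0.6, 0.0],
     [0.6, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Hypothetical data: 3 species (rows) x 2 traits (columns).
Y = [[1.0, 2.0],
     [1.4, 2.5],
     [0.2, 1.0]]

L = cholesky(C)

# GLS intercept for each trait: mu = (1' C^-1 1)^-1 1' C^-1 y,
# computed as an ordinary least-squares fit on the whitened scale.
ones_w = forward_solve(L, [1.0] * len(C))
Y_w = whiten_columns(L, Y)
denom = sum(v * v for v in ones_w)
mu = [sum(ones_w[i] * Y_w[i][j] for i in range(len(C))) / denom
      for j in range(len(Y[0]))]

# Raw residuals: data minus fitted values (still phylogenetically structured).
raw = [[Y[i][j] - mu[j] for j in range(len(mu))] for i in range(len(Y))]

# Normalized residuals: raw residuals multiplied by L^{-1}, so that rows
# (species) are uncorrelated if C describes the true process.
normalized = whiten_columns(L, raw)
```

The transformation only acts on the rows (species). Nothing here touches the covariance between the trait columns, which is why the columns of the normalized residuals can remain correlated.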

Note that for the "mvols" function, the "normalized" type can be used only if weights have been provided.

Regards,

Julien

jakeberv commented 8 months ago

Cool, thanks. So the normalized residuals from mvgls can be considered phylogenetically independent? What if, for example, I wanted to generate size-corrected residuals that take the multivariate VCV into account? Would it work to take normalized residuals from mvgls(Y~size), given a model and tree (where Y is a matrix of traits)? E.g., would this be similar to Revell's size-corrected residuals approach? https://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2009.00804.x

JClavel commented 8 months ago

Hi Jacob,

It's the "type" option that you should set to "normalized"; see ?residuals.mvgls

I think there's a bit of confusion here. Liam's paper shows that if you use the residuals from a model fit with GLS, you should still use comparative methods in downstream analyses conducted on these residuals, because they still contain phylogenetic signal (even though a phylogenetic model was used to compute them). This is because we simply used the GLS rather than the OLS estimate to compute the residuals; we did not remove the covariances. The residuals in his paper correspond to type="response" in mvgls (I'm using the same terminology as in the "nlme" package). They are the raw residuals from the GLS fit.

If you use the "normalized" residuals, the phylogenetic signal has been removed (residuals are independent between species), but some downstream analyses may still require comparative methods in some circumstances (depending on the relationships between some predictors and those transformed "data"). In fact, if your goal is to remove the effect of a covariate in a regression model (like testing the effect of some predictors after removing size), a better approach is to incorporate the covariate directly in the model and use appropriate tests (e.g., type I or II) to assess the effect of each predictor after accounting for the others. In the Supplementary Material (https://datadryad.org/stash/dataset/doi:10.5061/dryad.jsxksn052) of our paper https://doi.org/10.1093/sysbio/syaa010, we provide a tutorial illustrating this approach. I would discourage using residuals as "data" when we can simply use the appropriate model design, as this may lead to several biases.
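To illustrate the idea of putting the covariate in the model rather than analysing residuals, here is a rough single-trait sketch in plain Python. It is not the mvgls implementation and does not reproduce the multivariate type I/II tests from the tutorial. The covariance matrix C, the "size" and "habitat" predictors, and all values are hypothetical; the response is noise-free only so that the recovered coefficients are easy to check.

```python
import math

def cholesky(C):
    """Lower-triangular L with L L^T = C (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve(A, b):
    """Solve a small dense system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def gls_fit(X, y, C):
    """GLS coefficients b = (X' C^-1 X)^-1 X' C^-1 y, via whitening with L^{-1}."""
    L = cholesky(C)
    n, q = len(X), len(X[0])
    Xw = [forward_solve(L, [X[i][j] for i in range(n)]) for j in range(q)]
    yw = forward_solve(L, y)
    XtX = [[sum(a * b for a, b in zip(Xw[j], Xw[k])) for k in range(q)]
           for j in range(q)]
    Xty = [sum(a * b for a, b in zip(Xw[j], yw)) for j in range(q)]
    return solve(XtX, Xty)

# Hypothetical covariance for 4 species (two pairs of close relatives).
C = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.3],
     [0.0, 0.0, 0.3, 1.0]]

size = [1.0, 1.2, 2.0, 2.5]        # covariate we want to account for
habitat = [0.0, 1.0, 0.0, 1.0]     # predictor of interest

# Design matrix with intercept, size, and habitat in the same model.
X = [[1.0, s, h] for s, h in zip(size, habitat)]

# Noise-free response built from known coefficients (0.5, 2.0, 1.0).
y = [0.5 + 2.0 * s + 1.0 * h for s, h in zip(size, habitat)]

beta = gls_fit(X, y, C)  # [intercept, size effect, habitat effect]
```

The habitat coefficient here is estimated jointly with the size coefficient, so it is already the effect of habitat after accounting for size, with no intermediate residual step.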

Best wishes,

Julien

jakeberv commented 8 months ago

Sure, this all makes sense. In my case I am trying to come up with a way of incorporating information about phylogeny into a downstream machine learning task (where there is otherwise no obvious way to do so). So, I have been experimenting with mvgls and the idea of extracting phylogenetic residuals... Any thoughts in that context?

J

JClavel commented 8 months ago

I do not have enough information to advise properly. If you need to retain some phylogenetic information, you should use the raw residuals rather than the standardized/normalized ones; if you instead want data with minimal phylogenetic signal, you can use the normalized residuals.

Best wishes,

Julien

jakeberv commented 8 months ago

Interesting. In the case I mentioned, I am trying to use a supervised machine learning approach to generate estimates of variable importance for a classification problem. So, I basically want to "subtract" the phylogenetic signal as much as possible from the input data, so that the ranking of variable importance is less biased toward indicating phylogenetic signal. It seems that, for that case, using the normalized residuals from mvgls is the way to go.

In parallel, I am also interested in doing phylogenetic size correction, similar to Revell's approach but taking the multivariate VCV into account, for downstream applications like PhylogeneticEM or analyses focused on shape variation (while trying to account for size variation). In this case, would you suggest using the "response" type of residuals output from mvgls?

Thanks, Jake

JClavel / mvMORPH

interpretation of multivariate residuals #14