Caetanods / ratematrix

Bayesian estimation of the evolutionary rate matrix.

Correlations between one trait vector and 100 other vectors #51

Open bioinfowheat opened 3 years ago

bioinfowheat commented 3 years ago

Dear Daniel and Luke,

I'm trying to look at how the host breadth of tip taxa (a numerical value, ranging from 1 to 15) relates to the number of genes each tip taxon has in a gene family, and I have these values for 100 gene families. Thus, for a given gene family, I want to ask whether there is a positive or negative correlation between host breadth and number of genes among these species. Ideally, I'd get an estimate of the correlation, something akin to R², and a measure of significance. Eventually I'll have 100 of these, so I'll be looking at a histogram of correlation values.

I thought RateMatrix could help me do this, but it seems to want a categorical variable for comparing the correlations and I don't know how to get around that (I tried your tutorial). Perhaps this is not the best way forward with my ... issue.

thanks in advance, Chris

Caetanods commented 3 years ago

Hello Chris! I just saw your comment today. Super sorry for that. Apparently, GitHub is not sending me email notifications when an issue is posted here and I do not check the website often.

The ratematrix method uses a categorical predictor as a rate regime. Two things complicate your question with respect to using ratematrix. The first is that you have count data (number of hosts). The Brownian motion model (as well as other continuous trait models such as OU, EB, and ACDC) expects truly continuous data that can be modeled using a known distribution, such as a multivariate normal. Depending on how your count data is distributed, one of these models might not be adequate. You can try to transform the data to make it closer to continuous. If you go this route, I suggest that you conduct some model adequacy tests, which can be done using the approach of Pennell et al. (https://www.journals.uchicago.edu/doi/10.1086/682022). They have an R package for this analysis which is still working (https://github.com/mwpennell/arbutus).

The second issue is the predictor variable. If the number of genes took only a few distinct values, you could model it as discrete. However, I think the number of genes varies widely in your data, which would make the ratematrix approach less than ideal.

One alternative would be to treat both quantities as continuous and perform a correlation analysis using PGLS models (or similar). You have many (100) gene families to correlate with the same response variable, so you might need to correct for multiple testing, which will heavily penalize your tests. Alternatively, if you start with a descriptive approach, you could try a pPCA analysis to investigate which gene families covary most strongly with host breadth (checking the loadings on the axes of variation).
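To give a rough idea of what the multiple-testing penalty looks like, here is a minimal Python sketch of two common p-value adjustments. It is only an illustration and not part of ratematrix or arbutus. The choice of Bonferroni and Benjamini-Hochberg is an assumption (no specific correction is named here), and the function names `bonferroni` and `benjamini_hochberg` and the example p-values are made up.

```python
# Illustrative sketch only: two standard p-value adjustments.
# The function names and the example p-values are hypothetical.

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values stay monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values standing in for a handful of per-gene-family tests.
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
bonf = bonferroni(example_pvalues)
bh = benjamini_hochberg(example_pvalues)
for p, b, q in zip(example_pvalues, bonf, bh):
    print(f"raw={p:.3f}  bonferroni={b:.3f}  bh={q:.3f}")
```

With 100 tests, Bonferroni multiplies every p-value by 100, which is the heavy penalty mentioned above. A false discovery rate adjustment such as Benjamini-Hochberg is usually less severe, but whether it suits your question is for you to judge.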

I am so sorry it took forever to see your message here. Please send me an email if that happens again (caetanods1 on the gmail dot com)

Daniel