Evolutionary model for imputation is based on model estimation on missing dataset?

ericgoolsby / Rphylopars

Phylogenetic Comparative Tools for Missing Data and Within-Species Variation


Evolutionary model for imputation is based on model estimation on missing dataset? #38

Closed: Jigyasa3 closed this issue 3 years ago

Jigyasa3 commented 4 years ago

Hey @r03ert0 @ericgoolsby @tdjames1

Thank you for the detailed explanation of the method in the 2016 paper, and for the tutorials on running the functions in the package. I have a question, if that's okay. In Example 3 and Example 4, phylogenetic signal and evolutionary models are estimated on the dataset with missing values. Does that mean that if I want to impute missing data based on an evolutionary model, I should fit these models to the incomplete dataset and then pick the "best" model for imputation? That sounds counter-intuitive.

Looking forward to your reply!

ericgoolsby commented 4 years ago

Hi there, and thank you for your message! Imputations should be based on whichever model is best-supported. For example, if model="lambda" is best supported, then your imputations should be based on model="lambda" as well. If "BM" is best supported, then base your imputations off of model="BM" (etc). Does that make sense?
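As a rough illustration of what "best-supported" could mean in practice, here is a minimal sketch that ranks candidate models by AIC. The choice of AIC is an assumption (the reply above does not name a criterion), and the log-likelihoods and parameter counts are made-up placeholders, not output from Rphylopars.

```python
# Hypothetical fits: model name -> (maximized log-likelihood, free parameters).
# These values are placeholders for illustration only; in practice they would
# come from fitting each model to the same (incomplete) trait dataset.
fits = {
    "BM": (-120.4, 5),
    "lambda": (-115.1, 6),
    "OU": (-119.8, 6),
}


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * n_params - 2 * log_lik


def rank_models(fits):
    """Return (model, AIC) pairs sorted from lowest AIC to highest."""
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_models(fits)
best_model = ranking[0][0]  # the model to reuse for the imputation run
print(best_model)
```

Whichever model comes out on top would then be the one passed as the model argument for the imputation run, as described above.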

Rphylopars would benefit from more thorough documentation and tutorials, which will hopefully be included in the next CRAN update in a few months. In the meantime, I'm happy to answer any additional questions!

I would also highly recommend Luke Harmon's free open-source book on phylogenetic comparative methods. Here are links to the online version and the book PDF. If you have institutional library access to Springer, you might be able to access "Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology" as well -- another great reference.

Jigyasa3 commented 4 years ago

Thank you for replying @ericgoolsby ! I will check these resources out, thank you!

I have another question. In the examples, model selection and imputation are based on the complete dataset. I am imputing data consisting of the relative abundance of gene families. Not all gene families are correlated with each other, and each follows a different evolutionary rate model. Does that mean I should impute the data one column (i.e., one gene family) at a time?

ericgoolsby commented 4 years ago

In general, I would suggest making imputations based on traits that are correlated with one another, and running traits that aren't correlated in separate individual runs (unfortunately, in Rphylopars you can't fine-tune which traits are correlated and which are not). If you know which traits are correlated, you could subset your data to just those traits and run separate models and imputations for each cluster of traits.
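If the correlated groups are not known in advance, one rough way to find candidate clusters is sketched below: compute pairwise correlations from the rows where both traits are observed, then group traits that are linked by a strong correlation. This is only a screening heuristic and not part of Rphylopars; it ignores the phylogeny, and the column names, data values, and the 0.7 threshold are all hypothetical.

```python
from math import sqrt


def pearson_pairwise(x, y):
    """Pearson r using only rows where both values are observed (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0  # too few shared observations to judge
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation signal
    return sxy / sqrt(sxx * syy)


def correlated_clusters(columns, threshold=0.7):
    """Group column names into connected components of |r| >= threshold."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson_pairwise(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical gene-family abundances across six species; None marks missing.
columns = {
    "gf1": [1.0, 2.0, 3.0, None, 5.0, 6.0],
    "gf2": [2.1, 3.9, 6.2, 8.0, None, 12.1],
    "gf3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
clusters = correlated_clusters(columns)
print(clusters)
```

Each resulting cluster would then be subset from the full table and fitted and imputed in its own run, while single-trait clusters are run individually.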

You might also consider checking out the mvMORPH package, as it has a bit more flexibility with constraints on trait covariance and can handle a much wider selection of models.

Jigyasa3 commented 4 years ago

Thank you for replying! I have a follow-up question regarding the number of columns (i.e., gene families) that can be imputed together. I have gene families that are correlated, but if there are more than 10 columns, I get an error. I opened a new issue for it.