Open nick-youngblut opened 4 years ago
If I include another, fully continuous trait, I get the error:
Error in data.frame(species = trait_data$species, imputed$recon_ind): arguments imply differing number of rows: 185, 0
Traceback:
1. phylopars(trait_data = trait_df, tree = host_tree)
2. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
3. data.frame(species = trait_data$species, imputed$recon_ind)
4. stop(gettextf("arguments imply differing number of rows: %s",
. paste(unique(nrows), collapse = ", ")), domain = NA)
...but if I leave out the 3-level factor trait and just include the "species" column and the continuous trait column, then phylopars works correctly.
I've been experimenting with this, and it seems that many different inputs combining multi-level factor traits (converted to numeric) with continuous variables generate the error:
Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not a matrix.
Traceback:
1. phylopars(trait_data = trait_df,
. tree = host_tree)
2. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
3. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1,
. pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind,
. ind_edge = ind_edge, parent_edges = parent_edges, pars = pars,
. nvar = nvar, phylocov_diag = as.integer(!phylo_correlated),
. nind = nind, nob = nob, nspecies = nspecies, nedge = nedge,
. anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset,
. un_species_subset = un_species_subset, subset_list = subset_list,
. ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind,
. fixed_mu = mu, ret_level = 3, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]),
. phylocov_fixed = phylocov_fixed, is_phenocov_list = 0, phenocov_list = list(),
. is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]),
. phenocov_fixed = phenocov_fixed, OU_len = list())
...while some multi-level traits do work. It's really unclear why some work and some generate this error.
I am trying to run phylopars() on a set of continuous traits. A preview of my input data.frame:
species V2
s__Acetatifactor_sp900066365 s__Acetatifactor_sp900066365 0.0007380356
s__Acetatifactor_sp900066565 s__Acetatifactor_sp900066565 0.0005074156
s__Agathobacter_faecis s__Agathobacter_faecis 0.0025116876
s__Agathobacter_rectalis s__Agathobacter_rectalis 0.0131294618
s__Agathobacter_sp900317585 s__Agathobacter_sp900317585 0.0031340672
V3 V4 V5 V6
s__Acetatifactor_sp900066365 0.001421438 0.0006370219 0.0005877497 0.001166753
s__Acetatifactor_sp900066565 0.001567913 0.0008604320 0.0008487620 0.001347186
s__Agathobacter_faecis 0.004484746 0.0025899966 0.0017205411 0.003644285
s__Agathobacter_rectalis 0.009425716 0.0141831984 0.0032517141 0.008315006
s__Agathobacter_sp900317585 0.002460970 0.0032725461 0.0008395742 0.002190721
V7 V8 V9 V10
s__Acetatifactor_sp900066365 0.0005121238 0.0002571268 0.001762162 0.0007861735
s__Acetatifactor_sp900066565 0.0003313521 0.0004847163 0.001853193 0.0102120276
s__Agathobacter_faecis 0.0011752616 0.0001254030 0.005622542 0.0002754917
s__Agathobacter_rectalis 0.0043995029 0.0010882833 0.011637338 0.0042903712
s__Agathobacter_sp900317585 0.0014388840 0.0002850596 0.003105409 0.0007939048
V11 V12 V13
s__Acetatifactor_sp900066365 0.0004411348 0.0005846794 0.0006182063
s__Acetatifactor_sp900066565 0.0007553928 0.0005286972 0.0050530630
s__Agathobacter_faecis 0.0182341994 0.0031211565 0.0008389246
s__Agathobacter_rectalis 0.0152277543 0.0114247042 0.0031311814
s__Agathobacter_sp900317585 0.0028675800 0.0021177798 0.0006213883
V14 V15 V16 V17
s__Acetatifactor_sp900066365 0.0001938570 0.000818661 0.0001200000 0.0002485115
s__Acetatifactor_sp900066565 0.0005943459 0.003482738 0.0007528935 0.0009862291
s__Agathobacter_faecis 0.0004640492 0.002320781 0.0001594738 0.0001920727
s__Agathobacter_rectalis 0.0101299631 0.055806823 0.0043490765 0.0012795696
s__Agathobacter_sp900317585 0.0032585503 0.012220729 0.0011372366 0.0002485115
V18 V19 V20
s__Acetatifactor_sp900066365 0.001548092 0.0004694376 0.005377131
s__Acetatifactor_sp900066565 0.001314906 0.0037912056 0.002098138
s__Agathobacter_faecis 0.003875061 0.0021836612 0.006754514
s__Agathobacter_rectalis 0.009067108 0.0142452002 0.008615715
s__Agathobacter_sp900317585 0.002191376 0.0034489308 0.002204897
Dims of my full data.frame: 135 97
If I provide the entire data.frame to phylopars() (default params), I get:
Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree) # at line 120-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1,
. pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind,
. ind_edge = ind_edge, parent_edges = parent_edges, pars = pars,
. nvar = nvar, phylocov_diag = as.integer(!phylo_correlated),
. nind = nind, nob = nob, nspecies = nspecies, nedge = nedge,
. anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset,
. un_species_subset = un_species_subset, subset_list = subset_list,
. ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind,
. fixed_mu = matrix(0), ret_level = 1, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]),
. phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list),
. phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]),
. phenocov_fixed = phenocov_fixed, OU_len = list())
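The `type=character; target=double` message above suggests that at least one value reaching the compiled code is text rather than a number. One possible cause, and this is only a guess since the full data.frame is not shown, is a trait column stored as character or factor. A minimal base-R sketch for checking this follows; `find_non_numeric_traits` and the toy data are hypothetical and not part of Rphylopars.

```r
# Hypothetical helper (not part of Rphylopars): list trait columns
# that are not stored as numeric.
find_non_numeric_traits <- function(trait_data, species_col = "species") {
  trait_cols <- setdiff(names(trait_data), species_col)
  is_num <- vapply(trait_data[trait_cols], is.numeric, logical(1))
  trait_cols[!is_num]
}

# Toy data standing in for res_taxa (values invented for illustration).
toy <- data.frame(
  species = c("sp_a", "sp_b", "sp_c"),
  V2 = c(0.1, 0.2, 0.3),
  V3 = c("0.5", "0.6", "n/a"),  # numbers that were read in as text
  stringsAsFactors = FALSE
)

bad_cols <- find_non_numeric_traits(toy)
print(bad_cols)
```

If the check returns any column names, converting or dropping those columns before calling phylopars() would be the first thing to try.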
If I provide just the first 3 columns of the data.frame:
Error in if (ll2 < ll) {: missing value where TRUE/FALSE needed
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:3], tree = tree, pheno_error = FALSE,
. phylo_correlated = TRUE) # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
If I provide just the first 10 columns of the data.frame:
Error in tp(L = X_complete, R = as.matrix(as.double(dat)), Rmat = as.matrix(dat), : BLAS/LAPACK routine 'DLASCL' gave error code -4
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:10], tree = tree, pheno_error = FALSE,
. phylo_correlated = TRUE) # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
4. tp(L = X_complete, R = as.matrix(as.double(dat)), Rmat = as.matrix(dat),
. mL = ncol(X), mR = 1, pheno_error = pheno_error, edge_vec = edge_vec,
. edge_ind = edge_ind, ind_edge = ind_edge, parent_edges = parent_edges,
. pars = pars, nvar = nvar, phylocov_diag = as.integer(!phylo_correlated),
. nind = nind, nob = nob, nspecies = nspecies, nedge = nedge,
. anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset_complete,
. un_species_subset = un_species_subset_complete, subset_list = subset_list_complete,
. ind_list = ind_list_complete, tip_combn = tip_combn_complete,
. is_edge_ind = is_edge_ind, fixed_mu = matrix(0), ret_level = 2,
. is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]),
. phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list),
. phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]),
. phenocov_fixed = phenocov_fixed, OU_len = list())
If I provide just the first 20 columns of the data.frame:
Error in if (ll2 < ll) {: missing value where TRUE/FALSE needed
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:20], tree = tree, pheno_error = FALSE,
. phylo_correlated = TRUE) # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
If I create a tree via:
tree = simtraits(ntaxa = nrow(res_taxa), ntraits = 4, nreps = 1, nmissing = 10)$tree
tree$tip.label = res_taxa$species
...I get the error:
Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree) # at line 122-123 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1,
. pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind,
. ind_edge = ind_edge, parent_edges = parent_edges, pars = pars,
. nvar = nvar, phylocov_diag = as.integer(!phylo_correlated),
. nind = nind, nob = nob, nspecies = nspecies, nedge = nedge,
. anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset,
. un_species_subset = un_species_subset, subset_list = subset_list,
. ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind,
. fixed_mu = matrix(0), ret_level = 1, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]),
. phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list),
. phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]),
. phenocov_fixed = phenocov_fixed, OU_len = list())
...so it appears that the problem is not my tree
If I use simtraits to simulate traits and a tree with the same dimensions as my actual data:
simtraits(ntaxa = nrow(res_taxa), ntraits = ncol(res_taxa), nreps = 1, nmissing = 10)
I get:
Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:
1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree) # at line 124-125 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov,
. phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1,
. pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind,
. ind_edge = ind_edge, parent_edges = parent_edges, pars = biased_pars,
. nvar = nvar, phylocov_diag = as.integer(!phylo_correlated),
. nind = nind, nob = nob, nspecies = nspecies, nedge = nedge,
. anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset,
. un_species_subset = un_species_subset, subset_list = subset_list,
. ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind,
. fixed_mu = matrix(0), ret_level = 3, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]),
. phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list),
. phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]),
. phenocov_fixed = phenocov_fixed, OU_len = list())
If I simulate with <= 30 traits, phylopars() completes successfully.
sessionInfo:
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-ML2/lib/libopenblasp-r0.3.12.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rphylopars_0.3.2 BBmisc_1.11 checkmate_2.0.0
[4] parallelMap_1.5.0 randomForest_4.6-14 mlr_2.18.0
[7] ParamHelpers_1.14 Boruta_7.0.0 LeyLabRMisc_0.1.8
[10] ape_5.4-1 tidytable_0.5.8 data.table_1.13.6
[13] ggplot2_3.3.3 tidyr_1.1.2 dplyr_1.0.4
loaded via a namespace (and not attached):
[1] subplex_1.6 nlme_3.1-152 matrixStats_0.58.0
[4] repr_1.1.3 numDeriv_2016.8-1.1 Deriv_4.1.2
[7] tools_4.0.3 backports_1.2.1 R6_2.5.0
[10] colorspace_2.0-0 withr_2.4.1 tidyselect_1.1.0
[13] mnormt_2.0.2 phangorn_2.5.5 compiler_4.0.3
[16] expm_0.999-6 sandwich_3.0-0 labeling_0.4.2
[19] scales_1.1.1 mvtnorm_1.1-1 quadprog_1.5-8
[22] pbdZMQ_0.3-5 digest_0.6.27 base64enc_0.1-3
[25] pkgconfig_2.0.3 htmltools_0.5.1.1 parallelly_1.23.0
[28] plotrix_3.8-1 geiger_2.0.7 maps_3.3.0
[31] rlang_0.4.10 phylolm_2.6.2 farver_2.0.3
[34] generics_0.1.0 zoo_1.8-8 combinat_0.0-8
[37] jsonlite_1.7.2 gtools_3.8.2 magrittr_2.0.1
[40] modeltools_0.2-23 Matrix_1.3-2 Rcpp_1.0.6
[43] IRkernel_1.1.1 munsell_0.5.0 lifecycle_0.2.0
[46] scatterplot3d_0.3-41 stringi_1.5.3 multcomp_1.4-16
[49] clusterGeneration_1.3.7 MASS_7.3-53.1 grid_4.0.3
[52] listenv_0.8.0 parallel_4.0.3 strucchange_1.5-2
[55] crayon_1.4.1 lattice_0.20-41 IRdisplay_1.0
[58] splines_4.0.3 tmvnsim_1.0-2 pillar_1.4.7
[61] igraph_1.2.6 uuid_0.1-4 party_1.3-6
[64] future.apply_1.7.0 codetools_0.2-18 stats4_4.0.3
[67] fastmatch_1.1-0 XML_3.99-0.5 glue_1.4.2
[70] evaluate_0.14 doBy_4.6.8 deSolve_1.28
[73] vctrs_0.3.6 gtable_0.3.0 purrr_0.3.4
[76] future_1.21.0 coin_1.4-1 libcoin_1.0-8
[79] broom_0.7.5 phytools_0.7-70 coda_0.19-4
[82] survival_3.2-7 tibble_3.0.6 cluster_2.1.1
[85] globals_0.14.0 TH.data_1.0-10 ellipsis_0.3.1
I think clades with either 1) perfectly correlated or 2) non-variable traits create a matrix inversion issue as the algorithm traverses the tree from tips to root. I don't know why the error messages differ so much depending on the number of taxa and traits included, but there are a lot of optimization helper steps to improve starting parameters -- perhaps the number of species, traits, etc. determines where in the code the singularity occurs. I'm not surprised it fails when the number of traits approaches the number of species, but the error messages aren't helpful at all, and some of the errors you saw with lower trait dimensions were surprising. In any case, I would suspect perfect correlations and/or non-variable traits to be a common occurrence with factors. Rphylopars is currently ill-equipped to deal with that issue, but it will be handled in the major update currently under development. For issues regarding the number of traits, I think a new expectation maximization algorithm will take care of that, but we'll still be limited to ntraits < nspecies, absent some algorithmic trick to avoid errors.
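As a coarse first pass at the conditions described above, one could screen the whole trait table for constant traits, perfectly correlated trait pairs, and ntraits >= nspecies before calling phylopars(). The sketch below uses base R only; `check_degenerate_traits` and the toy data are hypothetical. It checks the full dataset rather than individual clades, so it can miss degeneracies that only appear within a clade.

```r
# Hypothetical helper (not part of Rphylopars): flag traits that could make
# a covariance matrix singular. Dataset-wide check only, not per clade.
check_degenerate_traits <- function(trait_data, species_col = "species",
                                    tol = 1e-8) {
  X <- as.matrix(trait_data[setdiff(names(trait_data), species_col)])
  stopifnot(is.numeric(X))

  # Traits with (near) zero variance among the observed values.
  vars <- apply(X, 2, var, na.rm = TRUE)
  constant <- colnames(X)[is.na(vars) | vars < tol]

  # Pairs of the remaining traits whose absolute correlation is ~1.
  keep <- setdiff(colnames(X), constant)
  collinear <- character(0)
  if (length(keep) > 1) {
    r <- suppressWarnings(
      cor(X[, keep, drop = FALSE], use = "pairwise.complete.obs")
    )
    idx <- which(abs(r) > 1 - tol & upper.tri(r), arr.ind = TRUE)
    collinear <- paste(rownames(r)[idx[, 1]], colnames(r)[idx[, 2]],
                       sep = " ~ ")
  }

  list(
    constant = constant,
    collinear = collinear,
    too_many_traits = ncol(X) >= length(unique(trait_data[[species_col]]))
  )
}

# Toy data (values invented for illustration).
t1 <- c(1.2, 0.7, 3.1, 2.4, 1.9, 0.3)
toy <- data.frame(
  species = paste0("sp_", 1:6),
  t1 = t1,
  t2 = 2 * t1 + 1,     # perfectly correlated with t1
  t3 = rep(0.5, 6),    # no variation
  t4 = c(0.4, 2.2, 1.1, 0.9, 3.3, 1.7)
)
res <- check_degenerate_traits(toy)
str(res)
```

A factor recoded as a handful of numeric levels is more likely to trip the first two checks than a continuous trait, which would be consistent with the pattern reported in this thread.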
Thanks for the explanation! The dev branch seems to help with some of these errors. Could one batch the traits in any feasible way (e.g., randomly batch traits) so that ntraits < nspecies, or would that lead to inaccurate results?
It's a tricky issue and a classic dilemma for sure, and the best approach really depends on the nature of the dataset and what you're trying to assess. If there is only 1 observation per species, there are viable solutions (see below), but there are trade-offs no matter what you do. If you're interested in e.g. pairwise correlations only (i.e., any 2 traits in a vacuum), then it might be fair to isolate the traits. For >2 traits in a model, if there is missing data and/or multiple observations per species, then the estimated variances and covariances will differ depending on which traits you include -- a bit of a disturbing property. Usually the differences won't be that large, but very strong correlations between traits can make or break a 'significant' correlation. The reason this happens is a bit nuanced -- it seems like a bug upon first glance, but it's inherent to pretty much any mixed model or missing data framework. Because the model estimates species means from trait-trait covariance, if there's either missing data or multiple within-species observations, then the estimated correlations among traits help inform the species-level means, and therefore the choice of traits can influence the estimated correlations. Hopefully that makes some sense...
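For the random batching asked about above, a minimal base-R sketch of splitting the trait columns could look like this. `batch_traits` is a hypothetical helper, the batch size of 30 is arbitrary, and the 96 trait names assume that the 135 x 97 data.frame has one species column followed by V2 to V97. As explained above, estimates from separate batches may differ from a joint fit, so this only illustrates the mechanics.

```r
# Hypothetical helper (not part of Rphylopars): split trait names into
# random batches of at most max_per_batch traits each.
batch_traits <- function(trait_names, max_per_batch, seed = 1) {
  stopifnot(max_per_batch >= 1)
  set.seed(seed)
  shuffled <- sample(trait_names)
  n_batches <- ceiling(length(shuffled) / max_per_batch)
  # Deal the shuffled names round-robin so batch sizes differ by at most one.
  unname(split(shuffled, rep_len(seq_len(n_batches), length(shuffled))))
}

# Assumed column names: V2..V97, i.e. 96 traits next to the species column.
trait_names <- paste0("V", 2:97)
batches <- batch_traits(trait_names, max_per_batch = 30)
print(lengths(batches))
```

Each batch could then be used to subset the data, e.g. `res_taxa[, c("species", batches[[1]])]`, before passing it to phylopars().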
Beyond Rphylopars, there are methods that explicitly account for 'high-dimensional' traits, most often applied to geometric morphometrics studies. However, these approaches are pretty computationally demanding, so it isn't (yet) feasible to incorporate multiple observations per species (though an imputation approach is suggested for missing data in the Discussion of Clavel et al. 2020). On the other hand, if you have 1) exactly 1 observation per taxon, per trait, and 2) assume that every trait evolved under the same evolutionary model (e.g. every trait evolved under Brownian motion; or every trait has the same phylogenetic signal using Pagel's lambda; or every trait shares the same OU alpha parameter across all traits), then you can use the mvMORPH package. It fits high-dimensional phylogenetic comparative models using a penalized likelihood framework. The mvgls function is described with a worked example in section 6 of this mvMORPH vignette. You can fit a multivariate Y to any set of predictors, or an intercept-only model if you're only interested in the multivariate Y. You can also perform phylogenetic MANOVA, as well as (regularized) phylogenetic PCA on intercept-only models.
There are more details in these two publications, along with helpful R code in the online supplements to further illustrate the methods.
Julien Clavel, Hélène Morlon, Reliable Phylogenetic Regressions for Multivariate Comparative Data: Illustration with the MANOVA and Application to the Effect of Diet on Mandible Morphology in Phyllostomid Bats, Systematic Biology, Volume 69, Issue 5, September 2020, Pages 927–943, https://doi.org/10.1093/sysbio/syaa010
Julien Clavel, Leandro Aristide, Hélène Morlon, A Penalized Likelihood Framework for High-Dimensional Phylogenetic Comparative Methods and an Application to New-World Monkeys Brain Evolution, Systematic Biology, Volume 68, Issue 1, January 2019, Pages 93–116, https://doi.org/10.1093/sysbio/syy045
I'm running phylopars() on a set of traits (a total of 158 samples across 117 species). All continuous traits work, but if I try to code a 3-level factor as c(0, 0.5, 1) or c(1, 2, 3), I get the following error when trying to run phylopars() on that trait:
sessionInfo: