ericgoolsby / Rphylopars

Phylogenetic Comparative Tools for Missing Data and Within-Species Variation
30 stars 11 forks source link

BLAS/LAPACK routine 'DLASCLS' gave error code -4 #41

Open nick-youngblut opened 4 years ago

nick-youngblut commented 4 years ago

I'm running phylopars() on a set of traits (a total of 158 samples across 117 species). All continuous traits work, but if I try to code a 3-level factor as c(0, 0.5, 1) or c(1, 2, 3) I get the following error when trying to run phylopars() on that trait:

Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : BLAS/LAPACK routine 'DLASCLS' gave error code -4
Traceback:

1. phylopars(trait_data = X[, c(1, 2)], tree = host_tree)
2. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
3. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, 
 .     pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind, 
 .     ind_edge = ind_edge, parent_edges = parent_edges, pars = pars, 
 .     nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset, 
 .     un_species_subset = un_species_subset, subset_list = subset_list, 
 .     ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind, 
 .     fixed_mu = mu, ret_level = 3, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = 0, phenocov_list = list(), 
 .     is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

sessionInfo:

R version 4.0.0 (2020-04-24)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/phyloseq-physig/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] LeyLabRMisc_0.1.6 doParallel_1.0.15 iterators_1.0.12  foreach_1.5.0    
 [5] Rphylopars_0.3.0  phytools_0.7-47   maps_3.3.0        ape_5.4          
 [9] ggplot2_3.3.1     tidyr_1.1.0       dplyr_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6            subplex_1.6             mvtnorm_1.1-1          
 [4] lattice_0.20-41         phylolm_2.6             listenv_0.8.0          
 [7] gtools_3.8.2            assertthat_0.2.1        RhpcBLASctl_0.20-137   
[10] digest_0.6.25           IRdisplay_0.7.0         R6_2.4.1               
[13] repr_1.1.0              backports_1.1.7         evaluate_0.14          
[16] coda_0.19-3             pillar_1.4.4            rlang_0.4.6            
[19] uuid_0.1-4              phangorn_2.5.5          Matrix_1.2-18          
[22] combinat_0.0-8          igraph_1.2.5            munsell_0.5.0          
[25] broom_0.5.6             compiler_4.0.0          numDeriv_2016.8-1.1    
[28] Deriv_4.0               geiger_2.0.7            pkgconfig_2.0.3        
[31] base64enc_0.1-3         mnormt_1.5-7            globals_0.12.5         
[34] htmltools_0.4.0         tidyselect_1.1.0        tibble_3.0.1           
[37] expm_0.999-4            quadprog_1.5-8          codetools_0.2-16       
[40] fansi_0.4.1             future_1.17.0           crayon_1.3.4           
[43] withr_2.2.0             MASS_7.3-51.6           grid_4.0.0             
[46] nlme_3.1-148            jsonlite_1.6.1          gtable_0.3.0           
[49] lifecycle_0.2.0         magrittr_1.5            scales_1.1.1           
[52] cli_2.0.2               future.apply_1.5.0      scatterplot3d_0.3-41   
[55] doBy_4.6.6              ellipsis_0.3.1          generics_0.0.2         
[58] vctrs_0.3.1             fastmatch_1.1-0         IRkernel_1.1           
[61] deSolve_1.28            tools_4.0.0             glue_1.4.1             
[64] purrr_0.3.4             plotrix_3.7-8           colorspace_1.4-1       
[67] pbdZMQ_0.3-3            animation_2.6           clusterGeneration_1.3.4
nick-youngblut commented 4 years ago

If I include another, fully continuous trait, I get the error:

Error in data.frame(species = trait_data$species, imputed$recon_ind): arguments imply differing number of rows: 185, 0
Traceback:

1. phylopars(trait_data = trait_df, tree = host_tree)
2. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
3. data.frame(species = trait_data$species, imputed$recon_ind)
4. stop(gettextf("arguments imply differing number of rows: %s", 
 .     paste(unique(nrows), collapse = ", ")), domain = NA)

...but if I leave out the 3-level factor trait and just include the "species" column and the continuous trait column, then phylopars works correctly.

nick-youngblut commented 4 years ago

I've been playing around with this, and it seems that a lot of various inputs that include multi-level factor traits (converted to numeric) and continuous variables will generate the error:

Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not a matrix.
Traceback:

1. phylopars(trait_data = trait_df, 
 .     tree = host_tree)
2. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
3. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, 
 .     pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind, 
 .     ind_edge = ind_edge, parent_edges = parent_edges, pars = pars, 
 .     nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset, 
 .     un_species_subset = un_species_subset, subset_list = subset_list, 
 .     ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind, 
 .     fixed_mu = mu, ret_level = 3, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = 0, phenocov_list = list(), 
 .     is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

...while some multi-level traits do work. It's really unclear why some work and some generate this error.

nick-youngblut commented 3 years ago

I am trying to run phylopars() on a set of continuous traits. A preview of my input data.frame:

species           V2
s__Acetatifactor_sp900066365 s__Acetatifactor_sp900066365 0.0007380356
s__Acetatifactor_sp900066565 s__Acetatifactor_sp900066565 0.0005074156
s__Agathobacter_faecis             s__Agathobacter_faecis 0.0025116876
s__Agathobacter_rectalis         s__Agathobacter_rectalis 0.0131294618
s__Agathobacter_sp900317585   s__Agathobacter_sp900317585 0.0031340672
                                      V3           V4           V5          V6
s__Acetatifactor_sp900066365 0.001421438 0.0006370219 0.0005877497 0.001166753
s__Acetatifactor_sp900066565 0.001567913 0.0008604320 0.0008487620 0.001347186
s__Agathobacter_faecis       0.004484746 0.0025899966 0.0017205411 0.003644285
s__Agathobacter_rectalis     0.009425716 0.0141831984 0.0032517141 0.008315006
s__Agathobacter_sp900317585  0.002460970 0.0032725461 0.0008395742 0.002190721
                                       V7           V8          V9          V10
s__Acetatifactor_sp900066365 0.0005121238 0.0002571268 0.001762162 0.0007861735
s__Acetatifactor_sp900066565 0.0003313521 0.0004847163 0.001853193 0.0102120276
s__Agathobacter_faecis       0.0011752616 0.0001254030 0.005622542 0.0002754917
s__Agathobacter_rectalis     0.0043995029 0.0010882833 0.011637338 0.0042903712
s__Agathobacter_sp900317585  0.0014388840 0.0002850596 0.003105409 0.0007939048
                                      V11          V12          V13
s__Acetatifactor_sp900066365 0.0004411348 0.0005846794 0.0006182063
s__Acetatifactor_sp900066565 0.0007553928 0.0005286972 0.0050530630
s__Agathobacter_faecis       0.0182341994 0.0031211565 0.0008389246
s__Agathobacter_rectalis     0.0152277543 0.0114247042 0.0031311814
s__Agathobacter_sp900317585  0.0028675800 0.0021177798 0.0006213883
                                      V14         V15          V16          V17
s__Acetatifactor_sp900066365 0.0001938570 0.000818661 0.0001200000 0.0002485115
s__Acetatifactor_sp900066565 0.0005943459 0.003482738 0.0007528935 0.0009862291
s__Agathobacter_faecis       0.0004640492 0.002320781 0.0001594738 0.0001920727
s__Agathobacter_rectalis     0.0101299631 0.055806823 0.0043490765 0.0012795696
s__Agathobacter_sp900317585  0.0032585503 0.012220729 0.0011372366 0.0002485115
                                     V18          V19         V20
s__Acetatifactor_sp900066365 0.001548092 0.0004694376 0.005377131
s__Acetatifactor_sp900066565 0.001314906 0.0037912056 0.002098138
s__Agathobacter_faecis       0.003875061 0.0021836612 0.006754514
s__Agathobacter_rectalis     0.009067108 0.0142452002 0.008615715
s__Agathobacter_sp900317585  0.002191376 0.0034489308 0.002204897

Dims of my full data.frame: 135 97

If I provide the entire data.frame to phylopars() (default params), I get:

Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree)   # at line 120-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, 
 .     pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind, 
 .     ind_edge = ind_edge, parent_edges = parent_edges, pars = pars, 
 .     nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset, 
 .     un_species_subset = un_species_subset, subset_list = subset_list, 
 .     ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind, 
 .     fixed_mu = matrix(0), ret_level = 1, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list), 
 .     phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

If just I provide the first 3 columns of the data.frame:

Error in if (ll2 < ll) {: missing value where TRUE/FALSE needed
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:3], tree = tree, pheno_error = FALSE, 
 .     phylo_correlated = TRUE)   # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)

If just I provide the first 10 columns of the data.frame:

Error in tp(L = X_complete, R = as.matrix(as.double(dat)), Rmat = as.matrix(dat), : BLAS/LAPACK routine 'DLASCL' gave error code -4
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:10], tree = tree, pheno_error = FALSE, 
 .     phylo_correlated = TRUE)   # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
4. tp(L = X_complete, R = as.matrix(as.double(dat)), Rmat = as.matrix(dat), 
 .     mL = ncol(X), mR = 1, pheno_error = pheno_error, edge_vec = edge_vec, 
 .     edge_ind = edge_ind, ind_edge = ind_edge, parent_edges = parent_edges, 
 .     pars = pars, nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset_complete, 
 .     un_species_subset = un_species_subset_complete, subset_list = subset_list_complete, 
 .     ind_list = ind_list_complete, tip_combn = tip_combn_complete, 
 .     is_edge_ind = is_edge_ind, fixed_mu = matrix(0), ret_level = 2, 
 .     is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list), 
 .     phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

If just I provide the first 20 columns of the data.frame:

Error in if (ll2 < ll) {: missing value where TRUE/FALSE needed
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa[, 1:20], tree = tree, pheno_error = FALSE, 
 .     phylo_correlated = TRUE)   # at line 118-121 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
nick-youngblut commented 3 years ago

If I create a tree via:

tree = simtraits(ntaxa = nrow(res_taxa), ntraits = 4, nreps = 1, nmissing = 10)$tree
tree$tip.label = res_taxa$species

...I get the error:

Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree)   # at line 122-123 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, 
 .     pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind, 
 .     ind_edge = ind_edge, parent_edges = parent_edges, pars = pars, 
 .     nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset, 
 .     un_species_subset = un_species_subset, subset_list = subset_list, 
 .     ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind, 
 .     fixed_mu = matrix(0), ret_level = 1, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list), 
 .     phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

...so it appears that the problem is not my tree

nick-youngblut commented 3 years ago

If I use simtraits to simulate traits and a tree with the same dimensions of my actual data:

simtraits(ntaxa = nrow(res_taxa), ntraits = ncol(res_taxa), nreps = 1, nmissing = 10)

I get:

Error in tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, : Not compatible with requested type: [type=character; target=double].
Traceback:

1. smote_phy(tsk, rate = 2, nn = 5, tree = phy_f)
2. phylopars(trait_data = res_taxa, tree = tree)   # at line 124-125 of file <text>
3. estim_pars(edge_vec = edge_vec, do_optim = TRUE, phylocov = phylocov, 
 .     phenocov = phenocov, mu = mu)
4. tp(L = X, R = R, Rmat = as.matrix(Rmat), mL = ncol(X), mR = 1, 
 .     pheno_error = pheno_error, edge_vec = edge_vec, edge_ind = edge_ind, 
 .     ind_edge = ind_edge, parent_edges = parent_edges, pars = biased_pars, 
 .     nvar = nvar, phylocov_diag = as.integer(!phylo_correlated), 
 .     nind = nind, nob = nob, nspecies = nspecies, nedge = nedge, 
 .     anc = anc, des = des, REML = as.integer(REML), species_subset = species_subset, 
 .     un_species_subset = un_species_subset, subset_list = subset_list, 
 .     ind_list = ind_list, tip_combn = tip_combn, is_edge_ind = is_edge_ind, 
 .     fixed_mu = matrix(0), ret_level = 3, is_phylocov_fixed = as.integer(!is.na(phylocov_fixed)[[1]]), 
 .     phylocov_fixed = phylocov_fixed, is_phenocov_list = length(phenocov_list), 
 .     phenocov_list = phenocov_list, is_phenocov_fixed = as.integer(!is.na(phenocov_fixed)[[1]]), 
 .     phenocov_fixed = phenocov_fixed, OU_len = list())

If I simulate with <= 30 traits, phylopars() completes successfully.

sessionInfo:

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Anxiety_Twins_Metagenomes/envs/tidyverse-ML2/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Rphylopars_0.3.2    BBmisc_1.11         checkmate_2.0.0    
 [4] parallelMap_1.5.0   randomForest_4.6-14 mlr_2.18.0         
 [7] ParamHelpers_1.14   Boruta_7.0.0        LeyLabRMisc_0.1.8  
[10] ape_5.4-1           tidytable_0.5.8     data.table_1.13.6  
[13] ggplot2_3.3.3       tidyr_1.1.2         dplyr_1.0.4        

loaded via a namespace (and not attached):
 [1] subplex_1.6             nlme_3.1-152            matrixStats_0.58.0     
 [4] repr_1.1.3              numDeriv_2016.8-1.1     Deriv_4.1.2            
 [7] tools_4.0.3             backports_1.2.1         R6_2.5.0               
[10] colorspace_2.0-0        withr_2.4.1             tidyselect_1.1.0       
[13] mnormt_2.0.2            phangorn_2.5.5          compiler_4.0.3         
[16] expm_0.999-6            sandwich_3.0-0          labeling_0.4.2         
[19] scales_1.1.1            mvtnorm_1.1-1           quadprog_1.5-8         
[22] pbdZMQ_0.3-5            digest_0.6.27           base64enc_0.1-3        
[25] pkgconfig_2.0.3         htmltools_0.5.1.1       parallelly_1.23.0      
[28] plotrix_3.8-1           geiger_2.0.7            maps_3.3.0             
[31] rlang_0.4.10            phylolm_2.6.2           farver_2.0.3           
[34] generics_0.1.0          zoo_1.8-8               combinat_0.0-8         
[37] jsonlite_1.7.2          gtools_3.8.2            magrittr_2.0.1         
[40] modeltools_0.2-23       Matrix_1.3-2            Rcpp_1.0.6             
[43] IRkernel_1.1.1          munsell_0.5.0           lifecycle_0.2.0        
[46] scatterplot3d_0.3-41    stringi_1.5.3           multcomp_1.4-16        
[49] clusterGeneration_1.3.7 MASS_7.3-53.1           grid_4.0.3             
[52] listenv_0.8.0           parallel_4.0.3          strucchange_1.5-2      
[55] crayon_1.4.1            lattice_0.20-41         IRdisplay_1.0          
[58] splines_4.0.3           tmvnsim_1.0-2           pillar_1.4.7           
[61] igraph_1.2.6            uuid_0.1-4              party_1.3-6            
[64] future.apply_1.7.0      codetools_0.2-18        stats4_4.0.3           
[67] fastmatch_1.1-0         XML_3.99-0.5            glue_1.4.2             
[70] evaluate_0.14           doBy_4.6.8              deSolve_1.28           
[73] vctrs_0.3.6             gtable_0.3.0            purrr_0.3.4            
[76] future_1.21.0           coin_1.4-1              libcoin_1.0-8          
[79] broom_0.7.5             phytools_0.7-70         coda_0.19-4            
[82] survival_3.2-7          tibble_3.0.6            cluster_2.1.1          
[85] globals_0.14.0          TH.data_1.0-10          ellipsis_0.3.1  
ericgoolsby commented 3 years ago

I think clades with either 1) perfectly correlated or 2) non-variable traits creates a matrix inversion issue as the algorithm traverses down the tree from tips to root. I don't know why the error messages are so different based on the number of taxa and traits included, but there are a lot of optimization helper steps to improve starting parameters -- perhaps the number of species/traits/etc determines where in the code singularity occurs. I'm not surprised it fails when the number of traits approaches the number of species, but the error messages aren't helpful at all, and some of the errors you saw with lower trait dimensions were surprising. In any case, I would suspect perfect correlations and/or non-variable traits would be a common occurrence with factors. Rphylopars is currently ill-equipped to deal with that issue, but it will be handled in the major update currently under development. For issues regarding the numbe rof traits, I think a new expectation maximization algorithm will take care of that, but we'll still be limited to ntraits < nspecies, absent some algorithmic trick to avoid errors.

nick-youngblut commented 3 years ago

Thanks for the explanation! The dev branch seems to help with some of these errors. Could one batch the traits in any feasible way (e.g., randomly batch traits) so that ntraits < nspecies, or would that lead to inaccurate results?

ericgoolsby commented 3 years ago

It's a tricky issue and a classic dilemma for sure, and the best approach really depends on the nature of the dataset and what you're trying to assess. If there is only 1 observation per species, there are viable solutions (see below), but there are trade-offs no matter what you do. If you're interested in e.g. pairwise correlations only (i.e., any 2 traits in a vacuum), then it might be fair to isolate the traits. For >2 traits in a model, if there is missing data and/or multiple observations per species, then the estimated variances and covariances will differ depending on which traits you include -- a bit of a disturbing property. Usually the differences won't be that large, but very strong correlations between traits can make or break a 'significant' correlation. The reason this happens is a bit nuanced -- it seems like a bug upon first glance, but it's inherent to pretty much any mixed model or missing data framework. Because the model estimates species means from trait-trait covariance, if there's either missing data or multiple within-species observations, then the estimated correlations among traits help inform the species-level means, and therefore the choice of traits can influence the estimated correlations. Hopefully that makes some sense...

Beyond Rphylopars, there are methods that explicitly account for 'high-dimensional' traits, most often applied to geometric morphometrics studies. However, these approaches are pretty computationally demanding, so it isn't (yet) feasible to incorporate multiple observations per species (though an imputation approach is suggested for missing data in the Discussion of Clavel et al. 2020). On the other hand, if you have 1) exactly 1 observation per taxon, per trait, and 2) assume that every trait evolved under the same evolutionary model (e.g. every trait evolved under Brownian motion; or every trait has the same phylogenetic signal using Pagel's lambda; or every trait shares the same OU alpha parameter across all traits), then you can use the mvMORPH package. It fits high-dimensional phylogenetic comparative models using a penalized likelihood framework. The mvgls function is described with a worked example in section 6 of this mvMORPH vignette. You can fit a multivariate Y to any set of predictors, or an intercept-only model if you're only interested in the multivariate Y. You can also perform phylogenetic MANOVA, as well as (regularized) phylogenetic PCA on intercept-only models.

There are more details in these two publications, along with helpful R code in the online supplements to further illustrate the methods.

Julien Clavel, Hélène Morlon, Reliable Phylogenetic Regressions for Multivariate Comparative Data: Illustration with the MANOVA and Application to the Effect of Diet on Mandible Morphology in Phyllostomid Bats, Systematic Biology, Volume 69, Issue 5, September 2020, Pages 927–943, https://doi.org/10.1093/sysbio/syaa010

Julien Clavel, Leandro Aristide, Hélène Morlon, A Penalized Likelihood Framework for High-Dimensional Phylogenetic Comparative Methods and an Application to New-World Monkeys Brain Evolution, Systematic Biology, Volume 68, Issue 1, January 2019, Pages 93–116, https://doi.org/10.1093/sysbio/syy045