ericgoolsby / Rphylopars

Phylogenetic Comparative Tools for Missing Data and Within-Species Variation
28 stars 11 forks source link

Cross validation tests #51

Open liamjakehughes opened 2 years ago

liamjakehughes commented 2 years ago

Hi @ericgoolsby!

I've noticed in an older version of the package there was a function to conduct cross validation tests of a fitted phylopars object (function: 'phylopars.crossvalidate'). I was just wondering if you knew the best way of going about this now that that function is no longer available?

Cheers, Liam

ericgoolsby commented 2 years ago

Great question! Rphylopars was originally written to mimic the cross-validations in the original PhyloPars web application. I wrote it prior to adding new features that weren't in PhyloPars. I can't remember exactly what the issue was, but I think it was the ability to model within-species correlations broke the cross-validation function. It would have required a total rewrite to fix it and I was on a bit of a time crunch as a grad student, so I dropped it from the package entirely. At the time, I was more interested in the covariance matrices than the imputations, but I agree this is an important feature!

Here's a very rough example of how one might go about it, but there are several things to consider (see below).

library(Rphylopars)
tree <- rtree(n=15)
interspecific_rate <- diag(3)*0.3+0.7
intraspecific_rate <- diag(3)*0.3+0.5

simdat <- simtraits(ntraits = 4,nreps = 3,nmissing = 15,tree = tree,
          v = interspecific_rate,intraspecific = intraspecific_rate)
trait_data <- simdat$trait_data
p <- phylopars(trait_data = trait_data,tree = tree,model = "BM")

# imputed_ind = individual-level imputations
# imputed_sp = species-level imputations
imputed_ind <- imputed_sp <- trait_data
imputed_ind[,-1] <- imputed_sp[,-1] <- NA
for(i in 1:nrow(trait_data))
{
  for(j in 1:(ncol(trait_data)-1))
  {
    temp_trait_data <- trait_data
    temp_trait_data[i,j+1] <- NA
    p_temp <- phylopars(trait_data = temp_trait_data,tree = tree,model = "BM",phylocov_fixed = 
              p$pars$phylocov,phenocov_fixed = p$pars$phenocov)

    imputed_ind[i,j+1] <- p_temp$ind_recon[i,j+1]
    imputed_sp[i,j+1] <- p_temp$anc_recon[trait_data$species[i],j]
  }
}

ind_mean_abs_error <- colMeans(imputed_ind[,-1] - trait_data[,-1],na.rm=TRUE)
ind_mean_sq_error <- colMeans((imputed_ind[,-1] - trait_data[,-1])^2,na.rm=TRUE)

sp_mean_abs_error <- colMeans(imputed_sp[,-1] - trait_data[,-1],na.rm=TRUE)
sp_mean_sq_error <- colMeans((imputed_sp[,-1] - trait_data[,-1])^2,na.rm=TRUE)

cat("Traits\t\t\t\t  ",paste(colnames(trait_data)[-1],collapse = "    "),"\n",
    "Avg individual-level bias:\t",paste(round(ind_mean_abs_error,3),collapse=" "),"\n",
    "Avg individual-level err^2:\t",paste(round(ind_mean_sq_error,3),collapse=" "),"\n",
    "Avg species-level bias:\t\t",paste(round(sp_mean_abs_error,3),collapse=" "),"\n",
    "Avg species-level err^2:\t",paste(round(sp_mean_sq_error,3),collapse=" "),sep="")

# Traits                  V1    V2    V3
# Avg individual-level bias:    0.031 0.013 -0.055
# Avg individual-level err^2:   0.797 0.742 0.885
# Avg species-level bias:       0.072 0.002 -0.031
# Avg species-level err^2:  1.169 0.922 1.146

First I fit the model on the full dataset, then I looped over each individual observation and trait, holding out one a time. Using the trait covariance parameters estimated from the full model (i.e. evolutionary rate "phylocov" and within-species variance "phenocov"), I plugged in the reduced dataset and recorded the imputations. Important distinction there: I recorded the individual-level imputations, which takes into account the non-missing trait values for the individual, as well as the species-level imputations, which does not take into account anything but the estimated species mean.

Depending on why you want cross-validations, the above approach might not be 'fair' in the sense that perhaps you should hold out all traits for an individual, or you might hold out all observations for a single trait for an entire species, or you might hold out all observations for an entire species. These all give you different pieces of information, and it's important to think about what you're trying to learn about the imputations.

Another important point: I fixed the trait covariance parameters (phylocov and phenocov) using the values obtained from the full model. You might instead wish to re-estimate these matrices for each iteration of cross-validation. Similarly, if you're fitting a Pagel's lambda or OU model, you may (or may not) want to re-estimate those parameters at each iteration.

Another issue that's a bit complicated to think about -- the cross-validations themselves are non-independent due to the phylogenetic structure in the data, which could bias error estimates (see Roberts et al. 2017: https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881).

I think if I were to write a cross-validation function, all of these options would be available so the user could specify exactly what they want. If you have any thoughts on what sort of features this sort of function should have, I'd appreciate the input!