ScottClaessens / coevolve

coevolve R package for Bayesian dynamic coevolutionary models using Stan
GNU General Public License v2.0

Inform users about missing data imputation and provide option for complete case analysis #79

Open ErikRingen opened 1 week ago

ErikRingen commented 1 week ago

Currently, missing data is imputed automatically and silently. Given that most statistical software does not do this by default (or at all), I think we should (1) show users a message saying that missing values are imputed during model fitting and (2) add an argument named something like complete.cases that is FALSE by default; if TRUE, we just perform rowwise deletion whenever there are missing values in the traits being modelled.
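As a rough sketch of the behaviour proposed above, using base R only: the helper name handle_missing and its data and variables arguments are hypothetical and not part of coevolve's API, and the message wording is only a placeholder.

```r
# Hypothetical helper, NOT part of coevolve's API: sketches the proposed
# message plus a complete-case option, using base R only.
handle_missing <- function(data, variables, complete.cases = FALSE) {
  # TRUE for rows with no NA in any of the modelled traits
  observed <- stats::complete.cases(data[, variables, drop = FALSE])
  n_incomplete <- sum(!observed)
  if (n_incomplete == 0) {
    return(data)  # nothing missing, nothing to report
  }
  if (complete.cases) {
    # Rowwise deletion of any row with a missing trait value
    message("Note: removed ", n_incomplete,
            " row(s) with missing trait values (complete.cases = TRUE).")
    return(data[observed, , drop = FALSE])
  }
  # Default: keep all rows and tell the user that imputation will happen
  message("Note: ", n_incomplete,
          " row(s) have missing trait values; these will be imputed ",
          "during model fitting. Set complete.cases = TRUE to drop them.")
  data
}

# Toy data (made up for illustration): only taxon A is fully observed
d <- data.frame(
  taxon = c("A", "B", "C", "D"),
  x = c(1.2, NA, 0.7, NA),
  y = c(0L, 1L, NA, NA)
)

kept <- handle_missing(d, c("x", "y"), complete.cases = TRUE)
nrow(kept)  # 1
```

Calling stats::complete.cases() with an explicit namespace avoids any confusion with an argument of the same name.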

ScottClaessens commented 1 week ago

I agree. I'll add a print statement similar to the ones we have for repeated measures and distance matrices.

Sorry, I've been a bit behind on dealing with GitHub issues due to the workshop. Hoping to get to these soon.

ErikRingen commented 1 week ago

Sounds good! And never be sorry for the pace on the issues; only work on them when you have the time/energy :)

ScottClaessens commented 1 day ago

Working on this today. @ErikRingen currently we remove cases when all coevolving variables are NA, and impute values only when they have at least one variable with observed data. Should we continue to do this when users set complete.cases = FALSE?

ErikRingen commented 1 day ago

Yeah I think that is a sensible implementation. For imputing values when all variables are missing, we should probably handle that in the generated quantities block (I am thinking of, for example, predicting ancestral nodes).

ScottClaessens commented 1 day ago

Okay, cool. Don't we already have predictions for ancestral nodes in the eta parameters though?

ErikRingen commented 1 day ago

On the latent scale, yes. But not for the observation model (for non-Gaussian responses).

ScottClaessens commented 1 day ago

Yep, makes sense!

ScottClaessens commented 1 day ago

> currently we remove cases when all coevolving variables are NA, and impute values only when they have at least one variable with observed data. Should we continue to do this when users set complete.cases = FALSE?

Thinking about this a bit more, I think it is cleaner to just impute all NAs, even when taxa have missing data for all coevolving variables. As a possible use case for this, users might be interested to get posterior predictions for these taxa (informed by all variables in the model), even if these taxa don't contribute to estimating the parameters of the coevolutionary process. I think the model should probably fit fine in this case. It's also cleaner to describe in the documentation ("the model imputes all missing data").
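To make the difference between the two options concrete, here is a small base R illustration. The function classify_missing and the toy data are hypothetical, and this is not how coevolve prepares data internally; it only shows which taxa each rule would keep.

```r
# Hypothetical illustration, NOT coevolve code: label each taxon by how much
# data it has for the coevolving variables.
classify_missing <- function(data, variables) {
  # is.na() on a data frame returns a logical matrix, so rowSums() counts NAs
  n_missing <- rowSums(is.na(data[, variables, drop = FALSE]))
  ifelse(n_missing == 0, "complete",
         ifelse(n_missing == length(variables), "all_missing", "partial"))
}

# Toy data (made up): taxon D has no observed data for either variable
d <- data.frame(
  taxon = c("A", "B", "C", "D"),
  x = c(1.2, NA, 0.7, NA),
  y = c(0L, 1L, NA, NA)
)

status <- classify_missing(d, c("x", "y"))

# Current rule described in this thread: drop taxa with no observed data,
# impute the partially observed ones
keep_current <- status != "all_missing"

# Rule suggested here: keep every taxon and impute all NAs
keep_proposed <- rep(TRUE, nrow(d))

sum(keep_current)   # 3
sum(keep_proposed)  # 4
```

Under the second rule, taxon D stays in the data purely so that it can receive posterior predictions.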

What do you think @ErikRingen? Sorry for going back on what I said before!