tarensanders commented 1 year ago

Just a quick thought of something we should double check: each row in the dataframe is a day, but a bunch of the variables that are being imputed (age, gender, BMI, etc) don't vary by day. The imputation doesn't know that though.

Here's a quick check:

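library(targets)
library(dplyr)

# Keep one row per distinct combination of participant-level values.
# Any participant with more than one row has inconsistent imputed values.
tar_load(data_imp)
as_tibble(mice::complete(data_imp, 1)) %>%
  select(studyid, filename, sex, age, weight, height, ses, country) %>%
  distinct() %>%
  group_by(studyid, filename) %>%
  filter(n() > 1)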

Returns:

# A tibble: 4,128 × 8
# Groups:   studyid, filename [506]
   studyid filename sex      age weight height ses    country       
     <dbl> <chr>    <fct>  <dbl>  <dbl>  <dbl> <fct>  <fct>         
 1     100 10123    Male    10.5   66.7   148. High   Portugal      
 2     100 10123    Male    10.5   66.7   148. Medium Portugal      
 3     100 10123    Male    10.5   66.7   148. Low    Portugal      
 4     100 10127    Female  10.2   39.8   134. High   Portugal      
 5     100 10127    Female  10.2   39.8   134. Medium Portugal      
 6     100 10127    Female  10.2   39.8   134. Low    Portugal      
 7     100 10169    Female  10.9   37.3   159. Medium United Kingdom
 8     100 10169    Female  10.9   37     159. Medium United Kingdom
 9     100 10169    Female  10.9   70     159. Medium United Kingdom
10     100 10169    Female  10.9   51.8   159. Medium United Kingdom

Note how wildly weight varies for 10169.

conig commented 1 year ago

Is this happening because they have missing values, e.g., in weight, for some days and not for others? One solution might be to replace NAs with the within-participant mean in these cases. Then hopefully the imputation procedure won't have to touch these variables?

tarensanders commented 1 year ago

No, the participant data gets matched to the accelerometer data when I create the dataset, so you either have it for all rows or for none.

conig commented 1 year ago

Hmmm. In that case, it might be better to do post-processing and just average all the ages generated for each participant. Or maybe having the variance in these vars is appropriate as it speaks to uncertainty? What do you think?

conig commented 1 year ago

Another idea: two steps. Do imputation on a dataset with summaries per participant. Get those values and put back into the pre-imputation dataset. Then do imputation as normal.
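
A rough sketch of this two-step idea on simulated data. The data frame `days`, its columns (`id`, `age`, `weight`, `sleep`), and the values of `m` and `maxit` are illustrative assumptions, not the project's actual data or settings. Step 1 draws a single participant-level imputation to keep the example short.

```r
library(dplyr)
library(mice)

set.seed(66)

# Simulated stand-in for the real data: 30 participants x 5 days each.
person <- data.frame(
  id = 1:30,
  age = runif(30, 9, 12),
  weight = rnorm(30, 45, 8)
)
person$weight[sample(30, 8)] <- NA        # missing for the whole participant

days <- person[rep(1:30, each = 5), ]
rownames(days) <- NULL
days$sleep <- rnorm(nrow(days), 8, 1)
days$sleep[sample(nrow(days), 20)] <- NA  # missing on individual days

# Step 1: impute time-invariant variables on one row per participant.
person_level <- distinct(days, id, age, weight)
pred_person <- make.predictorMatrix(person_level)
pred_person[, "id"] <- 0                  # do not use the ID as a predictor
imp_person <- mice(person_level, m = 1, maxit = 5,
                   predictorMatrix = pred_person, printFlag = FALSE)

# Step 2: put the imputed participant-level values back into the daily data.
days_filled <- days %>%
  select(-age, -weight) %>%
  left_join(mice::complete(imp_person, 1), by = "id")

# Step 3: impute the remaining per-day variables as normal.
pred_days <- make.predictorMatrix(days_filled)
pred_days[, "id"] <- 0
imp_days <- mice(days_filled, m = 5, maxit = 5,
                 predictorMatrix = pred_days, printFlag = FALSE)
```

After step 2, every participant has exactly one value for `weight`, so the second imputation only has `sleep` left to fill in.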

mnoetel commented 1 year ago

These ideas make sense. I think taking the mode (categorical) or median (continuous) of the imputed values for the time-invariant variables makes sense:

group by participant ID

summarise

            ## categorical variables = mode
            ## continuous variables = median

data %>% select(-imputed_variables) %>% left_join(summary)
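
The outline above could be sketched roughly as follows for a single completed dataset. The toy tibble `completed`, the helper `stat_mode()`, and the column `sleep_hours` are hypothetical names for illustration. The `ses` and `weight` values loosely mirror the example output earlier in the thread.

```r
library(dplyr)

# Hypothetical helper: most frequent value (ties go to the first one seen).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy stand-in for one completed (imputed) dataset, one row per day.
completed <- tibble(
  studyid = 100,
  filename = c(rep("10123", 3), rep("10169", 4)),
  ses = factor(
    c("High", "Medium", "Medium", "Medium", "Medium", "Medium", "Medium"),
    levels = c("Low", "Medium", "High")
  ),
  weight = c(66.7, 66.7, 66.7, 37.3, 37, 70, 51.8),
  sleep_hours = c(8.1, 7.5, 9.0, 8.4, 7.9, 8.8, 9.2)
)

# One row per participant: mode for categorical, median for continuous.
summary_tbl <- completed %>%
  group_by(studyid, filename) %>%
  summarise(
    ses = stat_mode(ses),
    weight = median(weight),
    .groups = "drop"
  )

# Drop the day-level copies and join the participant-level summaries back.
fixed <- completed %>%
  select(-ses, -weight) %>%
  left_join(summary_tbl, by = c("studyid", "filename"))
```

With multiple imputations, the same steps would presumably need to be applied to each completed dataset separately before pooling.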

(Sent by email in reply to James Conigrave's comment of Thursday, 23 March 2023 at 2:07 pm.)


tarensanders commented 1 year ago

> Do imputation on a dataset with summaries per participant. Get those values and put back into the pre-imputation dataset. Then do imputation as normal.

The problem with that is that you only do the pooled analysis on the multiple datasets generated from the last imputation (on the per-day variables), but you don't get the variance in the 'fixed' variables. But, maybe it doesn't matter, since these aren't really the variables we care about?

The other option looks like it would be to specify this as a hierarchical dataset (observations nested within participants) and impute such that the level 2 variables are the same for each level 1 observation. Mice seems to support this.
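
A minimal sketch of what the hierarchical approach might look like, using the level-2 method `2lonly.pmm` from mice on simulated data. The data frame `days` and its columns are illustrative assumptions. In the predictor matrix, `-2` marks the cluster (participant) variable. The cluster variable is kept as an integer here.

```r
library(mice)

set.seed(66)

# Simulated stand-in for the real data: 30 participants x 5 days each.
person <- data.frame(
  id = 1:30,
  age = runif(30, 9, 12),
  weight = rnorm(30, 45, 8)
)
person$weight[sample(30, 8)] <- NA        # level-2 variable, missing per participant

days <- person[rep(1:30, each = 5), ]
rownames(days) <- NULL
days$sleep <- rnorm(nrow(days), 8, 1)
days$sleep[sample(nrow(days), 20)] <- NA  # level-1 variable, missing per day

# Use a level-2 method for weight so that each participant gets one value.
meth <- make.method(days)
meth["weight"] <- "2lonly.pmm"

pred <- make.predictorMatrix(days)
pred["weight", "id"] <- -2                # -2 flags the cluster variable
pred["sleep", "id"] <- 0                  # do not use the ID as a predictor

imp <- mice(days, method = meth, predictorMatrix = pred,
            m = 5, maxit = 5, printFlag = FALSE)

# Check: imputed weight should be constant within each participant.
comp <- complete(imp, 1)
n_values <- tapply(comp$weight, comp$id, function(x) length(unique(x)))
all(n_values == 1)
```

Because the level-2 values are drawn once per participant in each of the `m` imputations, the between-imputation variance in the "fixed" variables is kept for pooling.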

conig commented 1 year ago

> > Do imputation on a dataset with summaries per participant. Get those values and put back into the pre-imputation dataset. Then do imputation as normal.
>
> The problem with that is that you only do the pooled analysis on the multiple datasets generated from the last imputation (on the per-day variables), but you don't get the variance in the 'fixed' variables. But, maybe it doesn't matter, since these aren't really the variables we care about?
>
> The other option looks like it would be to specify this as a hierarchical dataset (observations nested within participants) and impute such that the level 2 variables are the same for each level 1 observation. Mice seems to support this.

This is the way. I will start a branch and have a go at this.

Motivation-and-Behaviour / sleepIPD_analysis

Imputation on the time-invariant variables #66