BrianNathanWhite / OpenLong

Shares Synthetic Longitudinal Data And Code For Formatting Real Data

A system for data prep that will be easy to maintain #5

Closed bcjaeger closed 2 weeks ago

bcjaeger commented 1 month ago

Based on our previous discussion, I understand we want OpenLong to provide a function for each cohort that can produce the cleaned and harmonized version of the cohort data. This is important for users who have access to the real data, as they should be able to run the OpenLong functions and get real, cleaned data as an output. We should discuss and hopefully come to an agreement about what system would work best for managing this in OpenLong.

I have some thoughts on what could work well. The idea is to write several functions for each cohort:

  1. data_load() - this function should pull the relevant files out of the bundle of files delivered from NHLBI BioLINCC. It will need to be customized for each cohort, since each cohort's delivered data comes with a different bundle of files.
  2. data_clean() - this function should modify existing columns in the dataset to align them with our standard for harmonizing. For example, if a variable uses a value such as -999 to indicate missingness, this function would set those values to NA_real_ (or whatever type of NA matches the format of that column). This function would also recode factor levels to match the standard we set for the given variable in our harmonization. E.g., one cohort may have levels of "M" and "F"; assuming we decided to universally code these as 'male' and 'female', we would do the recoding here (see the sketch after this list).
  3. data_derive() - this function should create new columns using the cleaned versions of existing columns.
  4. data_exclude() - this function should filter out rows that do not meet inclusion criteria for our harmonized datasets. If we don't plan to have inclusion criteria, then this function isn't needed.
  5. data_rename() - this function should map a dataset's existing names to the harmonized names. It should just leverage the .xlsx file that Brian prepared. We could make this .xlsx file one of the internal datasets used by OpenLong for convenience.
  6. data_finalize() - this function should separate the processed data into a cross-sectional (baseline) dataset and a longitudinal dataset.
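
To make the cleaning step concrete, here is a minimal sketch of what a cohort-specific data_clean() step could look like. The column names, the -999 sentinel code, and the helper name are hypothetical illustrations, not settled API:

library(dplyr)

# hypothetical cleaning step for one cohort's data
clean_example_cohort <- function(data) {
  data %>%
    mutate(
      # sentinel values such as -999 become NA of the matching type
      glucose = if_else(glucose == -999, NA_real_, glucose),
      # recode factor levels to the harmonized standard
      sex = recode(sex, "M" = "male", "F" = "female")
    )
}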

The reason I feel adamant about clean distinctions between the purposes of these functions is our own convenience. When we are asked questions like, "how did you create that variable?", we know exactly where to look (in the data_derive() function). It also helps us write out a checklist when we start harmonizing a new dataset.

We don't have to actually implement this system to resolve this issue, we just have to discuss and make a plan.

bcjaeger commented 3 weeks ago

One thing I've started to think about is having a dedicated class for each study, i.e., an object of class health_abc, which would basically just be a list with elements long and base plus a few extra attributes. The upside of this is that we can just write a function to create new objects of class health_abc and then write generic functions like data_derive.health_abc(), data_rename.health_abc(), etc. This is mainly to benefit us and make it easy for each of us to find specific code. E.g., if Brian writes the code for ABC and I'm trying to see where a certain variable was derived, I know where to look: data_derive.health_abc().
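
For concreteness, here is a minimal S3 sketch of that idea. The constructor name, element types, and BMI derivation are illustrative assumptions, not settled API:

library(tibble)

# hypothetical constructor: a list with `base` and `long` plus a class
new_health_abc <- function(base = tibble(), long = tibble()) {
  structure(list(base = base, long = long), class = "health_abc")
}

# a generic with a study-specific method, so each study's derivation
# code lives in one predictable place
data_derive <- function(x, ...) UseMethod("data_derive")

data_derive.health_abc <- function(x, ...) {
  # e.g., derive BMI from cleaned weight and height columns
  x$long$bmi <- x$long$weight_kg / x$long$height_m^2
  x
}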

BrianNathanWhite commented 3 weeks ago

What you've outlined seems sensible to me.

I think systematically partitioning the data-processing pipeline this way will allow for greater transparency, quality control, and ease of use, and, as a consequence, buy-in from potential users.

bcjaeger commented 3 weeks ago

Thanks! I appreciate you reviewing. I talked with Simar today and we sketched out how an API for this process might look; the code below illustrates it. What do you think of this workflow @BrianNathanWhite?


# purrr provides map_dfr() and re-exports the magrittr pipe
library(purrr)

# initiate empty objects that will read data if it exists;
# if there is no filepath, simulated data will be used

mesa <- openlong_new(data = 'mesa', filepath = "")

abc <- openlong_new(data = 'abc', filepath = "")

# process the objects simultaneously before pooling them

list(mesa, abc) %>% 
  map_dfr(
    .f = ~ .x %>% 
      data_load() %>% 
      data_clean(bmi, glucose) %>% 
      data_derive(all_variables()) %>% 
      data_exclude(age > 60),
    .id = 'study'
  )

BrianNathanWhite commented 3 weeks ago

I think this makes sense; it's logical and reads naturally with the pipe structure.

bcjaeger commented 3 weeks ago

Fantastic =] This will give us a chance to learn about and use the new S7 object-oriented programming package, which I think will become a very prominent system for R packages in the near future.
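
For reference, a minimal, hypothetical sketch of the per-study class in S7 might look like the following; the class name, property names, and generic are assumptions carried over from the discussion above, not settled API:

library(S7)

# hypothetical S7 version of the per-study class idea
health_abc <- new_class(
  "health_abc",
  properties = list(
    base = class_data.frame,  # cross-sectional (baseline) component
    long = class_data.frame   # longitudinal component
  )
)

data_derive <- new_generic("data_derive", dispatch_args = "x")

method(data_derive, health_abc) <- function(x, ...) {
  # study-specific derivations would go here
  x
}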

bcjaeger commented 3 weeks ago

I will start putting some bones together for this.