Closed bcjaeger closed 2 weeks ago
One thing I've started to think about is having a dedicated class for each study, i.e., an object of class `health_abc`, which will basically just be a list with elements `long` and `base` and a few extra attributes. The upside of this is that we can just write a function to create new objects of class `health_abc` and then write generic functions like `data_derive.health_abc()`, `data_rename.health_abc()`, etc. This is mainly to benefit us and make it easy for each of us to find specific code. E.g., if Brian writes the code for ABC and I'm trying to see where a certain variable was derived, I know where to look: `data_derive.health_abc()`.
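A minimal sketch of that idea, using S3 for illustration; the constructor name `new_health_abc()` and the `bmi` derivation are hypothetical, not part of the package:

```r
# hypothetical constructor: a health_abc object is just a list with
# elements `long` and `base`, tagged with a class attribute
new_health_abc <- function(long, base) {
  structure(list(long = long, base = base), class = "health_abc")
}

# a generic, plus a study-specific method where all ABC derivations live
data_derive <- function(x, ...) UseMethod("data_derive")

data_derive.health_abc <- function(x, ...) {
  # example derivation: BMI from weight and height in the long data
  x$long$bmi <- with(x$long, weight_kg / height_m^2)
  x
}

abc <- new_health_abc(
  long = data.frame(id = 1, weight_kg = 70, height_m = 1.75),
  base = data.frame(id = 1)
)
data_derive(abc)
```

The payoff is exactly the one described above: anyone asking how an ABC variable was created knows to look in `data_derive.health_abc()`.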
What you've outlined seems sensible to me.
I think systematically partitioning the data processing pipeline this way will allow for greater transparency, quality control, and ease of use, and, as a consequence, buy-in from potential users.
Thanks! I appreciate you reviewing. I talked with Simar today and we sketched how an API might look for this process. I'm dropping some code below that illustrates a potential API. What do you think of this workflow @BrianNathanWhite?
```r
# initiate empty objects that will read data if it exists;
# if there is no filepath, simulated data will be used
mesa <- openlong_new(data = 'mesa', filepath = "")
abc  <- openlong_new(data = 'abc', filepath = "")

# process the objects simultaneously before pooling them
list(mesa, abc) %>%
  map_dfr(
    .f = ~ .x %>%
      data_load() %>%
      data_clean(bmi, glucose) %>%
      data_derive(all_variables()) %>%
      data_exclude(age > 60),
    .id = 'study'
  )
```
I think this makes sense; logical and reads naturally with the pipe structure.
Fantastic =] this will give us a chance to learn about and use the new S7 object-oriented programming package. I think this will be a very prominent system for R packages in the near future.
I will start putting some bones together for this.
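For reference, here is a rough sketch of how the per-study class could look in S7, assuming the package is installed; the property names and the `bmi` derivation are placeholders, not settled API:

```r
library(S7)

# a study object holds the longitudinal and baseline data frames
health_abc <- new_class(
  "health_abc",
  properties = list(
    long = class_data.frame,  # repeated-measures data
    base = class_data.frame   # baseline data
  )
)

# generics dispatch on the study class, so each study gets its own method
data_derive <- new_generic("data_derive", dispatch_args = "x")

method(data_derive, health_abc) <- function(x) {
  # example derivation only; real derivations would live here
  x@long$bmi <- x@long$weight_kg / x@long$height_m^2
  x
}

abc <- health_abc(
  long = data.frame(weight_kg = 70, height_m = 1.75),
  base = data.frame()
)
data_derive(abc)
```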
Based on our previous discussion, I understand we want `OpenLong` to provide a function for each cohort that can produce the cleaned and harmonized version of the cohort data. This is important for users who have access to the real data, as they should be able to run the `OpenLong` functions and get real, cleaned data as an output. We should discuss and hopefully come to an agreement about what system would work best for managing this in `OpenLong`.

I have some thoughts on what could work well. The idea is to write several functions for each cohort:
- `data_load()` - this function should pull the relevant files out of the package of files delivered from NHLBI BioLINCC. It will need to be customized for each cohort, since each cohort comes with a different bundle of files in the delivered data.
- `data_clean()` - this function should modify existing columns in the dataset to align them with our standard for harmonizing. For example, if a variable initially uses a value such as -999 to indicate missingness, this function would set those to `NA_real_` or whatever type of `NA` matches the format of that column. Another action taken by this function would be recoding factor levels to match the standard we set for the given variable in our harmonization. E.g., one cohort may have levels of "M" and "F"; assuming we decided to universally code these as 'male' and 'female', we would do the recoding in this function.
- `data_derive()` - this function should create new columns using the cleaned versions of existing columns.
- `data_exclude()` - this function should filter out rows that do not meet inclusion criteria for our harmonized datasets. If we don't plan to have inclusion criteria, then this function isn't needed.
- `data_rename()` - this function should map a dataset's existing names to the harmonized names. It should just leverage the .xlsx file that Brian prepared. We could make this .xlsx file one of the internal datasets used by `OpenLong` for convenience.
- `data_finalize()` - this function should separate a cross-sectional and longitudinal dataset given a

The reason I feel adamant about clean distinctions between the purpose of each function is for our own convenience. When we are asked questions like, "how did you create that variable?", we know exactly where to look (in the `data_derive()` function). It also helps us write out a checklist when we start harmonizing a new dataset.

We don't have to actually implement this system to resolve this issue; we just have to discuss and make a plan.
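To make the `data_clean()` step concrete, here is a hedged sketch of the two actions described above (sentinel missing codes and factor-level recoding); the column names and the -999 code are illustrative assumptions, not part of any real cohort spec:

```r
# hypothetical data_clean() for one cohort: replace sentinel missing
# codes and recode sex levels to the harmonized standard
data_clean <- function(data) {
  # replace -999 with NA in every numeric column
  data[] <- lapply(data, function(col) {
    if (is.numeric(col)) col[col %in% -999] <- NA_real_
    col
  })
  # recode "M"/"F" to the harmonized 'male'/'female' levels
  if (!is.null(data$sex)) {
    data$sex <- unname(c(M = "male", F = "female")[data$sex])
  }
  data
}

df <- data.frame(
  glucose = c(90, -999),
  sex = c("M", "F"),
  stringsAsFactors = FALSE
)
data_clean(df)
```

Each cohort would get its own version of this function, which is what makes the "where do I look?" question easy to answer.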