Open bcjaeger opened 4 months ago
@BrianNathanWhite, this is what I was thinking about in #4. If anyone wanted to store study data inside of `data-raw`, it would be good to have a designated folder like `data-raw/sensitive` that is listed in `.gitignore` so that `git` never tries to track real data files stored there.
Why bother? Because if two people were to work on the same cohort data, then keeping that cohort data in `data-raw/sensitive` would mean the same file path works for both people when those data are accessed. This would also be helpful if one person is reviewing a PR from another person and wants to check out the code locally.
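A minimal sketch of that setup in base R, assuming it is run from the repository root. The helper name `ignore_sensitive_data()` is hypothetical, not something that exists in OpenLong; `usethis::use_git_ignore()` offers a similar convenience if usethis is already in use.

```r
# Hypothetical helper: create data-raw/sensitive and make sure git ignores it.
ignore_sensitive_data <- function(root = ".") {
  sensitive_dir <- file.path(root, "data-raw", "sensitive")
  dir.create(sensitive_dir, recursive = TRUE, showWarnings = FALSE)

  ignore_file <- file.path(root, ".gitignore")
  ignore_rule <- "data-raw/sensitive/"

  # Read the current rules, if a .gitignore already exists
  existing <- if (file.exists(ignore_file)) {
    readLines(ignore_file, warn = FALSE)
  } else {
    character(0)
  }

  # Append the rule only if it is not already present
  if (!ignore_rule %in% existing) {
    writeLines(c(existing, ignore_rule), ignore_file)
  }

  invisible(sensitive_dir)
}

ignore_sensitive_data()
```

Since the folder is ignored, each collaborator copies the real files into it by hand; the code can then refer to the same relative path on every machine.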
One of OpenLong's goals is to provide synthetic harmonized cohort data that can be loaded lazily. A first step to getting to that point would be to set up a folder called `data-raw` and a separate folder named `data`. See https://r-pkgs.org/data.html#sec-data-data-raw for an explanation of why we want to include the `data-raw` directory.

Inside `data-raw` we should aim to create one R script for the cleaning process of each cohort. For illustration, we should have one R script that acts on a publicly available longitudinal dataset. This will give a nice template for the non-public datasets.

At the end of each R script, we should include a step that generates synthetic data based on the cleaned data. Then, the synthetic data will be what gets saved into the `data` directory.
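As a rough illustration of what such a template script could look like, here is a sketch that uses the public `ChickWeight` dataset shipped with R. The file name, the object name `chickweight_synthetic`, and the synthesis step are all assumptions: simulating from a simple linear model is only a naive stand-in for whatever synthesis method the project settles on.

```r
# data-raw/chickweight.R (hypothetical file name)
# Template: clean a public longitudinal dataset, build a synthetic copy,
# and save only the synthetic copy into data/.

# 1. Load the raw data. A non-public cohort would instead be read from
#    data-raw/sensitive/, which git ignores.
raw <- datasets::ChickWeight

# 2. Clean: plain column names and simple types
clean <- data.frame(
  id     = as.integer(as.character(raw$Chick)),
  time   = raw$Time,
  weight = raw$weight,
  diet   = factor(as.character(raw$Diet))
)

# 3. Synthesize (naive placeholder, an assumption for illustration):
#    fit a linear model and replace observed weights with simulated ones.
set.seed(1)
fit <- lm(weight ~ time * diet, data = clean)
synthetic <- clean
synthetic$weight <- fitted(fit) + rnorm(nrow(clean), sd = sigma(fit))

# 4. Save only the synthetic data. Inside a package,
#    usethis::use_data(chickweight_synthetic, overwrite = TRUE)
#    would typically handle this step instead.
chickweight_synthetic <- synthetic
dir.create("data", showWarnings = FALSE)
save(
  chickweight_synthetic,
  file = file.path("data", "chickweight_synthetic.rda"),
  compress = "bzip2"
)
```

The cleaned real data never leaves the script's environment; only the simulated copy is written to `data`.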