BrianNathanWhite / OpenLong

Shares Synthetic Longitudinal Data And Code For Formatting Real Data

data-raw directory #1

Open bcjaeger opened 4 months ago

bcjaeger commented 4 months ago

One of OpenLong's goals is to provide synthetic harmonized cohort data that can be loaded lazily. A first step toward that would be to set up a folder called data-raw and a separate folder named data. See https://r-pkgs.org/data.html#sec-data-data-raw for an explanation of why we want to include the data-raw directory.

Inside data-raw we should aim to create one R script for the cleaning process of each cohort. For illustration, we should have one R script that acts on a publicly available longitudinal dataset. This will give a nice template for the non-public datasets.

At the end of each R script, we should include a step that generates synthetic data based on the cleaned data. Only the synthetic data then gets saved into the data directory.
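
As a rough illustration of the workflow described above, here is a minimal sketch of what one such script could look like. It is only an assumption about the layout, not an agreed design: the file name `data-raw/chickweight.R`, the object name `chickweight_synthetic`, and the choice of the built-in `ChickWeight` dataset as the public example are all hypothetical. The synthesis step is a deliberately naive placeholder (normal draws matched to the mean and SD at each time point) and says nothing about which method OpenLong should actually use.

```r
# data-raw/chickweight.R  (hypothetical file name)
# Template for a cohort cleaning script. ChickWeight ships with R's
# datasets package and stands in here for a public longitudinal cohort.

set.seed(2024)  # make the synthetic draw reproducible

# 1. Clean: plain column types, consistent names, sorted by id and time.
raw <- datasets::ChickWeight
cleaned <- data.frame(
  id     = as.integer(as.character(raw$Chick)),
  time   = as.numeric(raw$Time),
  diet   = as.integer(as.character(raw$Diet)),
  weight = as.numeric(raw$weight)
)
cleaned <- cleaned[order(cleaned$id, cleaned$time), ]
rownames(cleaned) <- NULL

# 2. Synthesize (placeholder only, NOT a privacy guarantee):
#    keep the id/time/diet structure and replace each weight with a normal
#    draw that matches the mean and SD of the real weights at that time.
simulate_weight <- function(w) rnorm(length(w), mean = mean(w), sd = sd(w))
chickweight_synthetic <- cleaned
chickweight_synthetic$weight <- ave(
  cleaned$weight, cleaned$time, FUN = simulate_weight
)

# 3. Save only the synthetic object into data/.
dir.create("data", showWarnings = FALSE)
save(
  chickweight_synthetic,
  file = file.path("data", "chickweight_synthetic.rda"),
  compress = "bzip2"
)
```

Inside a package, `usethis::use_data(chickweight_synthetic, overwrite = TRUE)` is the usual helper for the last step, and setting `LazyData: true` in DESCRIPTION is what makes objects in data lazily loaded. Base `save()` is used here only so the sketch runs without extra packages.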

bcjaeger commented 3 months ago

@BrianNathanWhite, this is what I was thinking about in #4. If anyone wanted to store study data inside of data-raw, it would be good to have a designated folder like data-raw/sensitive that is listed in .gitignore so that git would never try to track real data files stored there.

Why bother? Because if two people were to work on the same cohort data, then keeping that cohort data in data-raw/sensitive would mean the same file path works for both people when those data are accessed. This would also be helpful if one person is reviewing a PR from another person and wants to check out the code locally.
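
For concreteness, the ignore rule suggested above could look something like the fragment below. This is a sketch under assumptions: the `.gitkeep` placeholder file is a common convention rather than anything git requires, and it is included only so the otherwise empty folder still exists after a fresh clone.

```gitignore
# Never track real study data placed in this folder
data-raw/sensitive/*

# Hypothetical placeholder so the folder itself is kept in the repository
!data-raw/sensitive/.gitkeep
```

With that in place, every script can refer to the same relative location, for example `file.path("data-raw", "sensitive", "cohort.csv")`, where `cohort.csv` is a made-up file name.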