Open Oufattole opened 1 month ago
I think this is essential. Right now, there are more than 100 files of sample data that the tests rely on, which significantly complicates the repository and introduces brittleness. I don't think you necessarily want to generate all the data on every test, though. Rather, it would be better either to add the data generation step to the GitHub workflows for the integration test, so it happens before any tests are run, or to simplify the sample data so that it adds no time cost and/or simplify the integration tests so that fewer tests need these files.
We currently have a script that generates MEDS data and JNRTs from raw CSVs of dummy EHR data. We can integrate this into the package-scoped meds_dir pytest fixture that sets up the input data directories.
Currently, the meds_dir fixture just copies the MEDS data and JNRTs from the GitHub repo to a temporary directory. Instead, we can transform the raw dummy CSVs directly into MEDS data and JNRTs in the temporary directory, once at the beginning of each pytest call. To implement this, we would move a single call to tests/helpers/generate_test_data.sh to the beginning of this fixture.
Thoughts @mmcdermott?
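For concreteness, here is a rough sketch of what the fixture could look like. It is only an illustration under a few assumptions: that the script accepts the output directory as a single positional argument, that pytest is launched from the repository root so the relative script path resolves, and that the helper name generate_test_data is hypothetical (it is not existing code in the repo).

```python
import subprocess
from pathlib import Path

import pytest

# Script path taken from the proposal above. It is relative, so this sketch
# assumes pytest is launched from the repository root.
GENERATE_SCRIPT = Path("tests/helpers/generate_test_data.sh")


def generate_test_data(script: Path, out_dir: Path) -> Path:
    """Run the generation script once, writing its output into out_dir.

    Hypothetical helper. Passing the output directory as the only positional
    argument is an assumption about the script's interface.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True fails the fixture (and the tests using it) if the script fails.
    subprocess.run(["bash", str(script), str(out_dir)], check=True)
    return out_dir


@pytest.fixture(scope="package")
def meds_dir(tmp_path_factory) -> Path:
    """Build the MEDS data and JNRTs from the raw dummy CSVs, once per package."""
    out_dir = tmp_path_factory.mktemp("meds_data")
    return generate_test_data(GENERATE_SCRIPT, out_dir)
```

Because the fixture is package-scoped, the script runs once and every test requesting meds_dir receives the same generated directory, so tests should treat it as read-only or copy it before modifying anything.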