We have raw MEDS-format data in parquet files in some directory that look like this. The goal is to run the meds-transform processing pipeline to convert this into a JNRT (joint nested ragged tensor) holding the dynamic data, plus parquet files holding the static data.
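For reference, a minimal way to peek at one of these shards. The path below is illustrative; the core MEDS columns are subject_id (patient_id in older MEDS versions), time, code, and numeric_value.

```python
import polars as pl

# Peek at one raw MEDS shard (path is illustrative; adjust to your directory).
df = pl.read_parquet("test_data/MEDS_Cohort/data/train/0.parquet")

# One event per row: subject_id, time (null for static events), code,
# numeric_value (null for non-numeric events).
print(df.schema)
print(df.head())
```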
This bash script creates the synthetic test data in two steps:
[x] python tests/helpers/extract_test_data.py: this generates raw CSVs and converts them into MEDS format. Add a dummy text_value column here to the raw CSVs in the folder test_data/MEDS_Cohort/data (a sketch of this addition follows the two steps below).
Here's a rough illustration of the process of converting raw EHR data to MEDS format:
[x] python tests/helpers/generate_test_data_tensors.py will then convert this MEDS-format data into JNRTs.
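A minimal sketch of the dummy text_value addition from step 1. The filler string and the assumption that the CSVs sit flat in that folder are mine:

```python
from pathlib import Path
import polars as pl

# Add a dummy text_value column to every raw CSV so the custom text stages
# downstream have something to tokenize. "dummy note" is an arbitrary filler.
for csv_path in Path("test_data/MEDS_Cohort/data").glob("*.csv"):
    df = pl.read_csv(csv_path)
    if "text_value" not in df.columns:
        df = df.with_columns(pl.lit("dummy note").alias("text_value"))
        df.write_csv(csv_path)
```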
The above stages are preprocessing steps applied to the MEDS table.
Fit vocab indices defines the mapping from codes (e.g. BP, HR) to vocabulary indices.
Normalization then actually maps the codes to those indices and z-score normalizes the numeric values.
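A toy illustration of what these two stages amount to, not the meds-transform code itself. The metadata column names code/vocab_index, values/mean, and values/std follow meds-transform's code-metadata convention, but treat them as assumptions:

```python
from datetime import datetime
import polars as pl

# A toy MEDS table and the fitted code metadata (vocab index + per-code stats).
meds_df = pl.DataFrame({
    "subject_id": [1, 1],
    "time": [datetime(2020, 1, 1), datetime(2020, 1, 1)],
    "code": ["BP", "HR"],
    "numeric_value": [135.0, 90.0],
})
code_metadata = pl.DataFrame({
    "code": ["BP", "HR"],
    "code/vocab_index": [1, 2],
    "values/mean": [120.0, 80.0],
    "values/std": [15.0, 10.0],
})

normalized = (
    meds_df.join(code_metadata, on="code", how="inner")
    .with_columns(
        pl.col("code/vocab_index").alias("code"),  # code -> vocabulary index
        ((pl.col("numeric_value") - pl.col("values/mean"))
         / pl.col("values/std")).alias("numeric_value"),  # z-score
    )
    .select("subject_id", "time", "code", "numeric_value")
)
```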
Tokenization will create the static data in the directory tokenization/schemas and the dynamic data in tokenization/event_seqs.
The dynamic data are the rows whose timestamps are not null. As for the static data, the codes in the static data table stand for quantities that do not change over time (e.g. eye color).
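Continuing the sketch above, the split and the nesting look roughly like this. It is simplified relative to the real tokenization.py:

```python
import polars as pl

# Static rows (null time) go to the schema files; dynamic rows are nested
# per subject and per timestamp.
static = normalized.filter(pl.col("time").is_null())
dynamic = normalized.filter(pl.col("time").is_not_null())

event_seqs = (
    dynamic.group_by("subject_id", "time", maintain_order=True)
    .agg(pl.col("code"), pl.col("numeric_value"))  # codes per timestamp
    .group_by("subject_id", maintain_order=True)
    .agg(pl.col("time"), pl.col("code"), pl.col("numeric_value"))  # timestamps per subject
)
```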
The bolded stages with the prefix custom_ need to be implemented.
[x] for custom_normalization, modify the meds-transform normalization.py transform; it currently does not propagate the text_value column.
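In essence the change is to carry text_value through the stage's final column selection. A hedged sketch; the real edit in normalization.py will differ in detail:

```python
import polars as pl

# Same normalization as the sketch above, but text_value is kept in the final
# select instead of being dropped.
def normalize_with_text(meds_df: pl.DataFrame, code_metadata: pl.DataFrame) -> pl.DataFrame:
    return (
        meds_df.join(code_metadata, on="code", how="inner")
        .with_columns(
            pl.col("code/vocab_index").alias("code"),
            ((pl.col("numeric_value") - pl.col("values/mean"))
             / pl.col("values/std")).alias("numeric_value"),
        )
        .select("subject_id", "time", "code", "numeric_value", "text_value")  # keep text_value
    )
```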
[x] for custom_text_tokenization, use a tokenizer on the text_value column to produce a 3-level nested list of tokenized text in the text_value column. Modify the meds-transform tokenization.py.
Recall that the meds-transform tokenization.py script creates a 2-level nested list of the codes at each timestamp. Each code will have a list of text tokens associated with it (most of the time this is just an empty list). Look at this colab demo with the JNRT here to see the structure of the data.
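A sketch of the extra text level, assuming a HuggingFace tokenizer (the model name is a placeholder) and the event_seqs sketch above:

```python
import polars as pl
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

def tokenize_text(text: str | None) -> list[int]:
    # Null/missing text becomes an empty token list, keeping the ragged
    # structure aligned with the codes.
    return tokenizer.encode(text, add_special_tokens=False) if text else []

# Tokenize each measurement's text before the group_by/agg nesting, so after
# the two agg passes text_value is 3 levels deep (subject -> timestamp -> code
# -> tokens) while code stays 2 levels deep.
dynamic = dynamic.with_columns(
    pl.col("text_value").map_elements(tokenize_text, return_dtype=pl.List(pl.Int64))
)
```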
[x] for custom_text_tensorization, add support for creating the JNRT with the tokenized text data. Modify tensorize.py from meds-transform.
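What the tensorized output should look like for one subject, using the JointNestedRaggedTensorDict class from the nested_ragged_tensors package that meds-transform builds on. The import path and the to_dense call are my assumptions about that API:

```python
from nested_ragged_tensors.ragged_numpy import JointNestedRaggedTensorDict

# One subject, two timestamps. code/numeric_value are 2-level ragged tensors;
# text_value adds a third level of token ids (mostly empty lists).
jnrt = JointNestedRaggedTensorDict({
    "code": [[1, 2], [2]],
    "numeric_value": [[0.5, -1.2], [0.3]],
    "text_value": [[[101, 2023], []], [[]]],
})
print(jnrt.to_dense())  # padded dense arrays, keyed like the input dict
```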