We have raw MEDS-format data in parquet files in some directory that look like this. The goal is to run the meds-transform processing pipeline to convert this into a JNRT (joint nested ragged tensor) holding the dynamic data, plus parquet files holding the static data.
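For reference, a minimal way to peek at one of these shards. The path below is illustrative; the core MEDS columns are subject_id (patient_id in older MEDS versions), time, code, and numeric_value.

```python
import polars as pl

# Peek at one raw MEDS shard (path is illustrative; adjust to your directory).
df = pl.read_parquet("test_data/MEDS_Cohort/data/train/0.parquet")

# One event per row: subject_id, time (null for static events), code,
# numeric_value (null for non-numeric events).
print(df.schema)
print(df.head())
```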
This bash script creates the synthetic test data in two steps:
[x] python tests/helpers/extract_test_data.py: this generates raw CSVs and converts them into MEDS format. Add a dummy text_value column here to the raw CSVs in the folder test_data/MEDS_Cohort/data (a sketch of this addition follows the two steps below).
Here's a rough illustration of the process of converting raw EHR data to MEDS format:
[x] python tests/helpers/generate_test_data_tensors.py will then convert this MEDS-format data into JNRTs.
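A minimal sketch of the dummy text_value addition from step 1. The filler string and the assumption that the CSVs sit flat in that folder are mine:

```python
from pathlib import Path
import polars as pl

# Add a dummy text_value column to every raw CSV so the custom text stages
# downstream have something to tokenize. "dummy note" is an arbitrary filler.
for csv_path in Path("test_data/MEDS_Cohort/data").glob("*.csv"):
    df = pl.read_csv(csv_path)
    if "text_value" not in df.columns:
        df = df.with_columns(pl.lit("dummy note").alias("text_value"))
        df.write_csv(csv_path)
```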
The above stages are preprocessing steps applied to the MEDS table.
Fit vocab indices defines the mapping from codes (e.g. BP, HR) to vocabulary indices.
Normalization then actually maps the codes to those indices and z-score normalizes the numeric values.
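A toy illustration of what these two stages amount to, not the meds-transform code itself. The metadata column names code/vocab_index, values/mean, and values/std follow meds-transform's code-metadata convention, but treat them as assumptions:

```python
from datetime import datetime
import polars as pl

# A toy MEDS table and the fitted code metadata (vocab index + per-code stats).
meds_df = pl.DataFrame({
    "subject_id": [1, 1],
    "time": [datetime(2020, 1, 1), datetime(2020, 1, 1)],
    "code": ["BP", "HR"],
    "numeric_value": [135.0, 90.0],
})
code_metadata = pl.DataFrame({
    "code": ["BP", "HR"],
    "code/vocab_index": [1, 2],
    "values/mean": [120.0, 80.0],
    "values/std": [15.0, 10.0],
})

normalized = (
    meds_df.join(code_metadata, on="code", how="inner")
    .with_columns(
        pl.col("code/vocab_index").alias("code"),  # code -> vocabulary index
        ((pl.col("numeric_value") - pl.col("values/mean"))
         / pl.col("values/std")).alias("numeric_value"),  # z-score
    )
    .select("subject_id", "time", "code", "numeric_value")
)
```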
Tokenization will create the static data in the directory tokenization/schemas and the dynamic data in tokenization/event_seqs.
The dynamic data are the rows whose timestamps are not null. As for the static data, the codes in the static data table stand for quantities that do not change over time (e.g. eye color).
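Continuing the sketch above, the split and the nesting look roughly like this. It is simplified relative to the real tokenization.py:

```python
import polars as pl

# Static rows (null time) go to the schema files; dynamic rows are nested
# per subject and per timestamp.
static = normalized.filter(pl.col("time").is_null())
dynamic = normalized.filter(pl.col("time").is_not_null())

event_seqs = (
    dynamic.group_by("subject_id", "time", maintain_order=True)
    .agg(pl.col("code"), pl.col("numeric_value"))  # codes per timestamp
    .group_by("subject_id", maintain_order=True)
    .agg(pl.col("time"), pl.col("code"), pl.col("numeric_value"))  # timestamps per subject
)
```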
The bolded stages with the prefix custom_ need to be implemented.
[x] for custom_normalization, modify the meds-transform normalization.py transform; it currently does not propagate the text_value column.
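In essence the change is to carry text_value through the stage's final column selection. A hedged sketch; the real edit in normalization.py will differ in detail:

```python
import polars as pl

# Same normalization as the sketch above, but text_value is kept in the final
# select instead of being dropped.
def normalize_with_text(meds_df: pl.DataFrame, code_metadata: pl.DataFrame) -> pl.DataFrame:
    return (
        meds_df.join(code_metadata, on="code", how="inner")
        .with_columns(
            pl.col("code/vocab_index").alias("code"),
            ((pl.col("numeric_value") - pl.col("values/mean"))
             / pl.col("values/std")).alias("numeric_value"),
        )
        .select("subject_id", "time", "code", "numeric_value", "text_value")  # keep text_value
    )
```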
[x] for custom_text_tokenization, use a tokenizer on the text_value column to produce a 3-level nested list of tokenized text in the text_value column. Modify the meds-transform tokenization.py.
Recall that the meds-transform tokenization.py script creates a 2-level nested list of the codes at each timestamp. Each code will have a list of text tokens associated with it (most of the time this is just an empty list). Look at this colab demo with the JNRT here to see the structure of the data.
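A sketch of the extra text level, assuming a HuggingFace tokenizer (the model name is a placeholder) and the event_seqs sketch above:

```python
import polars as pl
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

def tokenize_text(text: str | None) -> list[int]:
    # Null/missing text becomes an empty token list, keeping the ragged
    # structure aligned with the codes.
    return tokenizer.encode(text, add_special_tokens=False) if text else []

# Tokenize each measurement's text before the group_by/agg nesting, so after
# the two agg passes text_value is 3 levels deep (subject -> timestamp -> code
# -> tokens) while code stays 2 levels deep.
dynamic = dynamic.with_columns(
    pl.col("text_value").map_elements(tokenize_text, return_dtype=pl.List(pl.Int64))
)
```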
[x] for custom_text_tensorization, add support for creating the JNRT with the tokenized text data. Modify tensorize.py from meds-transform.
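What the tensorized output should look like for one subject, using the JointNestedRaggedTensorDict class from the nested_ragged_tensors package that meds-transform builds on. The import path and the to_dense call are my assumptions about that API:

```python
from nested_ragged_tensors.ragged_numpy import JointNestedRaggedTensorDict

# One subject, two timestamps. code/numeric_value are 2-level ragged tensors;
# text_value adds a third level of token ids (mostly empty lists).
jnrt = JointNestedRaggedTensorDict({
    "code": [[1, 2], [2]],
    "numeric_value": [[0.5, -1.2], [0.3]],
    "text_value": [[[101, 2023], []], [[]]],
})
print(jnrt.to_dense())  # padded dense arrays, keyed like the input dict
```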