SkywardAI / paper_gallery

A gallery of papers on applying LLM capabilities to datasets
MIT License

Implement interp-net data preprocessing #7

Closed Aisuko closed 1 month ago

Aisuko commented 1 month ago

We want to implement the data preprocessing step of interp-net. The original code is inefficient, so let's process only 100 records as a demo on Kaggle related to this issue. @Micost

@wangyuweikiwi will use that notebook as an example and continue the work.

https://github.com/mlds-lab/interp-net/blob/af2dbb8a23ba3584706c079432cc00568c68fd99/src/multivariate_example.py#L92-L111

There are two files we need to handle carefully inside the load_data function. Please check the notebook I made: https://github.com/SkywardAI/mimic_automatic/blob/main/interp_net/load_data.ipynb

You can see that the adm_type_los_mortality.p file is here: https://github.com/SkywardAI/mimic_automatic/blob/main/data_extraction/adm_type_los_mortality.p

However, vitals_records.p is very large, and there is no reason to load all the data, given that we already inherit plenty of inefficient code from that project. So I split it into checkpoints of 5000 records each: https://huggingface.co/datasets/aisuko/mimic_iii_data_extraction

Note: vitals_records_1000.p and vitals_records_2000.p are smaller batches I used to test the multiprocessing code; please ignore them.
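
The checkpoint-splitting idea above can be sketched roughly like this. This is a hypothetical illustration, not the actual splitting code: the `save_checkpoints`/`load_checkpoint` helpers, the dummy record shape, and the exact `vitals_records_<count>.p` naming are assumptions based on the file names in the HF dataset.

```python
import pickle
import tempfile
from pathlib import Path

def save_checkpoints(records, batch_size, out_dir):
    """Write records to pickle files of batch_size records each.

    File names follow the vitals_records_<count>.p pattern seen in the
    dataset (an assumption about how the real split was named).
    """
    out_dir = Path(out_dir)
    paths = []
    for i in range(0, len(records), batch_size):
        path = out_dir / f"vitals_records_{i + batch_size}.p"
        with open(path, "wb") as f:
            pickle.dump(records[i:i + batch_size], f)
        paths.append(path)
    return paths

def load_checkpoint(path):
    """Load a single checkpoint instead of the whole vitals file."""
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    # Dummy records standing in for the real MIMIC-III vitals data.
    records = [{"id": i, "vitals": [i * 0.1]} for i in range(10)]
    with tempfile.TemporaryDirectory() as d:
        paths = save_checkpoints(records, batch_size=5, out_dir=d)
        first = load_checkpoint(paths[0])
        print(len(paths), len(first))  # 2 checkpoints, 5 records each
```

Loading one checkpoint at a time keeps memory bounded, which matters when the full vitals pickle is too large to hold alongside the rest of the preprocessing pipeline.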

Micost commented 1 month ago

@wangyuweikiwi

Please take a look at this notebook:

https://www.kaggle.com/micost/mimic-interp-net-data-preprocessing

I updated the loading process. Since the vitals file has already been sliced into several files, increment batch_idx to load the split files one by one.
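
A minimal sketch of what incrementing `batch_idx` could look like, assuming the slices keep the `vitals_records_<count>.p` naming from the HF dataset; the `load_batch` helper and its mapping from index to file name are illustrative guesses, not the actual notebook code.

```python
import pickle
import tempfile
from pathlib import Path

def load_batch(data_dir, batch_idx, batch_size=5000):
    """Load one slice of the split vitals data.

    Assumed mapping: batch_idx=0 -> vitals_records_5000.p,
    batch_idx=1 -> vitals_records_10000.p, and so on.
    """
    path = Path(data_dir) / f"vitals_records_{(batch_idx + 1) * batch_size}.p"
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    # Demo with tiny dummy slices instead of the real HF dataset files.
    with tempfile.TemporaryDirectory() as d:
        for i, chunk in enumerate([[0, 1], [2, 3]]):
            with open(Path(d) / f"vitals_records_{(i + 1) * 2}.p", "wb") as f:
                pickle.dump(chunk, f)
        print(load_batch(d, batch_idx=1, batch_size=2))  # [2, 3]
```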