dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset
MIT License

Optimise/debug preprocessing, add Word2Vec embedding trainer/loader #13

Closed abheesht17 closed 2 years ago

abheesht17 commented 2 years ago

[WIP]

This PR aims to accomplish two main tasks:

1. Optimising/Debugging Preprocessing
2. Introduce Class for Word2Vec

Note: The official CAML data preprocessing has errors, especially when it comes to preprocessing the ICD codes. Need to look into this.

abheesht17 commented 2 years ago

So we are now saving the dataset dataframes (df_{train,dev,test}) into json files?

Yes, since we tokenize the text. If we used a dataframe, each cell would have to hold a list of tokens. pandas keeps such a column with the object dtype, and the list ends up stored as a string.
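A minimal sketch of the concern above, assuming the alternative would be writing the dataframe to CSV (the thread does not say which format was considered). The column name and token values are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical tokenized column: each cell holds a list of tokens.
df = pd.DataFrame({"text": [["chest", "pain"], ["fever"]]})

# Write to an in-memory CSV buffer and read it back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The cell was a list before the round trip and is a plain string after it.
original_cell = df["text"][0]
restored_cell = restored["text"][0]
print(type(original_cell), type(restored_cell))
```

After the round trip the tokens would have to be parsed back out of the string, which is the extra step that saving as JSON avoids.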

dalgu90 commented 2 years ago

Can a Pandas dataframe be saved as JSON? When I tried to save a test dataframe, it failed with an error:

import json

import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Passing the DataFrame object itself to json.dump raises the TypeError below
with open('temp.json', 'w') as fd:
    json.dump(df2, fd)

gives

TypeError: Object of type DataFrame is not JSON serializable

abheesht17 commented 2 years ago

I converted the dataframe to a dictionary before saving it as a JSON file:

        # convert dataset to dictionary
        train_df = train_df.to_dict(orient="list")
        val_df = val_df.to_dict(orient="list")
        test_df = test_df.to_dict(orient="list")

(Lines 196-199 in modules/preprocessing_pipeline.py)
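As a rough, self-contained sketch of that approach (not the repository's actual code): the column names, sample values, and output path below are hypothetical, and only `to_dict(orient="list")` comes from the snippet above.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical tokenized split; the real columns in the pipeline may differ.
train_df = pd.DataFrame(
    {
        "text": [["chest", "pain"], ["fever"]],
        "labels": [["410.71"], ["780.60"]],
    }
)

# orient="list" maps each column name to a plain Python list of its values,
# which json.dump can serialize.
train_dict = train_df.to_dict(orient="list")

path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w") as fd:
    json.dump(train_dict, fd)

# Reading the file back yields the same nested lists, with no string parsing.
with open(path) as fd:
    loaded = json.load(fd)
print(loaded["text"][0])
```

pandas also provides `DataFrame.to_json`, which writes JSON directly from a dataframe; the thread does not discuss whether it would suit this pipeline.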

dalgu90 commented 2 years ago

Oh. I totally missed it. Then it should be fine. Let's get this merged!

abheesht17 commented 2 years ago

@dalgu90, thanks! Could you please submit your approval?