Optimise/debug preprocessing, add Word2Vec embedding trainer/loader

dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset

MIT License

Optimise/debug preprocessing, add Word2Vec embedding trainer/loader #13

Closed abheesht17 closed 2 years ago

abheesht17 commented 2 years ago

[WIP]

This PR aims to accomplish two main tasks:

Optimising/Debugging Preprocessing

[x] Optimise combining the NOTEEVENTS and CODE csv files: used DataFrame.groupby() and pd.merge() to accomplish this. Running time reduced from 10 minutes to 15-30 seconds.
[x] Debug combine_code_and_notes().
[x] Debug top-k code filtering.
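
For reference, the groupby-then-merge approach and the top-k filter can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the column names (HADM_ID, TEXT, ICD9_CODE) and both function names are assumptions.

```python
from collections import Counter

import pandas as pd


def combine_notes_and_codes(notes_df, codes_df):
    """Build one row per admission holding its joined notes and its code list."""
    # Concatenate all notes that belong to the same admission.
    notes = notes_df.groupby("HADM_ID")["TEXT"].agg(" ".join).reset_index()
    # Collect every code assigned to the same admission into a list.
    codes = codes_df.groupby("HADM_ID")["ICD9_CODE"].agg(list).reset_index()
    # Inner join keeps only admissions that have both notes and codes.
    return pd.merge(notes, codes, on="HADM_ID", how="inner")


def keep_top_k_codes(df, k):
    """Keep only the k most frequent codes and drop rows left without any."""
    counts = Counter(code for codes in df["ICD9_CODE"] for code in codes)
    top_k = {code for code, _ in counts.most_common(k)}
    filtered = df.copy()
    filtered["ICD9_CODE"] = filtered["ICD9_CODE"].apply(
        lambda codes: [c for c in codes if c in top_k]
    )
    return filtered[filtered["ICD9_CODE"].map(len) > 0].reset_index(drop=True)
```

Grouping once per table and merging on the admission ID avoids looping over rows in Python, which is presumably where the speed-up reported above comes from.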

Introduce Class for Word2Vec

[x] Add a Word2Vec embedding method: This should be very general, with functionality for both training and loading Word2Vec embeddings, since many automatic ICD coding approaches use Word2Vec for the embedding layer.

Note: The official CAML data preprocessing has errors, especially when it comes to preprocessing the ICD codes. Need to look into this.

abheesht17 commented 2 years ago

So we are now saving the dataset dataframes (df_{train,dev,test}) into json files?

Yes, since we tokenize the text. If we kept a dataframe, each cell would have to hold a list of tokens, which pandas stores as an object and which would end up saved as a string.
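
The point about lists ending up as strings can be seen in a small sketch. It assumes the alternative would be writing the dataframe to CSV, which this thread does not state explicitly; the column name and sample tokens are made up.

```python
import io

import pandas as pd

# A column of token lists has dtype `object` in pandas.
df = pd.DataFrame({"text": [["chest", "pain"], ["no", "fever"]]})
column_dtype = df["text"].dtype

# Round-trip through CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The list comes back as its string representation, not as a list.
first_cell = restored.loc[0, "text"]
print(column_dtype, type(first_cell))
```

After the round trip each cell would need to be parsed back into a list, which JSON avoids because it has a native array type.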

dalgu90 commented 2 years ago

Can a pandas DataFrame be saved as JSON? When I tried to save a test DataFrame, it failed with an error:

import json

import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# json.dump() cannot serialize a DataFrame object directly
with open('temp.json', 'w') as fd:
    json.dump(df2, fd)

gives

TypeError: Object of type DataFrame is not JSON serializable

abheesht17 commented 2 years ago

I converted it to a dictionary before saving it as a json file.

        # convert dataset to dictionary
        train_df = train_df.to_dict(orient="list")
        val_df = val_df.to_dict(orient="list")
        test_df = test_df.to_dict(orient="list")

(Lines 196-199 in modules/preprocessing_pipeline.py)
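
As a small end-to-end illustration of that conversion, the sketch below turns a dataframe of token lists into a dictionary, serializes it, and reads it back. The column names and sample values are hypothetical and not taken from the repository.

```python
import json

import pandas as pd

# Hypothetical preprocessed split: tokenized text plus a list of codes per row.
df = pd.DataFrame(
    {
        "tokens": [["chest", "pain"], ["no", "fever"]],
        "labels": [["401.9"], ["038.9", "584.9"]],
    }
)

# orient="list" maps each column name to a plain Python list of its values.
payload = df.to_dict(orient="list")

# The dictionary holds only lists and strings, so json can serialize it.
serialized = json.dumps(payload)
restored = json.loads(serialized)
```

This works because every cell is a JSON-native type. A column of pd.Timestamp values, like "B" in the test dataframe above, would still raise a TypeError after to_dict() unless it is converted to strings first.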

dalgu90 commented 2 years ago

Oh. I totally missed it. Then it should be fine. Let's get this merged!

abheesht17 commented 2 years ago

@dalgu90, thanks! Could you please submit your approval?