Closed abheesht17 closed 2 years ago
So we are now saving the dataset dataframes (df_{train,dev,test}) into json files?
Yes, since we tokenize the text. If we used a dataframe, we would have to store a list in each cell, and that list would be stored as an object (i.e., a string).
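To illustrate the concern, here is a minimal sketch, assuming the dataframe were written to CSV and read back. The column name `tokens` and the sample data are made up for illustration and are not taken from the PR.

```python
import io

import pandas as pd

# Hypothetical tokenized column: each cell holds a Python list.
df = pd.DataFrame({"tokens": [["chest", "pain"], ["no", "fever"]]})
print(df["tokens"].dtype)  # pandas has no list dtype, so this is "object"

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The list does not survive: the cell comes back as its string representation.
first = restored["tokens"].iloc[0]
print(type(first), first)
```

After the round trip, each cell would have to be parsed back into a list (for example with `ast.literal_eval`), which is the extra step that saving to JSON avoids.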
Can a Pandas dataframe be saved as JSON? When I tried to save a test pandas dataframe, it failed with an error:
import json
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
# Passing the DataFrame object directly to json.dump raises the error below
with open('temp.json', 'w') as fd:
    json.dump(df2, fd)
gives
TypeError: Object of type DataFrame is not JSON serializable
I converted it to a dictionary before saving it as a json file.
# convert dataset to dictionary
train_df = train_df.to_dict(orient="list")
val_df = val_df.to_dict(orient="list")
test_df = test_df.to_dict(orient="list")
(Lines 196-199 in modules/preprocessing_pipeline.py)
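For reference, a minimal sketch of the convert-then-dump approach, assuming a small made-up dataframe. The column names, sample values, and temporary file path are illustrative only and are not the ones used in the pipeline.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical tokenized split; the real columns in the pipeline may differ.
train_df = pd.DataFrame(
    {"text": [["chest", "pain"], ["no", "fever"]], "label": [1, 0]}
)

# orient="list" gives {column name: list of values}, which json can serialize.
train_dict = train_df.to_dict(orient="list")

# Write to a temporary location and read it back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w") as fd:
    json.dump(train_dict, fd)

with open(path) as fd:
    loaded = json.load(fd)

print(loaded["text"][0])  # token lists are preserved as lists
```

pandas also provides `DataFrame.to_json`, which writes JSON directly from a dataframe. The dictionary route shown here leaves the file as a plain column-to-list mapping.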
Oh. I totally missed it. Then it should be fine. Let's get this merged!
@dalgu90, thanks! Could you please submit your approval?
[WIP]
This PR aims to accomplish two main tasks:
Optimising/Debugging Preprocessing
pd.groupby() and pd.merge() are used to accomplish this. Running time reduced from 10 minutes to 15-30 seconds.
combine_code_and_notes().
Introduce Class for Word2Vec
Note: The official CAML data preprocessing has errors, especially when it comes to preprocessing the ICD codes. Need to look into this.
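As a rough illustration of the groupby/merge idea mentioned above, the sketch below combines notes and codes per admission with vectorized pandas calls instead of a Python loop. It is not the code from this PR: the column names (`HADM_ID`, `TEXT`, `ICD9_CODE`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical inputs: one row per note and one row per assigned code.
notes = pd.DataFrame(
    {
        "HADM_ID": [1, 1, 2],
        "TEXT": ["admitted with chest pain", "discharged home", "routine follow up"],
    }
)
codes = pd.DataFrame(
    {"HADM_ID": [1, 1, 2], "ICD9_CODE": ["410.71", "401.9", "250.00"]}
)

# Concatenate all notes belonging to the same admission.
notes_per_adm = notes.groupby("HADM_ID")["TEXT"].apply(" ".join).reset_index()

# Collect all codes belonging to the same admission into one list.
codes_per_adm = codes.groupby("HADM_ID")["ICD9_CODE"].apply(list).reset_index()

# Join the two per-admission tables on the admission ID.
combined = pd.merge(notes_per_adm, codes_per_adm, on="HADM_ID", how="inner")
print(combined)
```

Grouping once and merging once avoids filtering the tables separately for every admission, which is typically where a row-by-row version spends its time.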