burstein-lab / genomic-nlp-server

Encode columns in order to save space and load faster #319

Closed notofir closed 10 months ago

notofir commented 10 months ago

To see how much memory each column uses (in MiB), run memory_usage = df.memory_usage(deep=True) / 1024 / 1024
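As a rough sketch of that measurement, the per-column footprint can be compared before and after a categorical conversion. The toy frame and its values below are made up for illustration, and the helper name memory_mib is hypothetical; only the column names come from this issue.

```python
import pandas as pd

# Hypothetical toy data: only the column names "color" and "predicted_class"
# come from this issue, the values are invented.
df = pd.DataFrame({
    "color": ["#ff0000", "#00ff00", "#0000ff", "#ff0000"] * 2500,
    "predicted_class": ["transporter", "kinase"] * 5000,
})


def memory_mib(frame: pd.DataFrame) -> pd.Series:
    """Per-column memory usage in MiB, including string payloads (deep=True)."""
    return frame.memory_usage(deep=True) / 1024 / 1024


before = memory_mib(df)

# Encode the repetitive string columns as categoricals.
encoded = df.astype({"color": "category", "predicted_class": "category"})
after = memory_mib(encoded)

# Side-by-side report, one row per column (plus the index).
report = pd.DataFrame({"before_mib": before, "after_mib": after})
print(report)
```

The saving depends on how repetitive the column is: a column with few distinct values shrinks a lot, while a column of mostly unique strings may not shrink at all.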

notofir commented 10 months ago

Suggested columns by @dudubur: ["prediction_summary", "color", "predicted_class"]

notofir commented 10 months ago

Resolved by:

import pickle

df = df.drop(columns=['label'])
df["color"] = df["color"].astype("category")
df["predicted_class"] = df["predicted_class"].astype("category")

# Map every distinct prediction_summary key to a small integer (sorted so the encoding is deterministic).
unique_strings = sorted({key for row in df['prediction_summary'] for key in row})
string_to_int_mapping = {string: idx for idx, string in enumerate(unique_strings)}
df['prediction_summary'] = df['prediction_summary'].apply(lambda d: {string_to_int_mapping[key]: value for key, value in d.items()})
# Persist the reverse mapping so the integer keys can be decoded later.
int_to_string_mapping = {idx: string for string, idx in string_to_int_mapping.items()}
with open('prediction_summary_key_encoding.pkl', 'wb') as f:
    pickle.dump(int_to_string_mapping, f)

dudubur commented 10 months ago

@notofir Show me what you've got... memory_usage = df.memory_usage(deep=True) / 1024 / 1024

notofir commented 10 months ago

Sorry, but the complex part didn't really work. The problem is that the prediction_summary column holds dicts. I'm exporting that column to a separate df.
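For reference, one alternative that is not what this issue ended up doing: the dict column could be flattened into a long (word, key, value) table, where the repeated key strings become an ordinary column that can be category-encoded. This is only a sketch on invented rows; the long_df layout and its "key" and "value" column names are assumptions.

```python
import pandas as pd

# Hypothetical rows: the column names "word" and "prediction_summary" come
# from this issue, the values are invented for illustration.
df = pd.DataFrame({
    "word": ["gene_a", "gene_b"],
    "prediction_summary": [
        {"transporter": 0.9, "kinase": 0.1},
        {"kinase": 0.7},
    ],
})

# Expand every dict into one (word, key, value) row per entry.
rows = [
    (word, key, value)
    for word, summary in zip(df["word"], df["prediction_summary"])
    for key, value in summary.items()
]
long_df = pd.DataFrame(rows, columns=["word", "key", "value"])

# The repeated key strings are now a plain string column, so they can be
# category-encoded like "color" and "predicted_class".
long_df["key"] = long_df["key"].astype("category")

# Rebuild the original dicts to confirm nothing was lost.
rebuilt = {
    word: dict(zip(group["key"], group["value"]))
    for word, group in long_df.groupby("word", sort=False)
}
print(rebuilt)
```

The trade-off is one row per dict entry instead of one row per word, so whether this is smaller or faster than a separate pickle would need to be measured on the real data.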

notofir commented 10 months ago

df["color"] = df["color"].astype("category")
df["predicted_class"] = df["predicted_class"].astype("category")
# Select both columns with a list; df['word', 'prediction_summary'] would raise a KeyError.
new_df = df[['word', 'prediction_summary']].copy()
new_df = new_df.reset_index(drop=True)
new_df.to_pickle("prediction_summary.pkl")

df = df.drop(columns=["label", "prediction_summary"])
df = df.reset_index(drop=True)
df.to_pickle("model_data.pkl")
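A sketch of how the two files might be read back and rejoined. It assumes "word" is unique per row, writes to a temporary directory rather than the real paths, and uses invented values; the toy frame has no "label" column, so that drop is omitted here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data: column names follow this issue, values are invented.
df = pd.DataFrame({
    "word": ["gene_a", "gene_b"],
    "color": ["#ff0000", "#00ff00"],
    "predicted_class": ["transporter", "kinase"],
    "prediction_summary": [{"transporter": 0.9}, {"kinase": 0.7}],
})
df["color"] = df["color"].astype("category")
df["predicted_class"] = df["predicted_class"].astype("category")

with tempfile.TemporaryDirectory() as tmp:
    summary_path = Path(tmp) / "prediction_summary.pkl"
    model_path = Path(tmp) / "model_data.pkl"

    # Split: the dict column goes to its own file, keyed by "word".
    df[["word", "prediction_summary"]].reset_index(drop=True).to_pickle(summary_path)
    df.drop(columns=["prediction_summary"]).reset_index(drop=True).to_pickle(model_path)

    # The light frame can be loaded on its own; the summaries only when needed.
    model_df = pd.read_pickle(model_path)
    summary_df = pd.read_pickle(summary_path)

# Rejoin on "word"; validate raises if "word" turns out not to be unique.
merged = model_df.merge(summary_df, on="word", how="left", validate="one_to_one")
print(merged)
```

Pickle preserves the category dtypes, so the encoded columns do not need to be converted again after loading.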
notofir commented 10 months ago

Resolved by #320.