Pre-process dataset

How does creating tables for each problem in the mini_dataset sound? That could actually make it a lot easier to work with later down the road.

I.e., for each problem:

1) Load its metadata CSV file with pandas and keep only the C, C++, and Python rows.
2) Remove unneeded columns from the new DataFrame (e.g., user_id, date, etc.).
3) From the new DataFrame, select a random 10% subset.
4) For each row in the subset, read its corresponding submission file.
5) Store the contents of the submission file in a new column named “solution.”
6) Save the new DataFrame as a feather or pickle file (typically faster to load than CSV, and often smaller).
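The steps above can be sketched roughly as follows. This is only a minimal sketch, not a tested implementation for the actual dataset: the column names `language`, `submission_id`, and `filename_ext`, the directory layout `<submissions_dir>/<language>/<submission_id>.<ext>`, and the function name `build_problem_table` are all assumptions and would need to be adjusted to whatever the mini_dataset really uses.

```python
from pathlib import Path

import pandas as pd

LANGUAGES = ["C", "C++", "Python"]
DROP_COLUMNS = ["user_id", "date"]  # assumed names of the unneeded columns


def build_problem_table(metadata_csv, submissions_dir, frac=0.1, seed=0, out_path=None):
    """Build one table for a single problem (hypothetical helper)."""
    df = pd.read_csv(metadata_csv)

    # 1) Keep only C, C++, and Python rows ("language" column is assumed).
    df = df[df["language"].isin(LANGUAGES)]

    # 2) Drop unneeded columns; ignore any that are not present.
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")

    # 3) Random subset (10% by default), seeded so it is reproducible.
    subset = df.sample(frac=frac, random_state=seed).reset_index(drop=True)

    # 4) + 5) Read each submission file into a new "solution" column.
    # The path layout below is an assumption about the dataset.
    def read_solution(row):
        path = (
            Path(submissions_dir)
            / row.language
            / f"{row.submission_id}.{row.filename_ext}"
        )
        return path.read_text(encoding="utf-8", errors="replace")

    subset["solution"] = [read_solution(row) for row in subset.itertuples()]

    # 6) Optionally save as pickle; DataFrame.to_feather is the
    # alternative, but it needs pyarrow installed.
    if out_path is not None:
        subset.to_pickle(out_path)
    return subset
```

Calling it once per problem and collecting the results in a list would make the later merge a single `pd.concat(tables, ignore_index=True)`.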

At some point, we will have to merge problems this way anyway, so might as well do it from the start.

TrevorAshby / CodeRLHF

Pre-process dataset #3