TrevorAshby / CodeRLHF


Pre-process dataset #3

Closed TrevorAshby closed 9 months ago

TrevorAshby commented 9 months ago

A file has been added that creates mini_dataset, a subset containing x% of the original dataset. The dataset still needs to be split, zipped, and uploaded to the repo.

faustotnc commented 9 months ago

How does creating tables for each problem in the mini_dataset sound? That could actually make it a lot easier to work with later down the road.

That is, for each problem:

1) Read its metadata CSV file with pandas and keep only the C, C++, and Python rows.
2) Remove unneeded columns from the new DataFrame (e.g., user_id, date, etc.).
3) From the new DataFrame, select a random 10% subset.
4) For each row in the subset, read its corresponding submission file.
5) Store the contents of the submission file in a new column named “solution”.
6) Save the new DataFrame as a feather or pickle file (much smaller and faster than CSV).

At some point, we will have to merge problems this way anyway, so might as well do it from the start.
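
The per-problem steps listed above could be sketched roughly as follows. This is only an illustration under assumptions the thread does not confirm: the column names `language`, `submission_id`, and `filename_ext`, the `<submissions_dir>/<language>/<submission_id>.<ext>` layout, and the function name `build_problem_table` are all hypothetical and would need to be adapted to the real dataset.

```python
from pathlib import Path

import pandas as pd

# Assumed values; adjust to the actual metadata schema.
LANGUAGES = ["C", "C++", "Python"]
DROP_COLUMNS = ["user_id", "date"]


def build_problem_table(metadata_csv, submissions_dir, out_path, frac=0.10, seed=0):
    """Build and save one table of sampled submissions for a single problem."""
    df = pd.read_csv(metadata_csv)

    # 1) Keep only C, C++, and Python rows ("language" column is assumed).
    df = df[df["language"].isin(LANGUAGES)]

    # 2) Drop unneeded columns, ignoring any that are not present.
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")

    # 3) Random subset; a fixed seed keeps the sample reproducible.
    subset = df.sample(frac=frac, random_state=seed).reset_index(drop=True)

    # 4) + 5) Read each submission file into a new "solution" column.
    #         The directory layout used here is hypothetical.
    solutions = []
    for row in subset.itertuples(index=False):
        path = (
            Path(submissions_dir)
            / row.language
            / f"{row.submission_id}.{row.filename_ext}"
        )
        solutions.append(path.read_text(encoding="utf-8", errors="replace"))
    subset["solution"] = solutions

    # 6) Save as pickle. DataFrame.to_feather is an alternative,
    #    but it requires the optional pyarrow dependency.
    subset.to_pickle(out_path)
    return subset
```

Pickle is used in the sketch only because it needs no extra dependency; the saved table can be loaded again with `pd.read_pickle(out_path)`.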