ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Add template for parquet datasets #175

Closed klei22 closed 3 months ago

klei22 commented 3 months ago

Adding scripts compatible with:

  1. Canola - English-Python dataset from web-crawled data
  2. MMLU-PRO - a benchmark of difficult multiple-choice questions

Also created:

  1. get_json_dataset.py - for getting JSON datasets automatically from a URL
  2. get_parquet_dataset.py - for getting Parquet datasets automatically from a URL

Also added a helper script, get_dataset.sh, in the template folder, with prefilled fields to make fetching new datasets easier.
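
For reviewers who want a feel for the JSON flow, here is a minimal sketch of what a URL-to-text fetch could look like. It is not the code in get_json_dataset.py: the function names `download_file` and `json_to_text`, the `key` parameter, and the assumption that the JSON file is a top-level list of objects are all hypothetical and chosen only for illustration.

```python
import json
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, skipping the download if `dest` already exists."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            dest.write_bytes(response.read())
    return dest


def json_to_text(json_path: Path, text_path: Path, key: str) -> int:
    """Write the `key` field of each record to `text_path`, one record per line.

    Assumes (hypothetically) that the JSON file is a top-level list of objects.
    Records that lack `key` are skipped. Returns the number of records written.
    """
    records = json.loads(json_path.read_text(encoding="utf-8"))
    written = 0
    with text_path.open("w", encoding="utf-8") as out:
        for record in records:
            value = record.get(key)
            if value is None:
                continue
            out.write(f"{value}\n")
            written += 1
    return written
```

A call such as `json_to_text(download_file(url, Path("data/raw.json")), Path("input.txt"), "text")` would then produce a plain-text file for tokenization; the paths and the `"text"` key here are placeholders. A Parquet variant would presumably follow the same shape, reading the file with a library such as pandas instead of `json`.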