Closed: SudharsanSundar closed this 9 months ago.
@SudharsanSundar where are the details for how to download the datasets, e.g., PubMed and USPTO? It's really important that this is clear.
@SudharsanSundar did you document how you chose the number of data points in each training setting so that all experiments use the same number of tokens? If not, can we write a .md for that or add it to the README so the experiments are reproducible?
In particular, lines 27 and 28 are what I used to cache (i.e., download locally) the datasets in HF `datasets` format: https://github.com/SudharsanSundar/beyond-scale-SS-fork/blob/d45faf2a3e91c43cd6b3f2e85aa7ebef9b169aad/src/train/SS-train-code/cache_datasets.py#L27
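For reference, a minimal sketch of what that caching step typically looks like with the `datasets` library. The config names (`pubmed_abstracts`, `uspto`) are my assumptions, not necessarily what the actual lines 27-28 use, and per the comment below the `SudharsanSundar/PileSubsets` repo may no longer be hosted:

```python
from datasets import load_dataset

# Hypothetical recreation of the caching step; the real lines 27-28 in
# cache_datasets.py may differ. On first call, load_dataset downloads the
# data into the local HF cache (~/.cache/huggingface/datasets), so later
# runs read from disk instead of re-downloading.
pubmed = load_dataset("SudharsanSundar/PileSubsets", "pubmed_abstracts", split="train")
uspto = load_dataset("SudharsanSundar/PileSubsets", "uspto", split="train")
```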
Feel free to lmk if there's anything unclear there; I'd be happy to clarify! In general, I go through my experimental-setup decisions in a good amount of detail in the README, so I think it'll serve as a good reference if you have questions about how I ran my experiments!
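For anyone reproducing the equal-token setup asked about above, here is a minimal sketch of one common approach: measure the average tokens per example for each dataset, then cap the example count so every run sees the same token budget. The budget, dataset names, and per-example averages below are illustrative assumptions, not this repo's values:

```python
# Hypothetical sketch: choose per-dataset example counts so that every
# experiment trains on (approximately) the same number of tokens.
TARGET_TOKENS = 50_000_000  # illustrative token budget, not the repo's

avg_tokens_per_example = {  # e.g., measured by tokenizing a small sample
    "pubmed": 320,
    "uspto": 410,
}

num_examples = {
    name: TARGET_TOKENS // avg for name, avg in avg_tokens_per_example.items()
}
print(num_examples)  # {'pubmed': 156250, 'uspto': 121951}
```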
Hi! I am currently working with Brando on his project and am utilizing this code. Is 'SudharsanSundar/PileSubsets' still up? I can't seem to find it on HuggingFace, and I am getting a FileNotFoundError. Thank you so much!
(v1) Allows you to train a custom-architecture GPT-2 model from scratch for a specific number of tokens, then evaluate its perplexity (and accuracy), using Hugging Face (iterable/streaming) datasets.
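As a rough illustration of that kind of loop, here is a minimal, self-contained sketch using the public `transformers`/`datasets` APIs. The architecture values, the dataset (`allenai/c4`), and the token budget are placeholder assumptions, not this repo's actual configuration:

```python
import math
import torch
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

SEQ_LEN = 128          # illustrative; real runs would likely use 1024
TOKEN_BUDGET = 10_000  # illustrative "train for N tokens" budget

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
config = GPT2Config(n_positions=SEQ_LEN, n_layer=2, n_head=2, n_embd=128)
model = GPT2LMHeadModel(config)  # from scratch: random init, not pretrained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# streaming=True returns an IterableDataset, so nothing is downloaded up front.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokens_seen = 0
for example in stream:
    enc = tokenizer(example["text"], truncation=True,
                    max_length=SEQ_LEN, return_tensors="pt")
    input_ids = enc["input_ids"]
    if input_ids.numel() < 2:  # need at least two tokens for a shifted LM loss
        continue
    out = model(input_ids=input_ids, labels=input_ids)  # HF shifts labels internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    tokens_seen += input_ids.numel()
    if tokens_seen >= TOKEN_BUDGET:  # stop once the token budget is reached
        break

# Perplexity is exp(mean cross-entropy); here just the last batch's, for brevity.
print(f"trained on {tokens_seen} tokens; last-batch perplexity ~ {math.exp(out.loss.item()):.1f}")
```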