Closed: SudharsanSundar closed this 9 months ago.
@SudharsanSundar where are the details for how to download the datasets, e.g., PubMed and USPTO? It's really important that this is clear.
@SudharsanSundar did you document how you chose the number of data points in each training setting so that all experiments use the same number of tokens? If not, can we write a .md for that or add it to the README so the experiments are reproducible?
In particular, lines 27 and 28 are what I used to cache (i.e., download locally) the datasets in HF `datasets` format: https://github.com/SudharsanSundar/beyond-scale-SS-fork/blob/d45faf2a3e91c43cd6b3f2e85aa7ebef9b169aad/src/train/SS-train-code/cache_datasets.py#L27
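For reference, a minimal sketch of what that caching step typically looks like with the `datasets` library. The config names (`pubmed_abstracts`, `uspto`) are my assumptions, not necessarily what the actual lines 27-28 use, and per the comment below the `SudharsanSundar/PileSubsets` repo may no longer be hosted:

```python
from datasets import load_dataset

# Hypothetical recreation of the caching step; the real lines 27-28 in
# cache_datasets.py may differ. On first call, load_dataset downloads the
# data into the local HF cache (~/.cache/huggingface/datasets), so later
# runs read from disk instead of re-downloading.
pubmed = load_dataset("SudharsanSundar/PileSubsets", "pubmed_abstracts", split="train")
uspto = load_dataset("SudharsanSundar/PileSubsets", "uspto", split="train")
```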
Feel free to lmk if there's anything unclear there; I'd be happy to clarify! In general, I go through my experimental-setup decisions in a good amount of detail in the README, so I think it'll serve as a good reference if you have questions about how I ran my experiments!
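For anyone reproducing the equal-token setup asked about above, here is a minimal sketch of one common approach: measure the average tokens per example for each dataset, then cap the example count so every run sees the same token budget. The budget, dataset names, and per-example averages below are illustrative assumptions, not this repo's values:

```python
# Hypothetical sketch: choose per-dataset example counts so that every
# experiment trains on (approximately) the same number of tokens.
TARGET_TOKENS = 50_000_000  # illustrative token budget, not the repo's

avg_tokens_per_example = {  # e.g., measured by tokenizing a small sample
    "pubmed": 320,
    "uspto": 410,
}

num_examples = {
    name: TARGET_TOKENS // avg for name, avg in avg_tokens_per_example.items()
}
print(num_examples)  # {'pubmed': 156250, 'uspto': 121951}
```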
Hi! I am currently working with Brando on his project and am utilizing this code. Is 'SudharsanSundar/PileSubsets' still up? I can't seem to find it on HuggingFace, and I am getting a FileNotFoundError. Thank you so much!
(v1) Allows you to train a custom-architecture GPT-2 model from scratch for a specific number of tokens, then evaluate its perplexity (and accuracy), using Hugging Face (iterable/streaming) datasets.
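As a rough illustration of that kind of loop, here is a minimal, self-contained sketch using the public `transformers`/`datasets` APIs. The architecture values, the dataset (`allenai/c4`), and the token budget are placeholder assumptions, not this repo's actual configuration:

```python
import math
import torch
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

SEQ_LEN = 128          # illustrative; real runs would likely use 1024
TOKEN_BUDGET = 10_000  # illustrative "train for N tokens" budget

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
config = GPT2Config(n_positions=SEQ_LEN, n_layer=2, n_head=2, n_embd=128)
model = GPT2LMHeadModel(config)  # from scratch: random init, not pretrained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# streaming=True returns an IterableDataset, so nothing is downloaded up front.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokens_seen = 0
for example in stream:
    enc = tokenizer(example["text"], truncation=True,
                    max_length=SEQ_LEN, return_tensors="pt")
    input_ids = enc["input_ids"]
    if input_ids.numel() < 2:  # need at least two tokens for a shifted LM loss
        continue
    out = model(input_ids=input_ids, labels=input_ids)  # HF shifts labels internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    tokens_seen += input_ids.numel()
    if tokens_seen >= TOKEN_BUDGET:  # stop once the token budget is reached
        break

# Perplexity is exp(mean cross-entropy); here just the last batch's, for brevity.
print(f"trained on {tokens_seen} tokens; last-batch perplexity ~ {math.exp(out.loss.item()):.1f}")
```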