SeanNaren / min-LLM

Minimal code to train a Large Language Model (LLM).

Training Data :mag_right: #7

Closed SeanNaren closed 2 years ago

SeanNaren commented 2 years ago

I've been taking a peek at The Pile + C4, which are huge, beefy English-language datasets.

I also noted that The Pile has streaming support via HF datasets, and if that works it might be a game changer!
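
For reference, a minimal sketch of what that streaming path might look like with HF datasets (the "the_pile" dataset id is an assumption and may differ on the Hub):

from datasets import load_dataset

# stream The Pile lazily instead of downloading the whole dataset up front
# ("the_pile" is an assumed dataset id; check the Hub for the canonical one)
pile = load_dataset("the_pile", streaming=True, split="train")
print(next(iter(pile)))  # pull a single example over the network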

SeanNaren commented 2 years ago

One thing that might be interesting is to pivot to a code-only model. This would make a pretty cool change-up from your traditional models. I'll need to scour a few research papers to see what's commonly used in this space (I assume some cleaned-up version of GitHub!)

SeanNaren commented 2 years ago

Just to document: the Pile dataset contains a GitHub portion; however, it seems to be missing from the HF hub.

nateraw commented 2 years ago

Hey Sean, any update on what you plan to use for training data here? I'm very interested in this project :)

SeanNaren commented 2 years ago

Yo @nateraw thanks for checking this out man!

I've actually been thinking about training a code-based LM, as opposed to your standard LM trained on typical text like The Pile/C4. What do you think @nateraw?

Looking at this dataset currently: https://huggingface.co/datasets/codeparrot/github-code but I need to figure out the size of the dataset in tokens for training time estimates/optimality (I retrained a GPT tokenizer on the dataset using the tokenizers library to get a rough estimate).
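
Roughly, that estimate can be sketched like this (a rough sketch only: the languages argument and the "code" field follow the dataset card, while the GPT-2 base tokenizer and sample sizes are assumptions):

from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

# stream the Python split so nothing has to be downloaded up front
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", languages=["Python"])

def text_batches(dataset, batch_size=1_000, max_batches=500):
    # yield batches of source files to train the tokenizer on
    batch = []
    for example in dataset:
        batch.append(example["code"])
        if len(batch) == batch_size:
            yield batch
            batch = []
            max_batches -= 1
            if max_batches == 0:
                return

# retrain a GPT-2 tokenizer on a sample of the code
base = AutoTokenizer.from_pretrained("gpt2")
code_tokenizer = base.train_new_from_iterator(text_batches(ds), vocab_size=base.vocab_size)

# tokenize a sample and extrapolate tokens-per-file to the full split
sample = list(islice(ds, 10_000))
sample_tokens = sum(len(code_tokenizer(example["code"]).input_ids) for example in sample)
print(f"~{sample_tokens / len(sample):.0f} tokens per file on average")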

I had a look at this dataset: https://huggingface.co/datasets/CodedotAI/code_clippy_github which is much larger; however, streaming mode seems to fail, maybe because the files are not being recognized as downloadable:

running:

from datasets import load_dataset
ds = load_dataset("CodedotAI/code_clippy_github", streaming=True, split='train')
print(next(iter(ds)))

gives me:

Traceback (most recent call last):
  File "/Users/sean.narenthiran/Code/SmallScience/data/test.py", line 4, in <module>
    print(next(iter(ds)))
  File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 599, in __iter__
    for key, example in self._iter():
  File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 579, in _iter
    yield from ex_iterable
  File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 110, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/Users/sean.narenthiran/.cache/huggingface/modules/datasets_modules/datasets/CodedotAI--code_clippy_github/7d50e1f3328c8a1a567b018fdce90807226766cf93ee4877711f93570ae949b2/code_clippy_github.py", line 177, in _generate_examples
    with gzip.open(file, "rb") as f:
  File "/Users/sean.narenthiran/anaconda3/lib/python3.9/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/Users/sean.narenthiran/anaconda3/lib/python3.9/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'https://huggingface.co/datasets/CodedotAI/code_clippy_github/resolve/main/github-dedup-000000000000.json.gz'
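
From the traceback, gzip.open is being handed the remote URL as if it were a local path. Purely as a hypothetical illustration (not necessarily what the eventual fix does), streaming a shard like this could look as follows with fsspec:

import gzip

import fsspec

# open the remote gzip over HTTP and stream it, rather than treating the URL as a local path
url = "https://huggingface.co/datasets/CodedotAI/code_clippy_github/resolve/main/github-dedup-000000000000.json.gz"
with fsspec.open(url, "rb") as remote_file:
    with gzip.open(remote_file, "rb") as f:
        print(f.readline())  # first JSON line of the shard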

SeanNaren commented 2 years ago

I took the CodeParrot dataset (Python split), retrained a GPT tokenizer, and tokenized the entire dataset to get an estimate of its size. This gave me ~15 billion tokens. Based on the Chinchilla paper, to train a 1 billion parameter model we'll need ~20.2 billion tokens.
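
As a back-of-the-envelope check (using the ~20.2 tokens-per-parameter figure above; the exact Chinchilla ratio depends on the fit used):

# Chinchilla-style sanity check: ~20.2 training tokens per parameter
params = 1_000_000_000                  # 1B parameter model
tokens_needed = 20.2 * params           # ~20.2B tokens
dataset_tokens = 15e9                   # ~15B tokens estimated from the CodeParrot Python split
print(tokens_needed / 1e9)              # 20.2
print(dataset_tokens / tokens_needed)   # ~0.74 -> the Python split alone falls short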

Will have to try to fix the CodedotAI Clippy dataset, which is substantially larger.

SeanNaren commented 2 years ago

I've managed to fix the dataset! I've opened a PR here: https://huggingface.co/datasets/CodedotAI/code_clippy_github/discussions/6

SeanNaren commented 2 years ago

I've decided for now to just go for a bog-standard NLP dataset that has streaming out of the box. I weighed up using something more exotic to train a code-based LM; however, to keep this minimal, let's try to reuse as much as possible.