Closed by SeanNaren 2 years ago
One thing that might be interesting is to pivot to a code-only model. This would make a pretty cool changeup from your traditional models. I'll need to scour a few research papers to see what's commonly used in this space (I assume some cleaned-up version of GitHub!)
Just to document: the Pile dataset contains a GitHub portion; however, it seems to be missing from the HF Hub.
Hey Sean, any update on what you plan to use for training data here? I'm very interested in this project :)
Yo @nateraw thanks for checking this out man!
I've actually been thinking about training a code-based LM as opposed to your standard Pile/C4 model trained on typical text. What do you think @nateraw?
Looking at this dataset currently: https://huggingface.co/datasets/codeparrot/github-code but I need to figure out the size of the dataset in tokens for training time estimates/optimality (I retrained a GPT tokenizer on the dataset using the tokenizers library to get a rough estimate).
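Roughly, the token-count estimate looks like this (a minimal sketch, assuming the dataset exposes a "code" field and using transformers' train_new_from_iterator; the sample sizes are arbitrary, not what I actually ran):
from itertools import islice
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the dataset so nothing has to be fully downloaded up front.
ds = load_dataset("codeparrot/github-code", streaming=True, split="train")

# Retrain the GPT-2 tokenizer on a sample of the code.
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
sample = (example["code"] for example in islice(iter(ds), 100_000))
code_tokenizer = base_tokenizer.train_new_from_iterator(sample, vocab_size=50_257)

# Tokenize a slice, count tokens, then extrapolate to the full dataset size.
token_count = sum(
    len(code_tokenizer(example["code"]).input_ids)
    for example in islice(iter(ds), 10_000)
)
print(f"~{token_count} tokens in the first 10k files")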
I had a look at this dataset: https://huggingface.co/datasets/CodedotAI/code_clippy_github which is much larger; however, streaming mode fails, maybe because the files are not being recognized as downloadable:
running:
from datasets import load_dataset
ds = load_dataset("CodedotAI/code_clippy_github", streaming=True, split='train')
print(next(iter(ds)))
gives me:
Traceback (most recent call last):
File "/Users/sean.narenthiran/Code/SmallScience/data/test.py", line 4, in <module>
print(next(iter(ds)))
File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 599, in __iter__
for key, example in self._iter():
File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 579, in _iter
yield from ex_iterable
File "/Users/sean.narenthiran/anaconda3/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 110, in __iter__
yield from self.generate_examples_fn(**self.kwargs)
File "/Users/sean.narenthiran/.cache/huggingface/modules/datasets_modules/datasets/CodedotAI--code_clippy_github/7d50e1f3328c8a1a567b018fdce90807226766cf93ee4877711f93570ae949b2/code_clippy_github.py", line 177, in _generate_examples
with gzip.open(file, "rb") as f:
File "/Users/sean.narenthiran/anaconda3/lib/python3.9/gzip.py", line 58, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/Users/sean.narenthiran/anaconda3/lib/python3.9/gzip.py", line 173, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'https://huggingface.co/datasets/CodedotAI/code_clippy_github/resolve/main/github-dedup-000000000000.json.gz'
I took the CodeParrot dataset (Python split), retrained a GPT tokenizer, and tokenized the entire dataset to get an estimate of its size. This gave me ~15 billion tokens. Based on the Chinchilla paper, to train a 1 billion parameter model we'll need ~20.2 billion tokens.
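To spell out the Chinchilla arithmetic (a minimal sketch; the paper's rule of thumb is roughly 20 training tokens per parameter, and its 1B-parameter estimate works out to ~20.2B tokens):
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 1_000_000_000
tokens_per_param = 20.2                            # value from the paper's 1B-parameter estimate
tokens_needed = int(tokens_per_param * params)     # ~20.2B tokens
dataset_tokens = 15_000_000_000                    # rough CodeParrot (Python split) estimate above
print(tokens_needed, dataset_tokens >= tokens_needed)  # 20200000000 False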
Will have to try to fix the CodedotAI Clippy dataset, which is substantially larger.
I've managed to fix the dataset! I've opened a PR here: https://huggingface.co/datasets/CodedotAI/code_clippy_github/discussions/6
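For anyone hitting the same traceback: the underlying issue is that gzip.open is handed a raw https:// URL, which the standard library can't open; in streaming mode, datasets only patches the built-in open to handle remote paths. The streaming-friendly pattern is to open the file first and let gzip wrap the file object (a sketch of the general pattern only; see the PR for the actual change):
# `open` is patched by `datasets` in streaming mode to handle remote URLs,
# whereas `gzip.open(path)` calls the unpatched built-in open internally.
with open(file, "rb") as f:
    with gzip.open(f, "rb") as g:
        for line in g:
            ...  # parse each JSON line as before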
I've decided for now to just go for a bog-standard NLP dataset that has streaming out of the box. I weighed up using something more exotic to train a code-based LM; however, to keep this minimal, let's try to re-use as much as possible.
I've been taking a peek at The Pile + C4, which are huge, beefy English-based datasets.
I also noted that The Pile has streaming support using HF datasets, and if that works that might be a game changer!
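Quick sanity check that both stream (a minimal sketch; the dataset IDs are assumptions and the canonical ones may differ, e.g. EleutherAI/pile or allenai/c4):
from datasets import load_dataset

# Both corpora should stream without downloading anything up front.
pile = load_dataset("the_pile", streaming=True, split="train")
c4 = load_dataset("c4", "en", streaming=True, split="train")

print(next(iter(pile))["text"][:200])
print(next(iter(c4))["text"][:200])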