CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57

**Code Tokenization** #6

Open ncoop57 opened 3 years ago

ncoop57 commented 3 years ago
bentrevett commented 3 years ago

Here's a paper that runs a few experiments on different tokenization strategies for code: https://arxiv.org/abs/2004.13651.

Subtokenization, i.e. splitting tokens on camelCase, underscores, hyphens, etc., and BPE (byte-pair encoding) tokenization both seem to do pretty well. The subtokenization method, however, needs to be manually written -- potentially error prone -- whereas BPE tokenization could use the HuggingFace Tokenizers repo (https://github.com/huggingface/tokenizers), which is pretty good. Existing BPE tokenizers trained on code: https://huggingface.co/microsoft/codebert-base, https://huggingface.co/huggingface/CodeBERTa-small-v1, https://huggingface.co/huggingface/CodeBERTa-language-id. Although, once the dataset is collected, a BPE tokenizer can also be trained directly on it.
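
For illustration, here is a minimal sketch of training a byte-level BPE tokenizer on a collected code corpus with the HuggingFace Tokenizers library. The file paths, vocab size, and special tokens are placeholder assumptions, not settled choices for this project:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical shards of the collected code dataset (plain-text files).
code_files = ["data/part-000.txt", "data/part-001.txt"]

tokenizer = ByteLevelBPETokenizer()

# Vocab size, min frequency, and special tokens are illustrative values only.
tokenizer.train(
    files=code_files,
    vocab_size=50_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<pad>"],
)

# Writes vocab.json and merges.txt for later reuse.
tokenizer.save_model("tokenizer/")

print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)
```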

bentrevett commented 3 years ago

Microsoft's Deep Program Understanding (DPU) group has a "utils" repo: https://github.com/microsoft/dpu-utils

In the repo they have a code snippet for splitting identifiers on camelCase and snake_case which I think is pretty good, see: https://github.com/microsoft/dpu-utils/blob/master/python/dpu_utils/codeutils/identifiersplitting.py

This can be adapted for kebab-case by adding `identifier = identifier.replace('-', '_')` just inside the `split_identifier_into_parts` function (it could probably also be done by just editing the regex, but I am not good enough at regex to figure out how).
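
As a rough sketch of that suggestion, here is a small wrapper around dpu_utils (rather than patching the function in place) that normalizes kebab-case before splitting; this assumes `split_identifier_into_parts` is importable from `dpu_utils.codeutils`, as in the linked file:

```python
from typing import List

from dpu_utils.codeutils import split_identifier_into_parts


def split_identifier(identifier: str) -> List[str]:
    """Split camelCase, snake_case, and kebab-case identifiers into parts."""
    # Normalize kebab-case to snake_case so the existing splitter handles it.
    identifier = identifier.replace('-', '_')
    return split_identifier_into_parts(identifier)


print(split_identifier("parseHTTPResponse"))
print(split_identifier("my-kebab-case-name"))  # kebab-case now splits too
```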

Another thing they have is the ability to get the list of keywords for each language, which might be useful for BPE tokenization since we could add the keywords to the list of words that we don't split into parts -- although they should appear frequently enough that BPE tokenization doesn't split them anyway.
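
A hedged sketch of how those keyword lists could feed into tokenizer training, assuming dpu_utils exposes a `get_language_keywords` helper (as its codeutils keywords module suggests) and reusing the Tokenizers setup from above; treating keywords as added whole tokens is just one option, not a decided approach:

```python
from dpu_utils.codeutils import get_language_keywords
from tokenizers import ByteLevelBPETokenizer

# Assumed helper: returns the reserved keywords for a given language name.
keywords = sorted(set(get_language_keywords("python")) | set(get_language_keywords("go")))

# Reusing the byte-level BPE tokenizer sketched above (paths are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["data/part-000.txt"], vocab_size=50_000)

# Register keywords as whole tokens so the tokenizer never splits them into parts.
tokenizer.add_tokens(keywords)
```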

neubig commented 3 years ago

If we're using any of the GPT-Neo pre-trained models, I think we're basically stuck with their tokenization (which is undoubtedly some variety of automatically induced subwords).

If we roll our own segmentation, then BPE trained on whatever data we collect would be my suggestion.

Important: we need to treat whitespace as a first-class citizen and make sure it gets allocated tokens appropriately. Even vertical whitespace can have an effect on what code should be predicted next: https://twitter.com/miltos1/status/1410663145052442629
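
Purely as an illustration of that point (not a decided design), the snippet below checks that a byte-level BPE tokenizer keeps horizontal and vertical whitespace through an encode/decode round trip, and sketches one way to give common indentation runs their own tokens; the corpus path, indentation widths, and settings are placeholder assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["data/part-000.txt"], vocab_size=50_000)  # placeholder corpus

# Byte-level BPE keeps every space and newline, so indentation and blank lines
# survive the round trip and the model can condition on them.
sample = "def f(x):\n\n    return x\n"
encoding = tokenizer.encode(sample)
print(encoding.tokens)
print(repr(tokenizer.decode(encoding.ids)))  # should match the original string

# Optionally, give common indentation runs and blank lines dedicated tokens so
# they cost one token each instead of many single-space tokens.
tokenizer.add_tokens([" " * n for n in (4, 8, 12)] + ["\n\n"])
```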