EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License
8.21k stars 945 forks

Dataset preparation #257

Closed. BakingBrains closed this issue 2 years ago

BakingBrains commented 2 years ago

Can you please suggest how I can prepare the dataset for a code generation task? Or is the data prepared the same way as for a text generation task?

StellaAthena commented 2 years ago

It is prepared the same way as for the text generation task. You may see improved performance with a customized tokenizer, though, as standard text tokenization does not handle the syntax of code particularly well.
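
To illustrate the tokenizer point, here is a minimal sketch, assuming a toy byte-pair-encoding (BPE) trainer written from scratch. The names `train_bpe`, `apply_merge`, and `encode`, and both tiny corpora, are hypothetical and are not part of gpt-neo. It only shows that merges learned from code compress a code snippet into fewer tokens than merges learned from prose.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of strings (toy, character-level)."""
    sequences = [list(doc) for doc in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all documents.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:
            break  # nothing repeats any more, so further merges would not generalize
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy corpora: one of code, one of prose.
code_corpus = ["def f(x):\n    return x\n", "def g(y):\n    return y\n"]
prose_corpus = ["the cat sat on the mat", "the dog sat on the log"]

sample = "def h(z):\n    return z\n"
code_tokens = encode(sample, train_bpe(code_corpus, 30))
prose_tokens = encode(sample, train_bpe(prose_corpus, 30))

print(len(code_tokens), code_tokens)
print(len(prose_tokens), prose_tokens)
```

The code-trained merges learn units such as the indentation and `return`, so the same snippet needs fewer tokens. A real setup would train a proper tokenizer on a large code corpus rather than use this toy trainer.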