ConnorJL / GPT2

An implementation of training for GPT2 that supports TPUs
MIT License

format dataset #13

khaerulumam42 opened 5 years ago

khaerulumam42 commented 5 years ago

Wow, nice repository! I was also looking for a GPT2 repo to train on TPUs, because I just got access to Google Cloud TPUs through the TensorFlow Research Cloud program. I have a plain-text dataset, but I don't know how to reformat it into a trainable dataset format like the one in your repo. Did you create any formatted dataset for training with this repo?

Thanks a lot for your answer and for creating this repo, it's awesome!

ConnorJL commented 5 years ago

Hi there! There are scripts in the dataset directory that roughly show what needs to be done, but it's all still super rough, sorry; I haven't found the time to refactor everything. Basically, you need to convert your text files into tfrecords using the create_tfrecords.py script (you'll need to modify it by hand to pick up your text files). Then you place those in a Google Storage bucket. Finally, you need to modify the input function (or create a new one) in inputs.py; you can see how it works by looking at the openwebtext function. You just need to create a list of your train and eval file names and pass them as shown to the bpe_text function, which returns a TF dataset that your training can use.
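Here's a very rough sketch of what those two steps look like. Everything in it is illustrative: the bucket path, the file patterns, the "text" feature key, and the bpe_text call are placeholders and assumptions, not the repo's actual names; the real record layout is in create_tfrecords.py and the real signature is in inputs.py.

```python
# Rough sketch only -- the real record layout lives in create_tfrecords.py,
# and the real bpe_text signature is in inputs.py. The paths, file patterns
# and the "text" feature key below are placeholders, not the repo's names.
import tensorflow as tf

def write_tfrecord(token_chunks, path):
    # token_chunks: an iterable of lists of BPE token ids, one per document
    with tf.io.TFRecordWriter(path) as writer:
        for tokens in token_chunks:
            feature = {
                "text": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=tokens))
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

def my_dataset(params, eval=False):
    # Point at the tfrecords you uploaded to your storage bucket
    pattern = ("gs://my-bucket/my-data/eval_*.tfrecords" if eval
               else "gs://my-bucket/my-data/train_*.tfrecords")
    files = tf.io.gfile.glob(pattern)
    # bpe_text returns a tf.data.Dataset of n_ctx+1-token chunks; copy the
    # exact call from the openwebtext function rather than from this sketch.
    return bpe_text(params["batch_size"], files, params)
```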

Hope that helps.

khaerulumam42 commented 5 years ago

Yes, I created a function for my dataset and it works, thanks. I will try to open a PR to make your code more flexible. I have some questions about your code:

  1. What does stitch mean in inputs.py? I tried changing the stitch value from 42 to 2 to solve my OutOfRange error, but the error still appears when the iterations reach 10000.
  2. I split my 2GB dataset into 10MB txt files, and some files cannot be converted to tfrecord data. I haven't figured out why this happens; is there any rule for making tfrecord data?

Thank you :)

ConnorJL commented 5 years ago

Thanks, I'm hoping to find some time this week to polish some code and write a better tutorial, we'll see. Looking forward to your PR!

  1. To train the model, you need to feed it chunks of text that are n_ctx+1 tokens long. Since most texts won't be that long, I concatenate multiple texts with "<|endoftext|>" between them to reach that length. Stitch determines how many such texts are loaded and concatenated before n_ctx+1 symbols are sliced out of the result. That means stitch must be set so that (minimum-length-of-your-texts * stitch) >= n_ctx+1. Setting that correctly should fix your OutOfRange error (see the sketch after this list). Note that "length" here is in BPE tokens, not Unicode symbols.
  2. By default, the script throws out files that are composed only of zeros or that are smaller than a certain length (25 BPE tokens by default, iirc). It should also throw out anything that raises an error during reading or during ftfy's fixing process. So I would expect your text files are either too small or contain some kind of totally corrupt Unicode. You can see the encoding/writing process here.
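To make the arithmetic in point 1 concrete, here is a tiny back-of-the-envelope check. The numbers are just examples (1024 is GPT-2's usual context length, 25 tokens is the minimum document length mentioned in point 2); plug in your own values.

```python
# Back-of-the-envelope check of the stitch constraint from point 1.
# Example numbers only -- adjust n_ctx and min_text_len to your setup.
import math

n_ctx = 1024       # model context length
min_text_len = 25  # length of your shortest text, in BPE tokens

# stitch documents are concatenated before a slice of n_ctx + 1 tokens is
# taken, so even stitch minimum-length texts must cover n_ctx + 1 tokens:
min_stitch = math.ceil((n_ctx + 1) / min_text_len)
print(min_stitch)  # 41 -- which is why stitch=42 works here while stitch=2 fails
```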

Hope that helps!

kbrajwani commented 3 years ago

Please make a sample Colab notebook on data preprocessing; I am also getting the OutOfRange error. I tried changing the dataset and the stitch values as well, but it didn't work for me.

ConnorJL commented 3 years ago

Hi @kbrajwani , I'm afraid I do not maintain this repo anymore. I would recommend using the Hugging Face transformers library instead. Good luck!
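If you go that route, a minimal starting point looks roughly like this (class names as in recent transformers 4.x releases; check their docs for the current API, and note their Trainer class is what handles batching and TPU/accelerator setup for real training):

```python
# A minimal sketch of the suggested alternative: GPT-2 via Hugging Face
# transformers (API as in recent 4.x releases; check their docs if it moved).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Compute the language-modeling loss on a toy input as a smoke test.
inputs = tokenizer("Hello, TPU world", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```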

kbrajwani commented 3 years ago

Hey @ConnorJL, no problem. I tried to do that, see https://github.com/huggingface/transformers/issues/6672. They have some issues with TPUs.