khaerulumam42 opened this issue 5 years ago
Hi there! There are scripts in the dataset directory that roughly show what needs to be done, but it's all quite rough, sorry; I still haven't found the time to refactor everything. Basically, you need to convert your text files into tfrecords using the create_tfrecords.py script (you'll need to modify it by hand to pick up your text files). Then you place those in a Google Storage bucket. Finally, you need to modify the input function (or create a new one) in inputs.py. You can see how it works by looking at the openwebtext function: you just need to create a list of your train and eval file names and pass them, as shown there, to the bpe_text function, which returns a TF dataset that your training can use.
Hope that helps.
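To make the last step concrete, here is a minimal sketch of the file-listing part of a custom input function. The split point and file pattern are assumptions for illustration; the actual call into bpe_text should follow whatever the openwebtext function in inputs.py does.

```python
# Hypothetical helper for a custom input function in inputs.py.
# Collects TFRecord shard names and splits them into train/eval lists,
# which would then be passed to bpe_text as the openwebtext function does.
from glob import glob
import os

def my_dataset_files(data_dir):
    """Return (train_files, eval_files) for all *.tfrecords shards in data_dir.

    Holding out the first shard for eval is an arbitrary choice here;
    pick whatever split matches your data.
    """
    all_files = sorted(glob(os.path.join(data_dir, "*.tfrecords")))
    eval_files = all_files[:1]   # hold out one shard for eval (assumption)
    train_files = all_files[1:]
    return train_files, eval_files
```

With a Google Storage bucket you would build the same lists from `gs://...` paths (for example via tf.io.gfile.glob) instead of the local glob used here.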
Yes, I created some functions for my dataset and it works, thanks. I will try to open a PR to make your code more flexible. I have some questions about your code:
In inputs.py, I tried changing the stitch value from 42 to 2 to solve my OutOfRange error, but the error still appears when iterations reach 10000. Thank you :)
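For anyone hitting the same error: this is not the repo's actual code, but a plain-Python illustration of why lowering stitch alone may not help. Assuming stitch controls how many documents are concatenated to cut each training sample, a small dataset is simply exhausted after a fixed number of samples, which TensorFlow surfaces as an OutOfRange error.

```python
# Hypothetical illustration (not the repo's code) of a "stitch" pipeline:
# each sample consumes `stitch` documents, so the iterator is exhausted
# after roughly len(docs) // stitch samples.
def stitched_samples(docs, stitch, sample_len):
    """Concatenate `stitch` docs at a time and cut one fixed-length sample."""
    it = iter(docs)
    while True:
        chunk = []
        try:
            for _ in range(stitch):
                chunk.extend(next(it))
        except StopIteration:
            return  # data exhausted -> TF would raise OutOfRangeError here
        yield chunk[:sample_len]

docs = [[1, 2, 3]] * 10  # ten tiny "documents"
samples = list(stitched_samples(docs, stitch=2, sample_len=4))
# 10 docs with stitch=2 yield only 5 samples before the data runs out
```

So if the number of samples your TFRecords can yield is smaller than train steps times batch size, you will hit OutOfRange regardless of the stitch value; repeating the dataset (tf.data's Dataset.repeat()) or adding more data is the usual fix.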
Thanks! I'm hoping to find some time this week to polish the code and write a better tutorial; we'll see. Looking forward to your PR!
Please make a sample Colab notebook on data preprocessing. I am also getting the OutOfRange error. I tried changing the dataset and the stitch values, but it didn't work for me.
Hi @kbrajwani, I'm afraid I do not maintain this repo anymore. I would recommend using the Hugging Face transformers library instead. Good luck!
Hey @ConnorJL, no problem. I tried to do that; see https://github.com/huggingface/transformers/issues/6672. They have some issues with TPU.
Wow, nice repository! I was also looking for a GPT-2 repo to train on TPU, because I just got access to Google Cloud TPUs from the TensorFlow Research Cloud program. I have a plain-text dataset, but I don't know how to reformat it into a trainable format like the one in your repo. Is there any formatted dataset you created to train with this repo?
Thank you very much for your answer and for creating this repo, awesome!