bertin-project / bertin-t5x

BERTIN Project T5X training files
Apache License 2.0

How does the t5x pretraining script understand how to load the appropriate "train" and "validation" HF dataset splits? #1

Closed StephennFernandes closed 2 years ago

StephennFernandes commented 2 years ago

@versae Hey there, I am also pretraining T5 1.1 base using a custom-generated dataset that's on Hugging Face. Your repo has been a huge help in making the HF dataset compatible.

Upon making a tasks.py just like yours, I have started pretraining a base T5 1.1, but I had a doubt: how does the t5x training script know how to load the train and validation subsets?

Also, one weird thing I noticed is that the training seems too fast. My dataset is 20GB with 60M+ samples, and it's running on 2 x A6000s (96GB of VRAM in total) with 64GB of RAM. After 2 hours, the training has already completed 10k steps, which is a bit confusing for a model of that size.

It crossed my mind that the model might be training on the "validation" samples and not on the actual training data.

versae commented 2 years ago

Hi @StephennFernandes,

Happy to read it's been useful to you. The training script expects a dataset with two splits, train and validation, which map one-to-one to Hugging Face dataset splits, as done here: https://github.com/bertin-project/bertin-t5x/blob/main/tasks.py#L58
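A minimal sketch of that mapping (not the exact tasks.py code; the dataset name "my_hf_dataset", the "text" column, and the local vocab path are placeholders):

```python
import functools

import datasets
import seqio
import tensorflow as tf
from t5.data import preprocessors as t5_preprocessors

# Hypothetical local SentencePiece model; use the same file as in your gin config.
VOCAB = seqio.SentencePieceVocabulary(
    "pretrain_model/my_custom_32000_bpe.sp.model", extra_ids=100
)


def hf_dataset_fn(split, shuffle_files=False, seed=None):
    # The split seqio asks for ("train" or "validation") is forwarded unchanged
    # to datasets.load_dataset, which is what gives the 1:1 mapping.
    ds = datasets.load_dataset("my_hf_dataset", split=split)  # placeholder dataset name
    if shuffle_files:
        ds = ds.shuffle(seed=seed)
    return tf.data.Dataset.from_generator(
        lambda: ({"text": ex["text"]} for ex in ds),
        output_signature={"text": tf.TensorSpec(shape=(), dtype=tf.string)},
    )


seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.FunctionDataSource(
        dataset_fn=hf_dataset_fn,
        splits=["train", "validation"],
    ),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey, key_map={"inputs": None, "targets": "text"}
        ),
        seqio.preprocessors.tokenize,
        t5_preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=VOCAB, add_eos=True, required=False),
        "targets": seqio.Feature(vocabulary=VOCAB, add_eos=True),
    },
    metric_fns=[],
)
```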

The base model is not really that big, around 220M params. I train on a TPUv3-8 and it is also relatively fast. So maybe on A6000s with 48GB of VRAM each it is also fast enough. In any case, you need to carefully read the original paper and check your batch size so you know exactly how many steps to train your model on.
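(Rough ballpark, assuming the original T5 recipe of batches of about 2^16 ≈ 65,536 tokens: the paper pre-trains for 524,288 steps, roughly 34B tokens in total, so 10k steps is only around 650M tokens. With a smaller batch size the number of tokens seen per step drops accordingly, which is why you have to do this arithmetic for your own setup.)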

StephennFernandes commented 2 years ago

@versae thanks a ton for replying and clearing my doubts. Also, could you please tell me how you go about the vocabulary? In the T5 tasks.py I see that they use a default tokenizer, while you in tasks.py point to a GCS bucket location.

Also, after completing a test run I really couldn't find any separate tokenizer file, and I'm not sure where the tokenizer file was saved/created.

Could you please tell me a bit more in detail about the vocabulary?

versae commented 2 years ago

If you want to use your own vocabulary you must pre-train your model. Here's an example of how to load your own tokenizer/vocabulary: https://github.com/bertin-project/bertin-t5x/blob/main/pretrain_t5_1_1_base.gin#L11 But beware, you also need to specify exactly the same vocab in your task: https://github.com/bertin-project/bertin-t5x/blob/main/tasks.py#L53

In the current tasks.py code, the default vocab uses whatever is defined in the original gin file. If doing finetuning, it will be the vocab used to pre-train the model.
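As a concrete sketch of keeping the two in sync (the gin lines are the ones from the repo's pretrain gin; the Python side is an illustrative equivalent, not the verbatim tasks.py code):

```python
import seqio

# The gin file loads the vocab roughly like this:
#   VOCABULARY = @seqio.SentencePieceVocabulary()
#   seqio.SentencePieceVocabulary.sentencepiece_model_file = "gs://bertin-project/t5/vocabs/oscar/es_32000_bpe.sp.model"
#   seqio.SentencePieceVocabulary.extra_ids = 100
#
# The task must build the exact same vocabulary object, e.g.:
vocabulary = seqio.SentencePieceVocabulary(
    "gs://bertin-project/t5/vocabs/oscar/es_32000_bpe.sp.model", extra_ids=100
)

output_features = {
    "inputs": seqio.Feature(vocabulary=vocabulary, add_eos=True, required=False),
    "targets": seqio.Feature(vocabulary=vocabulary, add_eos=True),
}
```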

StephennFernandes commented 2 years ago

@versae, actually I am a bit confused about what lines 12 and 13 in the gin file mean: https://github.com/bertin-project/bertin-t5x/blob/main/pretrain_t5_1_1_base.gin#L12

As per my assumption, VOCABULARY = @seqio.SentencePieceVocabulary() specifies that a vocab is created using a seqio method, and seqio.SentencePieceVocabulary.sentencepiece_model_file = "gs://bertin-project/t5/vocabs/oscar/es_32000_bpe.sp.model" specifies the location where the SentencePiece model file is saved.

So in my case, as I am training locally, I would specify a location to save the corresponding sentencepiece .model file once the tokenizer is trained, right?

e.g.: seqio.SentencePieceVocabulary.sentencepiece_model_file = "pretrain_model/my_custom_32000_bpe.sp.model"

Am I right here?

```
VOCABULARY = @seqio.SentencePieceVocabulary()
seqio.SentencePieceVocabulary.sentencepiece_model_file = "gs://bertin-project/t5/vocabs/oscar/es_32000_bpe.sp.model"
seqio.SentencePieceVocabulary.extra_ids = 100
```

Also, could you please tell me what extra_ids = 100 means, and how to configure the vocab to have more than 32,000 tokens?

StephennFernandes commented 2 years ago

@versae okay, so I went through your gin files and I could see different types of tokenizers. Could you please show me how you configured and trained your tokenizers before pretraining?

Is there a separate procedure to generate a tokenizer .model file and then use it in the .gin file for pretraining?

versae commented 2 years ago

Yes, there is. You need to use, for example, SentencePiece.
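For instance, a minimal sketch with the sentencepiece Python package (the corpus file and output prefix are placeholders):

```python
import sentencepiece as spm

# Train a 32k BPE vocab on a plain-text corpus (one sentence/document per line).
# This writes my_custom_32000_bpe.sp.model and my_custom_32000_bpe.sp.vocab;
# the .model file is what the gin config then points at.
# Note: the 100 extra_ids (T5's sentinel tokens) are added on top by seqio,
# not by SentencePiece itself.
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",
    model_prefix="pretrain_model/my_custom_32000_bpe.sp",
    vocab_size=32000,
    model_type="bpe",        # or "unigram"
    character_coverage=1.0,  # full character coverage; the default is 0.9995
)
```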

StephennFernandes commented 2 years ago

@versae, thanks for replying. Based on your training, which tokenizer gave you the best results, and which tokenizer would you recommend going with?

Actually, I'll be pretraining mT5 on 23 Indian languages. So what would be better hparams to train the tokenizer with?

versae commented 2 years ago

Hi, if you are using mT5, then there is no need to train your own tokenizer. You can just use the mT5 tokenizer as long as the characters are recognized by mT5. If they are not, then you'll need to train your own. There's some evidence that taking a pre-trained mT5 or an English T5 model and then switching to another tokenizer might yield some gains. In my experience, that's not really the case when the languages are too different.
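For reference, loading the released mT5 vocabulary in a task looks roughly like this (the GCS path is the one referenced by the public mT5/t5x configs; double-check it against the gin file you actually use):

```python
import seqio

# mT5's 250k-token multilingual SentencePiece model, shared across all mT5 sizes.
MT5_VOCAB = seqio.SentencePieceVocabulary(
    "gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model"
)
```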

Regarding which one is better: I saw no significant difference between Unigram and BPE.