What matters is the total number of tokens in your dataset. 60M documents at, let's say, an average of 100 tokens per document makes 6B tokens. If you are training on sequence lengths of 512 with a batch size of 64 per core on a TPUv3-8 with 8 cores, then consuming 1 epoch takes 6B / (512 * 64 * 8) ≈ 23k steps. If you train for 500k steps, you would be iterating over your dataset about 21 times (epochs).
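As a quick sanity check, here is the same arithmetic as a short Python sketch. The per-document token count and hardware figures are just the illustrative values from the paragraph above, not anything read from a real config:

```python
# Back-of-the-envelope epoch arithmetic, using the example numbers quoted above.
num_documents = 60_000_000
avg_tokens_per_doc = 100                      # assumed average
total_tokens = num_documents * avg_tokens_per_doc        # ~6B tokens

seq_len = 512
batch_per_core = 64
num_cores = 8                                 # TPUv3-8
tokens_per_step = seq_len * batch_per_core * num_cores   # 262,144 tokens/step

steps_per_epoch = total_tokens / tokens_per_step          # ~23k steps
epochs_at_500k_steps = 500_000 / steps_per_epoch          # ~21.8 epochs

print(f"steps per epoch: {steps_per_epoch:,.0f}")
print(f"epochs after 500k steps: {epochs_at_500k_steps:.1f}")
```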
Read the original T5 paper for guidance on how many steps to train for. If I remember correctly, the released models were trained with a total batch size of 2048 for a total of 1 trillion tokens.
@versae I had a doubt regarding how many training steps to train the model for, given a custom training dataset.
Currently I am training T5_1_1 on Hindi, and I have a dataset of 20GB with 60M+ samples. But when training for 500k steps with a batch size of 64, the trainer says it is training for 250 epochs.
(I am not sure how the math in the trainer works for estimating epochs: with 60M+ samples and a batch size of 64, how could it reach 250 epochs in 500k steps?)
Could you please tell me how many epochs it is ideal and recommended to train the model for? (I have seen people in the t5x repo reporting bad downstream task performance after the model was trained for too long.)
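For reference, a naive sanity check of the epoch estimate in the question above, counting one example per batch element per step (no sequence packing, and ignoring whatever accounting the trainer actually uses):

```python
# Naive epoch estimate from the figures stated in the question. This ignores the
# trainer's own accounting (e.g. sequence packing), so a mismatch with its
# reported 250 epochs is exactly the confusion being asked about.
num_examples = 60_000_000   # "60M+ samples"
batch_size = 64
train_steps = 500_000

examples_seen = train_steps * batch_size        # 32M examples
naive_epochs = examples_seen / num_examples      # ~0.53 epochs

print(f"examples seen: {examples_seen:,}")
print(f"naive epochs:  {naive_epochs:.2f}")
```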