Closed · boogiegames3 closed this 1 year ago

Hi,

I was able to successfully fine-tune with the training script you provided, using EleutherAI/pythia-12b as the base model, but I am trying to verify that I did the same thing you did. I am wondering how long training is supposed to run on 8 x A100.

Can you share the TensorBoard for the training of the model you published on HuggingFace, to see how that run went?

Thanks in advance
1 epoch on 8 A100s is probably several hours; it depends a bit on what data you use and what settings you change. I don't think the TensorBoard logs from that run are lying around, but you can just run the training script, watch the metrics for the first 0.01 epochs or so, and extrapolate.
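For example, a rough extrapolation could look like this (the timing numbers below are placeholders, not measurements from this repo):

```python
# Rough extrapolation from a short partial run: time a small, known fraction
# of an epoch, then scale linearly. Numbers here are illustrative only.
observed_fraction = 0.01   # fraction of one epoch actually run
observed_seconds = 180.0   # wall-clock time for that fraction (placeholder)
epochs = 2

estimated_total_seconds = observed_seconds / observed_fraction * epochs
print(f"Estimated total training time: {estimated_total_seconds / 3600:.1f} h")
```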
Interesting. The reason I ask is that it seems to be taking only about 1 hour for me, which seems suspect :-). I am using the default deepspeed config provided and the default dolly-15k fine-tuning dataset, with the default configuration:
!deepspeed --num_gpus=8 \
--module training.trainer \
--input-model EleutherAI/pythia-12b \
--deepspeed {deepspeed_config} \
--epochs 2 \
--local-output-dir {local_output_dir} \
--dbfs-output-dir {dbfs_output_dir} \
--per-device-train-batch-size 6 \
--per-device-eval-batch-size 6 \
--logging-steps 10 \
--save-steps 200 \
--save-total-limit 20 \
--eval-steps 50 \
--warmup-steps 50 \
--test-size 200 \
--lr 5e-6
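For reference, here is the step arithmetic these flags imply, assuming databricks-dolly-15k has roughly 15,000 records (a sketch, not output from the actual trainer):

```python
import math

# Steps implied by the command above; the dataset size is an assumption
# (databricks-dolly-15k is roughly 15k records).
dataset_size = 15_000
test_size = 200          # --test-size 200
per_device_batch = 6     # --per-device-train-batch-size 6
num_gpus = 8             # --num_gpus=8
epochs = 2               # --epochs 2

train_examples = dataset_size - test_size
effective_batch = per_device_batch * num_gpus     # 48 examples per optimizer step
steps_per_epoch = math.ceil(train_examples / effective_batch)
print(steps_per_epoch, steps_per_epoch * epochs)  # ~309 per epoch, ~618 total
```

That lines up with the roughly 600 steps discussed below.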
That's the time for a full epoch? That's faster than I remember, but I could be misremembering, as I last did this a while ago. Maybe something improved in a library somewhere.
No, the time for 2 full epochs. It took 1 hour for 2 epochs on 8 x A100 80GB.
Just double-checking: that shows about 600 steps, which at a batch size of 6 is 3,600 inputs, not nearly a full epoch. Is what's above what you mean, or are you definitely sure it was 2 epochs (more like 5,000 steps at a batch size of 6)? If so, then it's likely I just misremember, as I only tried this once, early on.
That's what I mean.
Am I reading that graph incorrectly or is it wrong?
Oh yes, I see; I'm forgetting you have 8 GPUs, so the batches are much bigger (6 per device × 8 GPUs = 48 per step, so ~600 steps is about 2 epochs of ~15k examples). Never mind that. I suspect I am just misremembering the time to train. If it works, it works!
I see.
A related question: the eval loss trends upward with the arguments from the train_dolly.py script, even when I make the test size bigger. That's part of the reason I was asking whether you had the TensorBoard logs from training the published model; it would be great to be able to compare, for reproducibility's sake.
That looks familiar. Train loss drops at the end of each epoch and is otherwise mostly flat. Eval loss may or may not go down at first, but one quickly runs into overfitting with models like this, and eval loss increases after an epoch or two. I'm not sure we had a great explanation; it seems, generally, like overfitting/memorization: train loss improves once the model starts seeing inputs it already saw in training, and eval loss gets worse quickly because of that. The answers would really be more data or a smaller model. 12B is just overkill, I think, unless you have a lot of fine-tuning data.
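If you want to guard against that automatically, one option is early stopping on eval loss. A minimal sketch, assuming the repo's trainer is built on Hugging Face's Trainer (the argument values here are illustrative, not the repo's defaults):

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Stop training when eval loss stops improving, and keep the best checkpoint.
# Assumes `model`, `train_ds`, and `eval_ds` are already constructed elsewhere.
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",       # `eval_strategy` in newer transformers
    eval_steps=50,
    save_steps=200,                    # must be a multiple of eval_steps
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```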
Thank you.