databrickslabs / dolly

Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform
https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
Apache License 2.0

How long does it take to train 2 epochs on 8 x NVIDIA A100 #171

Closed: boogiegames3 closed this issue 1 year ago

boogiegames3 commented 1 year ago

Hi:

I was able to fine-tune successfully with the training script you provided, using EleutherAI/pythia-12b as the base model, but I am trying to verify that I did the same thing you did.

I am wondering how long training is supposed to run on 8 x A100.

Can you share the TensorBoard logs so I can see how training went for the model you published on Hugging Face?

Thanks in advance

srowen commented 1 year ago

One epoch on 8 A100s is probably several hours. It depends a bit on what data you use and which settings you change. I don't think the TensorBoard logs from that run are still lying around, but you can just run the training script, watch the metrics, let it progress to 0.01 epochs or so, and extrapolate.
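
To make that extrapolation concrete, here is a minimal sketch of the arithmetic, assuming you note the wall-clock time for whatever fraction of an epoch you let the run reach (the numbers below are hypothetical placeholders, not measurements from this thread):

# Rough extrapolation of total training time from a short partial run.
# The measured values here are hypothetical -- substitute your own observations.
fraction_of_epoch = 0.01        # how far you let training progress
minutes_observed = 2.5          # wall-clock minutes for that fraction
num_epochs = 2

minutes_per_epoch = minutes_observed / fraction_of_epoch
total_hours = minutes_per_epoch * num_epochs / 60
print(f"~{minutes_per_epoch:.0f} min/epoch, ~{total_hours:.1f} h for {num_epochs} epochs")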

boogiegames3 commented 1 year ago

Interesting. The reason I ask is that it seems to be taking only 1 hour for me, which seems suspect :-). I am using the default DeepSpeed config provided and the default dolly-15k fine-tuning dataset with the default configuration:

!deepspeed --num_gpus=8 \
     --module training.trainer \
     --input-model EleutherAI/pythia-12b \
     --deepspeed {deepspeed_config} \
     --epochs 2 \
     --local-output-dir {local_output_dir} \
     --dbfs-output-dir {dbfs_output_dir} \
     --per-device-train-batch-size 6 \
     --per-device-eval-batch-size 6 \
     --logging-steps 10 \
     --save-steps 200 \
     --save-total-limit 20 \
     --eval-steps 50 \
     --warmup-steps 50 \
     --test-size 200 \
     --lr 5e-6
[Screenshot 2023-05-17 8:00 AM] [Screenshot 2023-05-17 8:02 AM]

srowen commented 1 year ago

That's the time for a full epoch? That's faster than I remember, but I could be misremembering, since I last did this a while ago. Maybe something improved in a library somewhere.

boogiegames3 commented 1 year ago

No, the time for 2 full epochs. It took 1 hour for 2 epochs on 8 x A100 80GB.

srowen commented 1 year ago

Just double-checking: that shows about 600 steps, which at a batch size of 6 is 3,600 inputs, not nearly a full epoch. Is the screenshot above what you mean, or are you definitely sure it was 2 epochs (more like 5,000 steps at a batch size of 6)? If so, then it's likely I just misremember, as I only tried this once early on.

boogiegames3 commented 1 year ago

That's what I mean.

[Screenshot 2023-05-17 3:47 PM]

Am I reading that graph incorrectly or is it wrong?

srowen commented 1 year ago

Oh yes, I see, and I'm forgetting you have 8 GPUs, so the effective batches are much bigger. Never mind that. I suspect I am just misremembering the time to train. If it works, it works!
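
For reference, a back-of-the-envelope step count for the configuration above, assuming roughly 15,000 training rows (dolly-15k minus the 200-row test split), no gradient accumulation, and data parallelism across all 8 GPUs, comes out close to the ~600 steps in the screenshot:

# Rough step-count check; the row count and the no-gradient-accumulation
# assumption are mine, not something stated in the thread.
dataset_rows = 15000 - 200          # approx. dolly-15k minus --test-size 200
per_device_batch = 6
num_gpus = 8
epochs = 2

effective_batch = per_device_batch * num_gpus      # 48 examples per optimizer step
steps_per_epoch = dataset_rows // effective_batch  # ~308
total_steps = steps_per_epoch * epochs             # ~616
print(effective_batch, steps_per_epoch, total_steps)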

boogiegames3 commented 1 year ago

I see.

Related question: the eval loss trends upwards with the arguments from the train_dolly.py script, even when I make the test size bigger. That's part of the reason I was asking whether you had the TensorBoard logs from the training of the published model; it would be great to be able to compare them for reproducibility's sake.

[Screenshot 2023-05-18 9:44 AM]
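
If you do end up with TensorBoard event files from two runs, a small script along these lines can pull out the loss curves for a side-by-side comparison. This is only a sketch: the log directory and the train/loss and eval/loss tag names are assumptions about where the Hugging Face Trainer writes its scalars, not something fixed by the dolly training script, so adjust them to your run.

# Sketch: dump train/eval loss scalars from a TensorBoard run directory.
# The path and tag names below are assumptions -- check acc.Tags() for yours.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(logdir, tag):
    acc = EventAccumulator(logdir)
    acc.Reload()
    return [(e.step, e.value) for e in acc.Scalars(tag)]

logdir = "local_output_dir/runs"    # hypothetical location of the event files
for tag in ("train/loss", "eval/loss"):
    points = load_scalar(logdir, tag)
    print(tag, points[:3], "...", points[-1] if points else None)
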
srowen commented 1 year ago

That looks familiar. Train loss drops at the end of each epoch and is otherwise mostly flat. Eval loss may or may not go down at first, but with models like this one quickly runs into overfitting, and eval loss increases after an epoch or two. I'm not sure we ever had a great explanation; it generally looks like overfitting / memorization: train loss improves once the model starts seeing examples it has already seen in training, and eval loss quickly gets worse because of that. The answers would really be more data or a smaller model. 12B is just overkill, I think, unless you have a lot of fine-tuning data.

boogiegames3 commented 1 year ago

Thank you.