Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0
6.95k stars 733 forks

Run evaluation at end of training #1332

Closed · awaelchli closed 1 month ago

awaelchli commented 1 month ago

Runs a final validation loop at the end of training, over the entire validation set. Also refactors the sample generation into a separate function, so the final validation loop can run without producing a sample.
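A minimal sketch of what such a final validation pass looks like, assuming a PyTorch-style model and an iterable of `(inputs, targets)` batches. The `validate` helper and its `max_iters` parameter are illustrative, not litgpt's actual API: `max_iters=None` walks the entire validation set (the final evaluation), while a small `max_iters` keeps periodic checks during training cheap.

```python
import torch

@torch.no_grad()
def validate(model, val_dataloader, max_iters=None):
    """Average cross-entropy loss over the validation set.

    Hypothetical helper: max_iters=None iterates the whole set, as in
    the final evaluation; a small max_iters bounds periodic checks.
    """
    model.eval()
    losses = []
    for i, (inputs, targets) in enumerate(val_dataloader):
        if max_iters is not None and i >= max_iters:
            break
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        losses.append(loss)
    model.train()
    return torch.stack(losses).mean()
```

Keeping the loop free of sample generation is what makes it reusable both during training and for the final full-set pass.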

rasbt commented 1 month ago

That's very nice. I tried to do something similar in #1228 but I like your approach much better. I suggest closing #1228 in favor of your PR.

The other thing is, what do you think about computing the initial validation set loss, too? The reasons are:

1) some users got confused about the n/a we currently print.

2) it's useful to know how much the model improved from the beginning.
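The placement being suggested can be sketched as a training skeleton that bookends the loop with validation passes. All names here (`train`, `model_step`, `evaluate`) are hypothetical stand-ins, not litgpt's API: `evaluate()` returns the current validation loss, `model_step()` performs one optimizer step.

```python
def train(model_step, evaluate, max_steps):
    """Skeleton showing where an initial validation pass fits.

    Evaluating once before the first step replaces the confusing "n/a"
    in the first log lines and gives a baseline, so the improvement
    initial_val - final_val is measurable after training.
    """
    history = {"initial_val": evaluate(), "final_val": None}
    for step in range(max_steps):
        model_step()
    history["final_val"] = evaluate()
    return history
```

The initial pass costs one extra sweep over (part of) the validation set, which is usually negligible next to the training run itself.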

awaelchli commented 1 month ago

@rasbt I wasn't aware of your PR. It's doing a lot of different things that could be discussed separately. For this one I just needed the final validation loss to be able to put it into the benchmark table.

rasbt commented 1 month ago

> @rasbt I wasn't aware of your PR. It's doing a lot of different things that could be discussed separately. For this one I just needed the final validation loss to be able to put it into the benchmark table.

No worries, you can ignore the other PR. I can rebase it based on this PR later.

rasbt commented 1 month ago

Since the newly added lines are not very wide, should we spell out ppl as perplexity? It might be less ambiguous to a general audience.

```
Epoch 1 | iter 74 step 4 | loss train: 1.552, val: n/a | iter time: 134.38 ms
Epoch 1 | iter 75 step 4 | loss train: 1.623, val: n/a | iter time: 134.42 ms
Epoch 1 | iter 76 step 4 | loss train: 1.517, val: n/a | iter time: 150.71 ms
Epoch 1 | iter 77 step 4 | loss train: 1.562, val: n/a | iter time: 148.68 ms
Epoch 1 | iter 78 step 4 | loss train: 1.546, val: n/a | iter time: 161.03 ms
Epoch 1 | iter 79 step 4 | loss train: 1.655, val: n/a | iter time: 154.14 ms
Epoch 1 | iter 80 step 5 | loss train: 1.623, val: n/a | iter time: 165.89 ms (step)
Training time: 15.71s
Memory used: 19.36 GB
Validating ...
Final evaluation | val loss: 1.6684 | val perplexity: 5.3034
```
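For readers wondering how the two final numbers relate: perplexity is simply the exponential of the average cross-entropy loss (in nats), so the log line above can be checked directly. A quick sketch:

```python
import math

def perplexity(nll: float) -> float:
    """Perplexity is exp of the average cross-entropy loss in nats."""
    return math.exp(nll)

# From the final log line above: val loss 1.6684 -> perplexity ~5.30
# (the logged 5.3034 reflects the unrounded loss value)
print(perplexity(1.6684))
```

Spelling out "perplexity" in the log, as suggested, makes this relationship clearer to a general audience than the abbreviation "ppl".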