JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.

Reproduce the result when freezing parameters #15

Closed · sleeepeer closed 1 year ago

sleeepeer commented 1 year ago

Hi, thank you for this wonderful work.

I ran into some trouble reproducing the head-only results. I can reproduce your results with end-to-end finetuning, but when I freeze the BERT (encoder) parameters and only tune the classification head, the result is not as good as with your checkpoint.

The SST-2 accuracy of your checkpoint at https://huggingface.co/JonasGeiping/crammed-bert is 0.922 (end-to-end) and 0.918 (head-only) in my reproduction. The bert-base-uncased accuracy (from HuggingFace) is 0.931 (end-to-end) and 0.930 (head-only).

I downloaded the c4-subset-processed data from your Dropbox link and replicated your run with:

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

The end-to-end accuracy on SST-2 is 0.922, but the head-only accuracy is only 0.784. I'm wondering what causes this gap.

I freeze the encoder parameters like this:

for param in model.encoder.parameters():  # freeze the encoder; only the classification head stays trainable
    param.requires_grad = False

I also want to know how the checkpoint at https://huggingface.co/JonasGeiping/crammed-bert was trained. Was it produced by the command above?

Thanks again for your time!

JonasGeiping commented 1 year ago

Hi,

you can find the hyperparameters the model was trained with at https://huggingface.co/JonasGeiping/crammed-bert/tree/main. There are three JSON files, for architecture, data, and training. These hyperparameters match the training command you tried.
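For reference, they can be pulled down and inspected directly. A minimal sketch using the huggingface_hub client (the JSON file names are discovered rather than assumed):

import json
from huggingface_hub import hf_hub_download, list_repo_files

repo = "JonasGeiping/crammed-bert"
# the repo hosts separate JSON files for the architecture, data, and training settings
for fname in [f for f in list_repo_files(repo) if f.endswith(".json")]:
    path = hf_hub_download(repo_id=repo, filename=fname)
    with open(path) as f:
        print(fname, json.dumps(json.load(f), indent=2))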

Which head-only results are you referring to, though? I haven't investigated these models in head-only finetuning so far. It is interesting that there is a discrepancy. Over how many finetuning runs are these values measured?

sleeepeer commented 1 year ago

Hi, thank you for replying!

I've tried several times, including expanding the time budget to 48 hours and adjusting the warmup/cooldown steps, but the results are quite similar (SST-2 head-only 0.76-0.78). Your checkpoint and the bert-base-uncased from HuggingFace do much better (SST-2 head-only 0.92-0.93) in head-only finetuning.

I checked my training log and found that training can't reach the 600,000-step setting within the 24-hour budget; maybe I should train longer? I also found that the loss stayed around 2.x from step 50,000 until the end, with no significant descent.

[2023-04-09 19:14:58,991] Train loss 2.1326 at step 192000 with lr 0.00080. [Avg: 2.2698]  Perf: 0.2418s per step (67761t/s). Estimated Total Train: 1 day, 16:17:54.853706.
[2023-04-09 19:19:00,805] Train loss 2.1038 at step 193000 with lr 0.00080. [Avg: 2.2686]  Perf: 0.2418s per step (67755t/s). Estimated Total Train: 1 day, 16:18:08.102245.
[2023-04-09 19:23:02,660] Train loss 2.1588 at step 194000 with lr 0.00080. [Avg: 2.2645]  Perf: 0.2419s per step (67743t/s). Estimated Total Train: 1 day, 16:18:32.933493.

[...]
[2023-04-10 05:59:22,614] Train loss 2.1526 at step 352000 with lr 0.00003. [Avg: 2.0161]  Perf: 0.2416s per step (67818t/s). Estimated Total Train: 1 day, 16:15:53.389692.
[2023-04-10 06:03:24,192] Train loss 1.9772 at step 353000 with lr 0.00002. [Avg: 2.0155]  Perf: 0.2416s per step (67821t/s). Estimated Total Train: 1 day, 16:15:46.855116.
[2023-04-10 06:07:25,780] Train loss 2.0221 at step 354000 with lr 0.00001. [Avg: 2.0169]  Perf: 0.2416s per step (67818t/s). Estimated Total Train: 1 day, 16:15:52.390909.
[2023-04-10 06:10:02,345] Reached deadline. Stopping training ...

Is there maybe something I can do with the LR schedule? I tried adjusting the warmup/cooldown steps, but nothing seems to change.

Your checkpoint on HuggingFace was trained with this command, right? Or did I miss something important?

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

Thanks for your time!

JonasGeiping commented 1 year ago
sleeepeer commented 1 year ago

Thank you for the advice.

I'd appreciate any further suggestions!

JonasGeiping commented 1 year ago

Thanks, this all makes sense to me: this is the correct evaluation, and your modification is also how I would do head-only finetuning. It's mysterious that it doesn't always work.

If you want to make a change to the architecture, I could imagine that head-only finetuning would improve when finetuning on a deeper representation. This could be achieved by setting arch.skip_head_transform=False to re-enable the pretraining head.
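For example, appended as one more override to the pretraining command used above (assuming it is passed Hydra-style like the other arch.* options; this exact invocation is a sketch, not a tested command):

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed arch.skip_head_transform=False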

sleeepeer commented 1 year ago

Hi, thanks for your help (sorry to take up so much of your time🥲).

What is strange is that the end-to-end and the head-only evals of the .bin model both take about 17 minutes:

[2023-04-14 07:09:11,643] trainable parameters:789506
[2023-04-14 07:12:41,019] Train loss 0.5227 at step 4208 with lr 0.00004. [Avg: 0.3266] after epoch 0. Perf: 3.4896min per epoch (41170t/s). Estimated Total Train: 0:17:26.872690.
[2023-04-14 07:12:41,020] Validation metric is accuracy: 0.8865 after epoch 0.
[2023-04-14 07:15:46,418] Train loss 0.0180 at step 4208 with lr 0.00003. [Avg: 0.1616] after epoch 1. Perf: 3.0899min per epoch (46495t/s). Estimated Total Train: 0:15:26.984502.
[2023-04-14 07:15:46,419] Validation metric is accuracy: 0.9186 after epoch 1.
[2023-04-14 07:19:12,715] Train loss 0.0435 at step 4208 with lr 0.00002. [Avg: 0.1123] after epoch 2. Perf: 3.4383min per epoch (41785t/s). Estimated Total Train: 0:17:11.478297.
[2023-04-14 07:19:12,717] Validation metric is accuracy: 0.9197 after epoch 2.

The trainable parameter counts are 789506 (head-only) and 120379395 (end-to-end, the same as the total number of model parameters).

[2023-04-14 07:32:00,502] Model with architecture ScriptableMaskedLM loaded with 120,379,395 parameters

But with the .pth model, the head-only eval takes only about 5 minutes (with the same trainable parameter counts), while the end-to-end eval still takes 17 minutes.

[2023-04-14 07:40:40,392] trainable parameters:789506
[2023-04-14 07:41:39,921] Train loss 0.3934 at step 4208 with lr 0.00004. [Avg: 0.5853] after epoch 0. Perf: 0.9921min per epoch (144807t/s). Estimated Total Train: 0:04:57.639380.
[2023-04-14 07:41:39,923] Validation metric is accuracy: 0.7420 after epoch 0.

And as you can see above⬆️, the accuracy is significantly lower than with the .bin model.

So maybe the parameters of the .bin model were not really frozen (given the 17-minute runtime)? It seems the problem comes from the HF library. Moreover, there were only 1538 trainable parameters when I did the head-only eval on bert-base-uncased from HF (1538 = 768 × 2 weights + 2 biases, i.e. just the final classification layer), but it still performed well (SST-2 0.93).
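These counts can be reproduced with a short snippet in plain PyTorch, independent of either loading path:

# count parameters that will actually receive gradient updates
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable}")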

Looking forward to your reply.

JonasGeiping commented 1 year ago

Ok, so part of the story seems to be that freezing works differently for the two ways of loading the model. I have no idea why that would be the case, though; maybe you want to delve a bit deeper there and print out the trainable parameters. For example, you could print the names of the parameters that change after a model_engine.step() for each of these models.
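Something along these lines might work (a hypothetical debugging sketch; the model_engine name follows the snippets in this thread, and the exact step() call is an assumption about the training loop):

import torch

# snapshot every parameter before one update
before = {name: p.detach().clone()
          for name, p in model_engine.model.named_parameters()}

model_engine.step(batch)  # assumption: one forward/backward/update step

# report which parameters actually moved
changed = [name for name, p in model_engine.model.named_parameters()
           if not torch.equal(before[name], p.detach())]
print("parameters that changed:", changed)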

After thinking about this whole thing for a bit more, though, here's something else that is going on: there is a pooler in BERT, which consists of a linear layer, a nonlinearity, dropout, and token selection. This pooler is placed in different components in the hf-bert-base model and in my implementation. On HF, the pooler is part of the encoder, whereas for me it is a separate component. So freezing the encoder will not actually freeze the pooler (this might be why you are seeing 789506 trainable parameters).
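A sketch of the corresponding fix for this implementation (encoder, pooler, and head are all confirmed as separate attributes later in this thread):

# freeze the pooler as well as the encoder, since here the pooler is a separate component
for module in (model.encoder, model.pooler):
    for param in module.parameters():
        param.requires_grad = False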

Why is it this way?

sleeepeer commented 1 year ago

Thanks for this wonderful reply. I'm so glad to learn all of these details!

For the .pth format model (saved by pretrain.py), model_engine.model and model are the same object, which means that freezing one of model_engine.model.encoder.parameters() and model.encoder.parameters() also freezes the other.

This is why the following code⬇️ took effect and caused the drop in accuracy.

for param in model.encoder.parameters():  # the pooler is frozen the same way
    param.requires_grad = False

Here are the debug outputs for the .pth model:

list(model_engine.model.encoder.parameters())[0].requires_grad
False
list(model_engine.model.pooler.parameters())[0].requires_grad
False
list(model_engine.model.head.parameters())[0].requires_grad
True

list(model.encoder.parameters())[0].requires_grad
False
list(model.pooler.parameters())[0].requires_grad
False
list(model.head.parameters())[0].requires_grad
True

But with the .bin format model (your checkpoint, saved and pushed to HF by load_local_model.py), model_engine.model and model are no longer a single object. (This may be caused by the HF library.) This means that the parameters of model_engine.model, which is the object actually used for training, cannot be frozen by freezing model.encoder.parameters(); it has to be done like this:

for param in model_engine.model.encoder.parameters():  # the pooler is frozen the same way
    param.requires_grad = False
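
A quick way to confirm this aliasing difference between the two loading paths (a sketch using the names from this thread):

# True for the .pth loading path, False for the .bin path
print(model_engine.model is model)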

Here are the debug outputs for the .bin model (with the "model" parameters set to frozen):

list(model_engine.model.encoder.parameters())[0].requires_grad
True
list(model_engine.model.pooler.parameters())[0].requires_grad
True
list(model_engine.model.head.parameters())[0].requires_grad
True

list(model.encoder.parameters())[0].requires_grad
False
list(model.pooler.parameters())[0].requires_grad
False
list(model.head.parameters())[0].requires_grad
True

As shown above⬆️, model_engine.model was not frozen. I also checked the parameter data, and there was no problem with it (a parameter with requires_grad=True changes after model_engine.step()).

Then I ran an eval on both your checkpoint (.bin) and my crammed model (.pth) with model_engine.model frozen. They finally produced the same result🤣 (accuracy: 0.78, train time: 5 minutes).

Thank you very much for answering the questions these days!