JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.

Reproduce the result when freezing parameters #15

Closed · sleeepeer closed 1 year ago

sleeepeer commented 1 year ago

Hi, thank you for this wonderful work.

I ran into some trouble reproducing the head-only results. I can reproduce your results with end-to-end finetuning, but when I freeze the BERT (encoder) parameters and only tune the classification head, the result is not as good as with your checkpoint.

The SST-2 accuracy of your checkpoint at https://huggingface.co/JonasGeiping/crammed-bert is 0.922 (end-to-end) and 0.918 (head-only) in my reproduction. The bert-base-uncased accuracy (from HuggingFace) is 0.931 (end-to-end) and 0.930 (head-only).

I downloaded the c4-subset-processed data from your Dropbox link and replicated your run with:

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

The end-to-end accuracy on SST-2 is 0.922, but the head-only accuracy is only 0.784. I'm wondering what causes this gap.

I freeze the encoder parameters like this:

for param in model.encoder.parameters():  # freeze the encoder; only the classification head stays trainable
    param.requires_grad = False

I also want to know how the checkpoint at https://huggingface.co/JonasGeiping/crammed-bert was trained. Was it produced by the command above?

Thanks again for your time!

JonasGeiping commented 1 year ago

Hi,

you can find the hyperparameters the model was trained with at https://huggingface.co/JonasGeiping/crammed-bert/tree/main. There are three JSON files, for architecture, data, and training. These hyperparameters match the training command you tried.
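For reference, they can be pulled down and inspected directly. A minimal sketch using the huggingface_hub client (the JSON file names are discovered rather than assumed):

import json
from huggingface_hub import hf_hub_download, list_repo_files

repo = "JonasGeiping/crammed-bert"
# the repo hosts separate JSON files for the architecture, data, and training settings
for fname in [f for f in list_repo_files(repo) if f.endswith(".json")]:
    path = hf_hub_download(repo_id=repo, filename=fname)
    with open(path) as f:
        print(fname, json.dumps(json.load(f), indent=2))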

Which head-only results are you referring to, though? I haven't investigated these models in head-only finetuning so far. It is interesting that there is a discrepancy. Over how many finetuning runs are these values measured?

sleeepeer commented 1 year ago

Hi, thank you for replying!

I've tried several times, including expanding the time budget to 48 hours and adjusting the warmup/cooldown steps, but the results are quite similar (SST-2 head-only 0.76-0.78). Your checkpoint and the bert-base-uncased from HuggingFace do much better (SST-2 head-only 0.92-0.93) in head-only finetuning.

I checked my training log and found that training can't reach the 600,000-step setting within the 24-hour budget; maybe I should train longer? I also found that the loss stayed around 2.x from step 50,000 until the end, with no significant descent.

[2023-04-09 19:14:58,991] Train loss 2.1326 at step 192000 with lr 0.00080. [Avg: 2.2698]  Perf: 0.2418s per step (67761t/s). Estimated Total Train: 1 day, 16:17:54.853706.
[2023-04-09 19:19:00,805] Train loss 2.1038 at step 193000 with lr 0.00080. [Avg: 2.2686]  Perf: 0.2418s per step (67755t/s). Estimated Total Train: 1 day, 16:18:08.102245.
[2023-04-09 19:23:02,660] Train loss 2.1588 at step 194000 with lr 0.00080. [Avg: 2.2645]  Perf: 0.2419s per step (67743t/s). Estimated Total Train: 1 day, 16:18:32.933493.

[...]
[2023-04-10 05:59:22,614] Train loss 2.1526 at step 352000 with lr 0.00003. [Avg: 2.0161]  Perf: 0.2416s per step (67818t/s). Estimated Total Train: 1 day, 16:15:53.389692.
[2023-04-10 06:03:24,192] Train loss 1.9772 at step 353000 with lr 0.00002. [Avg: 2.0155]  Perf: 0.2416s per step (67821t/s). Estimated Total Train: 1 day, 16:15:46.855116.
[2023-04-10 06:07:25,780] Train loss 2.0221 at step 354000 with lr 0.00001. [Avg: 2.0169]  Perf: 0.2416s per step (67818t/s). Estimated Total Train: 1 day, 16:15:52.390909.
[2023-04-10 06:10:02,345] Reached deadline. Stopping training ...

Is there maybe something I can do with the LR schedule? I tried adjusting the warmup/cooldown steps, but nothing seems to change.

Your checkpoint on HuggingFace was trained with this command, right? Or did I miss something important?

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

Thanks for your time!

JonasGeiping commented 1 year ago
sleeepeer commented 1 year ago

Thank you for the advice.

I'd appreciate any further suggestions!

JonasGeiping commented 1 year ago

Thanks, this all makes sense to me: this is the correct evaluation, and your modification is also how I would do head-only finetuning. It's mysterious that it doesn't always work.

If you want to make a change to the architecture, I could imagine that head-only finetuning would improve when finetuning on a deeper representation. This could be achieved by setting arch.skip_head_transform=False to re-enable the pretraining head.
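For example, appended as one more override to the pretraining command used above (assuming it is passed Hydra-style like the other arch.* options; this exact invocation is a sketch, not a tested command):

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed arch.skip_head_transform=False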

sleeepeer commented 1 year ago

Hi, thanks for your help (sorry to take up so much of your time🥲).

What is strange is that the end-to-end and the head-only evals of the .bin model both take about 17 minutes:

[2023-04-14 07:09:11,643] trainable parameters:789506
[2023-04-14 07:12:41,019] Train loss 0.5227 at step 4208 with lr 0.00004. [Avg: 0.3266] after epoch 0. Perf: 3.4896min per epoch (41170t/s). Estimated Total Train: 0:17:26.872690.
[2023-04-14 07:12:41,020] Validation metric is accuracy: 0.8865 after epoch 0.
[2023-04-14 07:15:46,418] Train loss 0.0180 at step 4208 with lr 0.00003. [Avg: 0.1616] after epoch 1. Perf: 3.0899min per epoch (46495t/s). Estimated Total Train: 0:15:26.984502.
[2023-04-14 07:15:46,419] Validation metric is accuracy: 0.9186 after epoch 1.
[2023-04-14 07:19:12,715] Train loss 0.0435 at step 4208 with lr 0.00002. [Avg: 0.1123] after epoch 2. Perf: 3.4383min per epoch (41785t/s). Estimated Total Train: 0:17:11.478297.
[2023-04-14 07:19:12,717] Validation metric is accuracy: 0.9197 after epoch 2.

The trainable parameter counts are 789506 (head-only) and 120379395 (end-to-end, the same as the total number of model parameters).

[2023-04-14 07:32:00,502] Model with architecture ScriptableMaskedLM loaded with 120,379,395 parameters

But with the .pth model, the head-only eval takes only about 5 minutes (with the same trainable parameter counts), while the end-to-end eval still takes 17 minutes.

[2023-04-14 07:40:40,392] trainable parameters:789506
[2023-04-14 07:41:39,921] Train loss 0.3934 at step 4208 with lr 0.00004. [Avg: 0.5853] after epoch 0. Perf: 0.9921min per epoch (144807t/s). Estimated Total Train: 0:04:57.639380.
[2023-04-14 07:41:39,923] Validation metric is accuracy: 0.7420 after epoch 0.

And as you can see above⬆️, the accuracy is significantly lower than with the .bin model.

So maybe the parameters of the .bin model were not really frozen (given the 17-minute runtime)? It seems the problem comes from the HF library. Moreover, there were only 1538 trainable parameters when I did the head-only eval on bert-base-uncased from HF (1538 = 768 × 2 weights + 2 biases, i.e. just the final classification layer), but it still performed well (SST-2 0.93).
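These counts can be reproduced with a short snippet in plain PyTorch, independent of either loading path:

# count parameters that will actually receive gradient updates
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable}")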

Looking forward to your reply.

JonasGeiping commented 1 year ago

Ok, so part of the story seems to be that freezing works differently for the two ways of loading the model. I have no idea why that would be the case, though; maybe you want to delve a bit deeper there and print out the trainable parameters. For example, you could print the names of the parameters that change after a model_engine.step() for each of these models.
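Something along these lines might work (a hypothetical debugging sketch; the model_engine name follows the snippets in this thread, and the exact step() call is an assumption about the training loop):

import torch

# snapshot every parameter before one update
before = {name: p.detach().clone()
          for name, p in model_engine.model.named_parameters()}

model_engine.step(batch)  # assumption: one forward/backward/update step

# report which parameters actually moved
changed = [name for name, p in model_engine.model.named_parameters()
           if not torch.equal(before[name], p.detach())]
print("parameters that changed:", changed)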

After thinking about this whole thing for a bit more, though, here's something else that is going on: there is a pooler in BERT, which consists of a linear layer, a nonlinearity, dropout, and token selection. This pooler is placed in different components in the hf-bert-base model and in my implementation. On HF, the pooler is part of the encoder, whereas for me it is a separate component. So freezing the encoder will not actually freeze the pooler (this might be why you are seeing 789506 trainable parameters).
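A sketch of the corresponding fix for this implementation (encoder, pooler, and head are all confirmed as separate attributes later in this thread):

# freeze the pooler as well as the encoder, since here the pooler is a separate component
for module in (model.encoder, model.pooler):
    for param in module.parameters():
        param.requires_grad = False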

Why is it this way?

sleeepeer commented 1 year ago

Thanks for this wonderful reply. I'm so glad to learn all of these details!

For the .pth format model (saved by pretrain.py), model_engine.model and model are the same object, which means that freezing one of model_engine.model.encoder.parameters() and model.encoder.parameters() also freezes the other.

This is why the following code⬇️ took effect and caused the drop in accuracy.

for param in model.encoder.parameters():  # the pooler is frozen the same way
    param.requires_grad = False

Here are the debug outputs for the .pth model:

list(model_engine.model.encoder.parameters())[0].requires_grad
False
list(model_engine.model.pooler.parameters())[0].requires_grad
False
list(model_engine.model.head.parameters())[0].requires_grad
True

list(model.encoder.parameters())[0].requires_grad
False
list(model.pooler.parameters())[0].requires_grad
False
list(model.head.parameters())[0].requires_grad
True

But with the .bin format model (your checkpoint, saved and pushed to HF by load_local_model.py), model_engine.model and model are no longer a single object. (This may be caused by the HF library.) This means that the parameters of model_engine.model, which is the object actually used for training, cannot be frozen by freezing model.encoder.parameters(); it has to be done like this:

for param in model_engine.model.encoder.parameters():  # the pooler is frozen the same way
    param.requires_grad = False
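
A quick way to confirm this aliasing difference between the two loading paths (a sketch using the names from this thread):

# True for the .pth loading path, False for the .bin path
print(model_engine.model is model)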

Here are the debug outputs for the .bin model (with the "model" parameters set to frozen):

list(model_engine.model.encoder.parameters())[0].requires_grad
True
list(model_engine.model.pooler.parameters())[0].requires_grad
True
list(model_engine.model.head.parameters())[0].requires_grad
True

list(model.encoder.parameters())[0].requires_grad
False
list(model.pooler.parameters())[0].requires_grad
False
list(model.head.parameters())[0].requires_grad
True

As shown above⬆️, model_engine.model was not frozen. I also checked the parameter data, and there was no problem with it (a parameter with requires_grad=True changes after model_engine.step()).

Then I ran an eval on both your checkpoint (.bin) and my crammed model (.pth) with model_engine.model frozen. They finally produced the same result🤣 (accuracy: 0.78, train time: 5 minutes).

Thank you very much for answering the questions these days!