Separius / BERT-keras

Keras implementation of BERT with pre-trained weights
GNU General Public License v3.0

Fine-tuning BERT-keras #14

Closed astariul closed 5 years ago

astariul commented 5 years ago

I'm trying to fine-tune BERT-keras on the STS-B dataset.

Has anyone already used this repo to fine-tune BERT on an end-to-end task? Is there a code example for this?

I'm having difficulties making it work. My runtime dies before even training on a single batch...


You can take a look at my notebook here: Colab

@HighCWu

HighCWu commented 5 years ago

It should be a GPU memory problem. A batch size of 32, or even 8, is too large for the GPU: the Colab GPU has only 12 GB of memory. A batch size of 8 may be fine for the original model, but not for your new model, which has more trainable parameters. The Colab TPU also has only 12 GB of memory available, and because the official Keras TPU implementation has problems cloning models to the TPU, my attempt on TPU failed. On GPU, a batch size of 2 trains, and 4 may be OK. I don't know why your runtime died before a single batch while I succeeded with batch size 2. However, it's too slow on Google's free K80 GPU. Training on top of the BERT pretrained model is not cheap. @Colanim
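
If it helps, here is a minimal runnable sketch (not this repo's code; the tiny Dense model and random data are stand-ins) showing that the batch size is set at `fit()` time, which is the knob to turn down when the GPU runs out of memory:

```python
import numpy as np
from keras import layers, models

# Stand-in for the fine-tuning model; in practice this would wrap the
# pretrained BERT-keras Transformer with a regression head for STS-B.
model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(768,)),
    layers.Dense(1),  # STS-B predicts a single similarity score
])
model.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(256, 768).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

# The batch size is a fit() argument, not part of the model; lowering it
# (e.g. to 2 or 4) is what keeps one training step inside ~12 GB of GPU memory.
model.fit(x_train, y_train, batch_size=2, epochs=1)
```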

astariul commented 5 years ago

Thank you very much for your enlightening insights @HighCWu!

Indeed, I didn't reduce the batch size enough. It's working with batch_size = 4, but not with batch_size = 8.

As you said, training is way too slow. I think it's because the batch size is too small. I cannot apply this architecture to fine-tune on STS-B for now...


TPU is not working and I have no clue why. The error I receive is:

ValueError: Unknown layer: LayerNorm

However, my model does not use this layer directly; only the Transformer uses it. And if I try to compile only the Transformer model on the TPU, it works fine...
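
For what it's worth (I haven't verified this on the repo), an "Unknown layer" error during TPU conversion usually means Keras is rebuilding the model from its serialized config and cannot find the custom layer class. A hedged sketch of the usual workaround; the import path for `LayerNorm` is a guess and must be adjusted to wherever BERT-keras actually defines it:

```python
from keras.utils import get_custom_objects

# Assumption: LayerNorm is the custom layer class used inside the Transformer;
# replace this import with the module where BERT-keras really declares it.
from transformer.layers import LayerNorm

# Register the class globally so that any model cloning / deserialization
# (which the Keras-to-TPU conversion performs internally) can resolve the
# "LayerNorm" name found in the model config.
get_custom_objects()["LayerNorm"] = LayerNorm
```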

astariul commented 5 years ago

Changing the maximum sequence length greatly helped!

As described in the official BERT README:

| System | Seq Length | Max Batch Size |
| --- | --- | --- |
| BERT-Base | 64 | 64 |
| ... | 128 | 32 |
| ... | 256 | 16 |
| ... | 320 | 14 |
| ... | 384 | 12 |
| ... | 512 | 6 |

Using a smaller input sequence length allows a bigger batch size.

With a sequence input size of 512, the maximum batch size I could use was 4. Training time = 50 hours per epoch

With a sequence input size of 64, the maximum batch size I can use is 64. Training time = 3 hours per epoch


However, training time is still long...
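
For reference, a small sketch (generic Keras utility, hypothetical token ids, not this repo's data pipeline) of capping the input sequence length before batching; a shorter `maxlen` shrinks every activation tensor, which is what frees room for the larger batch sizes above:

```python
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 64  # shorter sequences -> smaller activations -> larger batches fit

# Hypothetical tokenized sentence pairs (WordPiece ids of varying length).
token_ids = [
    [101, 2023, 2003, 1037, 2460, 6251, 102],
    [101, 1037, 2172, 3556, 6251, 2008, 3727, 2058, 1996, 4555, 3091, 102],
]

# Pad short examples with 0 and truncate long ones so every row is MAX_LEN long.
x = pad_sequences(token_ids, maxlen=MAX_LEN, padding="post", truncating="post")
print(x.shape)  # (2, 64)
```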

astariul commented 5 years ago

It's possible to further reduce the training time by freezing some layers of the Transformer.

I kept only the last 3 encoders trainable, and now my training time is 1h30 per epoch!
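
In case it helps others, a minimal sketch of that freezing scheme in plain Keras; the 12-block toy model and the `encoder_*` layer names are stand-ins, not the actual BERT-keras objects:

```python
from keras import layers, models

# Toy stand-in: a stack of 12 "encoder" blocks (plain Dense layers here);
# in practice this would be the pretrained BERT-keras Transformer.
inp = layers.Input(shape=(64, 768))
x = inp
for i in range(12):
    x = layers.Dense(768, activation="relu", name="encoder_%d" % i)(x)
out = layers.Dense(1, name="regression_head")(x)
model = models.Model(inp, out)

# Freeze everything, then unfreeze only the last 3 encoder blocks and the head.
for layer in model.layers:
    layer.trainable = False
encoder_blocks = [l for l in model.layers if l.name.startswith("encoder_")]
for layer in encoder_blocks[-3:]:
    layer.trainable = True
model.get_layer("regression_head").trainable = True

# Recompile so the new trainable flags take effect before fine-tuning.
model.compile(optimizer="adam", loss="mse")
print([l.name for l in model.layers if l.trainable])
```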

HighCWu commented 5 years ago

This kind of frozen-layer transfer learning is different from fine-tuning. It does speed things up, but if the features learned in the earlier layers and the new training data do not follow the same distribution, it may end up with more error than the fine-tuning approach.

astariul commented 5 years ago

@HighCWu Thank you for your opinion.

My goal when freezing layers is to still have a model that can be fine-tuned (the last 3 encoders remain trainable). As you said, only the last 3 encoders will be fine-tuned, not the whole BERT architecture.

Since I'm quite new to this, I thought it was a good idea: we still fine-tune the last layers of BERT without having to train the whole (too big) architecture.

Anyway, even if the whole architecture were trainable, only the last layers would really change at fine-tuning time, right?


I thought that freezing the first layers would not affect performance that much. Was I wrong?

HighCWu commented 5 years ago

I don't think you were wrong. After all, in transfer learning, even if the entire model is trainable, the original weights will not change too much. Google researchers suggested fine-tuning rather than freezing completely, but that is really too expensive.

astariul commented 5 years ago

Thanks for your input. I will try both approaches and see which one works best for me :)

MHDBST commented 5 years ago

@Colanim How do you fine-tune just specific layers? I know how to fine-tune the whole model and how to not fine-tune at all, but I don't see any specific parameter for fine-tuning specific layers. Could you help me figure it out?