brightmart / albert_zh

A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS, large-scale Chinese pretrained ALBERT models
https://arxiv.org/pdf/1909.11942.pdf

TPU memory consumption #79

Open jinmel opened 4 years ago

jinmel commented 4 years ago

I am trying to run your code with the TPU estimator.

The paper says that it reduces TPU/GPU memory consumption through layer weight sharing, but for some reason ALBERT consumes the same amount of memory as BERT without weight sharing.

Is this expected?

brightmart commented 4 years ago

It is related to the size of your ALBERT model. If your model is tiny, it consumes very little memory, but if you use xlarge, the memory consumption is much bigger than BERT large.

jinmel commented 4 years ago

The model architectures of BERT large and ALBERT large differ only in their weight sharing, yet in practice I was only able to increase the batch size from 40 to 112.

According to the paper, BERT large has 334M params and ALBERT large has 18M params. Shouldn't I be able to use a much larger batch size?
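For a rough sanity check of those numbers, here is a back-of-envelope parameter count (a sketch; the configs are my assumptions based on the standard BERT-large / ALBERT-large hyperparameters from the paper: ~30k vocab, H=1024, 24 layers, 4H feed-forward, E=128 embedding factorization):

```python
# Back-of-envelope parameter count (a sketch; V, H, L, FFN, E are assumed
# standard BERT-large / ALBERT-large hyperparameters; biases/LayerNorm omitted).
V, H, L, FFN, E = 30_000, 1024, 24, 4096, 128

per_layer = 4 * H * H + 2 * H * FFN        # attention projections + feed-forward
bert_large = V * H + L * per_layer         # full embedding table + 24 independent layers
albert_large = V * E + E * H + per_layer   # factorized embedding + ONE shared layer

print(f"BERT large   ~{bert_large / 1e6:.0f}M")    # ~333M
print(f"ALBERT large ~{albert_large / 1e6:.0f}M")  # ~17M
```

So the ~20x parameter reduction itself checks out; my question is why it doesn't translate into a proportionally larger batch size.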

I have also seen the TPU creating all the operations like:

layer_shared_0/attention/dense:0 layer_shared_1/attention/dense:0 ....

although the layers were shared. It seems like the TPU compiles the TensorFlow graph nodes regardless of weight sharing. Have you seen this on your side?
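For what it's worth, here is a minimal TF1-style sketch (not the albert_zh code itself) of what cross-layer parameter sharing with `tf.variable_scope(..., reuse=tf.AUTO_REUSE)` does to the graph: the variables are created once, but every layer call still adds its own ops, so the activations kept for backprop grow with depth regardless of sharing.

```python
# Minimal sketch (my assumptions, not the albert_zh code): variables are shared
# across "layers", but every call still adds its own ops to the graph.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

def shared_dense(x, units):
    # AUTO_REUSE hands back the same kernel/bias variables on every call.
    with tf.variable_scope("layer_shared", reuse=tf.AUTO_REUSE):
        return tf.layers.dense(x, units, name="attention_dense")

x = tf.placeholder(tf.float32, [None, 1024])
h = x
for _ in range(24):              # 24 "layers", all reusing one weight matrix
    h = shared_dense(h, 1024)

# Parameters: only one kernel + bias exist, no matter how many layers ran.
print(len(tf.trainable_variables()))                  # -> 2
# Ops: the graph still contains one MatMul per layer, and the input of each
# must be kept for the backward pass, so activation memory scales with depth.
print(len([op for op in tf.get_default_graph().get_operations()
           if op.type == "MatMul"]))                  # -> 24
```

If that is what is happening here too, the extra batch size would mostly come from the smaller weight and optimizer-state footprint, while activation memory stays roughly the same as BERT large.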