jinmel opened this issue 5 years ago
It is related to the size of your ALBERT model. If your model is tiny, it consumes very little memory, but if you use xlarge, the memory consumption is much larger than BERT-large.
The architectures of BERT-large and ALBERT-large differ only in weight sharing, yet in practice I was only able to increase the batch size from 40 to 112.
According to the paper, BERT-large has 334M parameters and ALBERT-large has 18M. Shouldn't I be able to use a much larger batch size?
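For context, a rough back-of-envelope sketch (illustrative assumptions, not numbers from the repo: float32, Adam keeping two moment slots per parameter, seq_len 512, hidden 1024, 16 heads, 24 layers) of why activation memory, which weight sharing does not reduce, can dominate the footprint:

```python
BYTES = 4                       # float32
def gb(n_elements):
    return n_elements * BYTES / 1024**3

bert_params, albert_params = 334e6, 18e6

# Weights plus Adam's two moment slots -> roughly 3x the parameter count.
print(f"BERT-large   weights+optimizer: {gb(3 * bert_params):.2f} GB")   # ~3.7 GB
print(f"ALBERT-large weights+optimizer: {gb(3 * albert_params):.2f} GB") # ~0.2 GB

# Activations kept for backprop scale with the number of layer *applications*
# and with batch size, regardless of whether the weights are shared.
seq, hidden, heads, layers = 512, 1024, 16, 24
per_layer = (seq * hidden          # layer output
             + heads * seq * seq   # attention probabilities
             + seq * 4 * hidden)   # FFN intermediate
per_example = layers * per_layer
print(f"Activations per example (rough): {gb(per_example):.2f} GB")      # ~0.6 GB
```

Under these (rough) assumptions, sharing weights removes a few GB of parameter and optimizer-state memory, but the per-example activation memory is untouched, which would explain a batch-size gain far smaller than the 334M/18M parameter ratio.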
I have also seen the TPU creating operations like:
layer_shared_0/attention/dense:0 layer_shared_1/attention/dense:0 ....
even though the layers are shared. It seems the TPU compiles TensorFlow graph nodes regardless of weight sharing. Have you seen this on your side?
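For what it's worth, this looks like normal TF1-style variable sharing rather than a TPU quirk: reusing a variable scope gives every layer the same weights, but each application of the layer is still its own set of graph ops, and the name scope gets uniquified (`layer_shared`, `layer_shared_1`, ...). A minimal sketch (hypothetical scope and layer names, not the repo's actual code):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, [None, 128, 1024])

def shared_layer(h):
    # reuse=tf.AUTO_REUSE: every call resolves to the same kernel/bias variables.
    with tf.variable_scope("layer_shared", reuse=tf.AUTO_REUSE):
        with tf.variable_scope("attention"):
            return tf.layers.dense(h, 1024, name="dense")

h = x
for _ in range(24):
    h = shared_layer(h)   # one weight set, but 24 separate matmul ops

matmuls = [op.name for op in tf.get_default_graph().get_operations()
           if op.type == "MatMul"]
print(len(tf.trainable_variables()))  # 2: one shared kernel, one shared bias
print(len(matmuls))                   # 24: one matmul per layer application
print(matmuls[:2])                    # names under layer_shared/..., layer_shared_1/...
```

So the per-layer node names you see are expected; the compute (and the activation memory) is still materialized once per layer, only the variables are deduplicated.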
I am trying to run your code on the TPU Estimator.
The paper says that layer weight sharing reduces TPU/GPU memory consumption, but for some reason ALBERT consumes the same amount of memory as BERT without weight sharing.
Is this expected?