Open Lizw14 opened 4 years ago
❓ Questions and Help

Hi Kaihua, I am training on the GQA dataset with a larger vocabulary: 1685 object labels, 619 attributes, and 312 relations. However, the model size grows significantly in this case, and the model no longer fits into one GPU's memory (11GB) even with a batch size of 1 per GPU. Intuitively, the model size should not explode like this, because vocab_size only increases the size of the final layer in the ROI/attribute/relation heads. Do you have any thoughts on why this happens?

During iterative message passing, the intermediate predictions are embedded to refine the final prediction, so the vocabulary size also affects some intermediate layers. You could try TransformerPredictor; as far as I know, it is the most memory-efficient model.
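The point about intermediate predictions being embedded during message passing can be made concrete with a rough back-of-envelope sketch. All dimensions below (hidden size, embedding size, number of message-passing steps) are illustrative assumptions, not the repository's actual configuration; the VG-sized vocabulary of 151 object labels (150 classes plus background) is used only for comparison against the GQA-sized 1685.

```python
def vocab_dependent_params(vocab_size, hidden_dim=4096, embed_dim=200, n_steps=3):
    """Estimate parameters that scale with vocab_size in a head that
    embeds its intermediate predictions (hypothetical dimensions)."""
    # Final classifier: one hidden_dim x vocab_size weight matrix.
    final_classifier = hidden_dim * vocab_size
    # Each message-passing step embeds the intermediate prediction
    # distribution back into the feature space: a vocab_size x embed_dim
    # matrix per step, so these layers also grow with the vocabulary.
    intermediate_embeddings = n_steps * vocab_size * embed_dim
    return final_classifier, intermediate_embeddings

# Compare a VG-sized object vocabulary (151) with a GQA-sized one (1685).
for vocab in (151, 1685):
    final, inter = vocab_dependent_params(vocab)
    print(f"vocab={vocab}: final layer {final:,} params, "
          f"intermediate embeddings {inter:,} params")
```

Under these assumptions the vocabulary term multiplies every one of those matrices, so growing the vocab roughly 10x grows all vocab-dependent layers 10x, not just the last one.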