Why do we need to add size_mul here?

VisualJoyce commented 3 years ago

Hi, I don't quite understand the code here.

https://github.com/ChenRocks/UNITER/blob/80d3602d71d65700eab373acb0507e31e251b7e7/data/sampler.py#L41-L42

self._size_mul is used for partitioning, so why do we need to add it when checking whether the total token length is exceeded?

badeaadi commented 3 years ago

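Hi @VisualJoyce,

The sampler grows the batch it is forming by self._size_mul new items at a time. The check runs before those items are added, so it has to use the size the batch would have after adding them: the current size plus self._size_mul. The resulting batch must not exceed the maximum number of tokens, self._max_tok, which is the batch_size from your config file.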
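A minimal sketch of that check, to make the role of the added term concrete. It is not the code from the repository: `make_batches` is a hypothetical helper, and it assumes that the token count of a batch is the length of its longest example times the number of examples (that is, padding is counted).

```python
def make_batches(lens, max_tok, size_mul=8):
    """Group example indices into batches that stay within max_tok tokens.

    Hypothetical, simplified sketch; not the UNITER sampler itself.
    Assumes a batch costs (longest example) * (number of examples) tokens.
    """
    # Longest examples first, so the first chunk of a batch fixes its padded length.
    ids = sorted(range(len(lens)), key=lambda i: lens[i], reverse=True)
    batches, batch, max_len = [], [], 0
    for start in range(0, len(ids), size_mul):
        chunk = ids[start:start + size_mul]
        max_len = max(max_len, max(lens[i] for i in chunk))
        # The chunk is not in the batch yet, so test the size the batch
        # WOULD have after adding it: len(batch) + size_mul.
        if max_len * (len(batch) + size_mul) > max_tok:
            if not batch:
                raise ValueError("max_tok is too small for the longest examples")
            batches.append(batch)
            # Start a new batch with this chunk and reset the padded length.
            batch = list(chunk)
            max_len = max(lens[i] for i in chunk)
        else:
            batch.extend(chunk)
    if batch:
        batches.append(batch)
    return batches


# 40 examples of 100 tokens each, with a budget of 3072 tokens per batch.
sizes = [len(b) for b in make_batches([100] * 40, max_tok=3072)]
print(sizes)  # [24, 16]
```

Without the `+ size_mul` term, the test would only look at the batch as it is now, and the fourth chunk would push it to 32 examples, or 3200 tokens, which is over the budget.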

VisualJoyce commented 3 years ago

Thank you for the answer!

In my case, I am trying to choose the best BUCKET_SIZE and self._max_tok. I guess these values are selected empirically, so I might need to change them for a different dataset, right?

badeaadi commented 3 years ago

Indeed, they are empirically selected, but I can give you my own example, based on the VQA task. I am currently training the pretrained UNITER large model on a 1080 with 12GB VRAM.

For batch_size 1024 (in config), the sampler provides batches of 8 examples (self._size_mul).
For batch_size 3072 (in config), the sampler provides batches of 24/32 examples (multiples of self._size_mul).
For batch_size 5120 (in config), the sampler provides batches of 40/48/54 examples, but training sometimes crashes because it is unable to allocate extra memory on the GPU (we are training on a single GPU).

So for me, 3072 is the best, and I imagine you can find yours similarly.
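As a rough way to reason about these numbers, the sketch below estimates how many examples fit in one batch for a given token budget. `largest_batch` is a hypothetical helper, not part of the repository. It assumes that a batch costs its longest example times the number of examples, and the length of 100 tokens is an illustrative value, not a measurement from this thread.

```python
def largest_batch(max_tok, max_len, size_mul=8):
    """Largest multiple of size_mul whose padded size fits within max_tok.

    Hypothetical helper; assumes every example is padded to max_len tokens.
    """
    if max_len * size_mul > max_tok:
        raise ValueError("max_tok is too small for examples of this length")
    return (max_tok // max_len) // size_mul * size_mul


# Illustrative longest-example length of 100 tokens (an assumption).
for budget in (1024, 3072, 5120):
    print(budget, largest_batch(budget, max_len=100))
# prints:
# 1024 8
# 3072 24
# 5120 48
```

Shorter examples let more of them fit under the same budget, which would be one explanation for the batch size varying for a fixed batch_size. Memory use still has to be checked empirically on your own GPU and dataset.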

ChenRocks/UNITER, issue #53