clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

How did training with a batch size of 8 fit onto a single A100? #181

Open csanadpoda opened 1 year ago

csanadpoda commented 1 year ago

In the "Training" section, you mention you used a single A100 with the attached config yaml. An A100 has either 40 or 80GB of VRAM. The batch size is set to 8 in train_cord.yaml with a resolution of [1280, 960].

On a 24 GB RTX 4090 with torch.set_float32_matmul_precision('high') and a resolution of around [1920, 1600] (if I remember correctly, but definitely somewhat above [1280, 960]), a batch size of 1 already takes 20+ GB of VRAM. Yet according to this GitHub page, you managed to fit a batch size of 8: eight times my batch size on roughly three times the VRAM (80 GB vs. 24 GB).

May I ask how this was done? Did you use lower precision? Or did the resolution make such a huge difference? Thank you!
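For comparisons like this, the CUDA allocator's peak statistics are the reliable signal. A minimal sketch of measuring them (a toy conv stack stands in for the actual Donut encoder here, so only the method carries over, not the absolute numbers):

```python
import torch
import torch.nn as nn

torch.set_float32_matmul_precision('high')

# Toy two-layer conv stack as a stand-in for a vision encoder; the real
# Donut/Swin encoder is far larger, so the printed number will differ.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
).cuda()

x = torch.randn(1, 3, 1920, 1600, device="cuda")  # batch size 1 at [1920, 1600]

torch.cuda.reset_peak_memory_stats()
out = model(x)
out.sum().backward()  # forward + backward is where activation memory peaks

print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```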

gwkrsrch commented 1 year ago

Hi @csanadpoda , yes, we used fp16 ( https://github.com/clovaai/donut/blob/master/train.py#L127 ). Hope this helps ;)
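For readers following the link: the knob in question is PyTorch Lightning's `precision` flag on the `Trainer`. A minimal sketch of the pattern (the module names below are placeholders, not the repo's actual classes, and the hyperparameters are illustrative):

```python
import pytorch_lightning as pl

# Sketch only: `model_module` and `data_module` stand in for the repo's
# LightningModule / LightningDataModule built from the config YAML.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,   # fp16 mixed precision: roughly halves activation memory
    max_epochs=30,
)
trainer.fit(model_module, data_module)
```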

csanadpoda commented 1 year ago

> Hi @csanadpoda , yes, we used fp16 ( https://github.com/clovaai/donut/blob/master/train.py#L127 ). Hope this helps ;)

Yes, I'm using the same; I guess the resolution increase is the issue on my end (I'm training at [1600, 1920]). That's 3,072,000 pixels to process vs. 1,228,800 at [1280, 960], so the lower resolution has less than half the pixels, which makes sense. Reducing the resolution to [1280, 960] lets me use a batch size of 2 (maybe even 3).
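A quick check of that pixel arithmetic:

```python
low = 1280 * 960    # 1,228,800 pixels
high = 1600 * 1920  # 3,072,000 pixels
print(low / high)   # 0.4, i.e. the higher resolution processes 2.5x the pixels
```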