clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License
5.73k stars 465 forks

Train doesn't start #186

Open Vadkoz opened 1 year ago

Vadkoz commented 1 year ago

I'm trying to train on 2 A5000 GPUs, but training doesn't start. It just gets stuck here:

Resolving data files: 100%|| 414/414 [00:00<00:00, 100071.57it/s]
Resolving data files: 100%| 20/20 [00:00<00:00, 19733.26it/s]
Resolving data files: 100%| 59/59 [00:00<00:00, 28519.53it/s]
Using custom data configuration synthdog_test-c3da780a4a954bd2
Found cached dataset imagefolder (/root/.cache/huggingface/datasets/imagefolder/synthdog_test-c3da780a4a954bd2/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
Resolving data files: 100%| 414/414 [00:00<00:00, 36854.61it/s]
Resolving data files: 100%| 20/20 [00:00<00:00, 12244.36it/s]
Resolving data files: 100%| 59/59 [00:00<00:00, 17413.55it/s]
Using custom data configuration synthdog_test-c3da780a4a954bd2
Found cached dataset imagefolder (/root/.cache/huggingface/datasets/imagefolder/synthdog_test-c3da780a4a954bd2/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
[rank: 1] Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Memory consumption is stuck too: [image]

Training on one GPU works fine. I'm using the unchanged repo and ~500 SynthDoG images. Hardware: 2x A5000, CUDA 11.7.

PirateX0 commented 1 year ago

I get the same log output, with two differences: 1) training on one GPU does not work either; 2) in my case the script exits immediately without training instead of getting "stuck".

Solution: "pip install ." installs PyTorch 1.13. Reinstall PyTorch with the following versions: torch==1.11.0+cu113, torchvision==0.12.0+cu113
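
To confirm whether an environment already matches those pins before relaunching training, a quick check along these lines may help. This is a minimal sketch and not part of the Donut repo: the helper names (`base_version`, `find_mismatches`, `installed_versions`) are made up for illustration, and it compares only the release number, ignoring the "+cu113" build tag.

```python
from importlib import metadata

# Versions quoted in the comment above; "+cu113" marks the CUDA 11.3 build.
PINNED = {"torch": "1.11.0+cu113", "torchvision": "0.12.0+cu113"}


def base_version(version: str) -> str:
    """Drop the local build tag, e.g. '1.11.0+cu113' -> '1.11.0'."""
    return version.split("+", 1)[0]


def find_mismatches(installed: dict, pinned: dict = PINNED) -> dict:
    """Return {package: (installed, pinned)} for packages that differ from the pin.

    A package that is missing (or mapped to None) is reported with installed=None.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found is None or base_version(found) != base_version(wanted):
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions(names) -> dict:
    """Look up installed versions by distribution name; None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(installed_versions(PINNED))
    if not problems:
        print("torch/torchvision match the pinned releases")
    for name, (found, wanted) in problems.items():
        print(f"{name}: installed {found}, expected {wanted}")
```

If the script reports a mismatch for torch (for example 1.13 pulled in by "pip install ."), reinstalling the two pinned packages as described above is the fix this thread suggests.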