clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

validation loss does not decrease #241

Open · Mann1904 opened this issue 10 months ago

Mann1904 commented 10 months ago

Hello,

I have been trying to fine-tune the Donut model on my custom dataset. However, I have run into an issue where the validation loss stops decreasing after a few training epochs.

Here are the details of my dataset:

Total number of images in the training set: 12032
Total number of images in the validation set: 1290

Here are the config details that I used for training:

    config = {
        "max_epochs": 30,
        "val_check_interval": 1.0,
        "check_val_every_n_epoch": 1,
        "gradient_clip_val": 1.0,
        "num_training_samples_per_epoch": 12032,
        "lr": 3e-5,
        "train_batch_sizes": [1],
        "val_batch_sizes": [1],
        "seed": 2022,
        "num_nodes": 1,
        "warmup_steps": 36096,
        "result_path": "./result",
        "verbose": False,
    }
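
As a sanity check on this schedule (plain arithmetic, nothing Donut-specific): with a batch size of 1, warmup_steps = 36096 is exactly three full epochs, i.e. 10% of the 30-epoch run, so the learning rate is still ramping up through epoch 2.

    # Quick sanity check on the schedule above (plain arithmetic).
    num_train_images = 12032
    batch_size = 1
    max_epochs = 30
    warmup_steps = 36096

    steps_per_epoch = num_train_images // batch_size    # 12032
    total_steps = steps_per_epoch * max_epochs          # 360960
    print(warmup_steps / steps_per_epoch)  # 3.0 epochs of warmup
    print(warmup_steps / total_steps)      # 0.1 -> 10% of training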

Here is the training log:

Epoch 21: 99% 13160/13320 [51:42<00:37, 4.24it/s, loss=0.0146, v_num=0]

Epoch : 0 | Train loss : 0.13534872224594618 | Validation loss : 0.06959894845040267
Epoch : 1 | Train loss : 0.06630147620920149 | Validation loss : 0.06210419170951011
Epoch : 2 | Train loss : 0.05352105059947349 | Validation loss : 0.07186826165058287
Epoch : 3 | Train loss : 0.04720975606560736 | Validation loss : 0.06583545940979477
Epoch : 4 | Train loss : 0.04027246460695355 | Validation loss : 0.07237467494971456
Epoch : 5 | Train loss : 0.03656758802423008 | Validation loss : 0.06615438500516262
Epoch : 6 | Train loss : 0.03334385565814249 | Validation loss : 0.0690448615986076
Epoch : 7 | Train loss : 0.030216083118764458 | Validation loss : 0.06872327175676446
Epoch : 8 | Train loss : 0.028938407997482745 | Validation loss : 0.06971958731054592
Epoch : 9 | Train loss : 0.02591740866504401 | Validation loss : 0.07369288451116424
Epoch : 10 | Train loss : 0.023537077281242467 | Validation loss : 0.09032832324105358
Epoch : 11 | Train loss : 0.023199086009602708 | Validation loss : 0.08460190268222034
Epoch : 12 | Train loss : 0.02142925070562108 | Validation loss : 0.08330771044260839
Epoch : 13 | Train loss : 0.023064635992034854 | Validation loss : 0.08292237208095442
Epoch : 14 | Train loss : 0.019547534460417258 | Validation loss : 0.0834848547896493
Epoch : 15 | Train loss : 0.018710007107520535 | Validation loss : 0.08551564997306298
Epoch : 16 | Train loss : 0.01841766658555733 | Validation loss : 0.08025501600490885
Epoch : 17 | Train loss : 0.017241064160256097 | Validation loss : 0.10344411130643169
Epoch : 18 | Train loss : 0.015813576313222295 | Validation loss : 0.10317703346507855
Epoch : 19 | Train loss : 0.015648367624887447 | Validation loss : 0.09659983590732446
Epoch : 20 | Train loss : 0.01492729377679406 | Validation loss : 0.09451819387128098

The validation loss fluctuates without any consistent downward trend, even as the train loss keeps falling. I would appreciate any insights or suggestions on how to address this and improve validation convergence.
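
For reference, below is a minimal sketch of capping runs like this with early stopping on the validation loss, using PyTorch Lightning callbacks. The "val_loss" metric key is an assumption; check what this repo's lightning module actually logs.

    # Minimal sketch: stop once validation loss stops improving, and keep
    # the best checkpoint. Assumes the LightningModule logs its validation
    # loss under the key "val_loss" (adjust to the repo's actual key).
    from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

    early_stop = EarlyStopping(
        monitor="val_loss",  # assumed metric key
        mode="min",
        patience=3,          # tolerate 3 epochs without improvement
    )
    best_ckpt = ModelCheckpoint(
        monitor="val_loss",
        mode="min",
        save_top_k=1,        # keep only the best-validation checkpoint
    )
    # Pass both to the Trainer:
    # trainer = pl.Trainer(..., callbacks=[early_stop, best_ckpt])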

Thank you for your assistance.

FreestyleMove commented 6 months ago

The train_batch_sizes may be too small to reach a good optimum. Try more warmup_steps.
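
If GPU memory is what forces a batch size of 1, gradient accumulation is one way to get a larger effective batch without extra memory. A minimal sketch using PyTorch Lightning's Trainer (the accumulation factor of 8 is illustrative):

    # Minimal sketch: emulate an effective batch size of 8 while still
    # feeding batches of 1, by accumulating gradients over 8 steps
    # before each optimizer update.
    import pytorch_lightning as pl

    trainer = pl.Trainer(
        max_epochs=30,
        gradient_clip_val=1.0,
        accumulate_grad_batches=8,  # effective batch = 1 x 8
    )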

DriraYosr commented 1 month ago

Hello @Mann1904! Could you please tell me how you accessed the loss information at each epoch? I'm not getting any of that.
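
For anyone with the same question, a minimal sketch of one way to print per-epoch losses with a PyTorch Lightning callback; the "train_loss" and "val_loss" keys are assumptions about what the lightning module logs:

    # Minimal sketch: print the logged metrics at the end of every
    # validation epoch. Assumes self.log("train_loss", ...) and
    # self.log("val_loss", ...) are called in the LightningModule;
    # adjust the keys to whatever the module actually logs.
    import pytorch_lightning as pl

    class EpochLossPrinter(pl.Callback):
        def on_validation_epoch_end(self, trainer, pl_module):
            metrics = trainer.callback_metrics  # all logged metrics
            print(f"Epoch : {trainer.current_epoch} | "
                  f"Train loss : {metrics.get('train_loss')} | "
                  f"Validation loss : {metrics.get('val_loss')}")

    # trainer = pl.Trainer(..., callbacks=[EpochLossPrinter()])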