Dear Team, Dear @NielsRogge,
You are doing excellent work in the NLP space. Thank you for contributing DONUT to HF Transformers.

While following the example notebooks and implementing my own DONUT training for information extraction, I noticed a stark difference between training with PyTorch Lightning data modules + trainers (as you did in the CORD notebook with DONUT) and using a notebook with an HF dataset + the Seq2SeqTrainer: the evaluation scores differ by roughly 20 percentage points w.r.t. the metrics defined in the CLOVA-AI research repo (F1, TED accuracy).

Do you have an idea of what the Seq2SeqTrainer (or the HF image datasets) does differently from your notebook with the PyTorch Lightning trainer?
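For reference, this is roughly how I compare the decoded outputs. It is a simplified, self-contained re-implementation of the field-level F1 (the CLOVA-AI repo's `JSONParseEvaluator` additionally handles nested and grouped fields), and `gt`/`pred` below are hypothetical examples, but the same logic is applied to both training setups, so the gap should not come from the metric itself:

```python
from collections import Counter

def field_f1(pred: dict, gt: dict) -> float:
    """Field-level F1 over flat key-value pairs (simplified sketch:
    the CLOVA-AI evaluator also handles nested/repeated fields)."""
    pred_items = Counter(pred.items())
    gt_items = Counter(gt.items())
    # true positives: key-value pairs present in both prediction and ground truth
    tp = sum((pred_items & gt_items).values())
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_items.values())
    recall = tp / sum(gt_items.values())
    return 2 * precision * recall / (precision + recall)

# hypothetical decoded outputs from one evaluation sample
gt = {"menu.nm": "Latte", "menu.price": "4.50", "total": "4.50"}
pred = {"menu.nm": "Latte", "menu.price": "4.00", "total": "4.50"}
print(field_f1(pred, gt))  # 2 of 3 fields correct
```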
I'd appreciate any help!

Best, Max