arvindrajan92 / DTrOCR

A PyTorch implementation of DTrOCR: Decoder-only Transformer for Optical Character Recognition

Results of paper reproduction #9

Open Mistsink opened 1 month ago

Mistsink commented 1 month ago

Thank you very much for your work; I have been following it since the beginning. I would really like to know whether you have been able to reproduce the results from the original paper. I also built a multilingual OCR version, using LoRA to fine-tune all layers of the model, but the performance was quite poor. I'm looking forward to your response; thank you very much.
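For context, a LoRA fine-tune of a GPT-2-style decoder usually follows the pattern in the minimal sketch below, using the peft library. GPT2LMHeadModel stands in for the GPT-2 blocks that DTrOCR builds on, and the target module name follows GPT-2's fused attention projection; treat this as an assumption rather than the exact setup described above.

```python
# Minimal sketch: LoRA adapters on a GPT-2-style decoder with the peft library.
# GPT2LMHeadModel is a stand-in for DTrOCR's GPT-2 blocks; applying the same
# pattern to this repository's model is an assumption, not a tested setup.
from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

model = GPT2LMHeadModel.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # low-rank update dimension
    lora_alpha=16,              # scaling factor for the LoRA updates
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```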

arvindrajan92 commented 1 month ago

hello @Mistsink, thank you for reaching out and closely following my work.

to reproduce the results from the original paper, the model needs to be pre-trained with 10b examples (according to the paper). without this step, no amount of fine-tuning is going to give us the results we desire. for example, i was able to achieve 99% accuracy on the IAM train set, but only 75% on the test set. in the paper, however, using the pre-trained model, the accuracy on IAM is in the high 90s, which shows the importance of pre-training the model.

this is a personal project i am doing at home in my spare time with an A4000 (16GB) card. while i will (hopefully soon) manage to do some amount of pre-training, it will not be at the scale described in the paper; for that, i will need to hire a cluster of A100s or H100s, and i don't have the budget for that.

out of curiosity, can i ask if you are working on scene text or printed text? or is it handwritten text?

Mistsink commented 1 month ago

Thank you very much for your prompt reply; I thought you would be resting at this time. I understand that this paper primarily aims to address OCR for multilingual and multi-line text, covering scene text, printed text, and handwriting, with printed text being the most prominent. If you don't mind, we could explore some form of collaboration. I currently have four 3090 GPUs available for use. I hope this won't cause you any inconvenience.

arvindrajan92 commented 1 month ago

this is my spare time to work on this, so i am wide awake :smile:

thank you for the offer, @Mistsink. yes, i would definitely be open to collaborating. are you thinking of reproducing the model as described in the paper, with the exact same spec?

Mistsink commented 1 month ago

Haha, I am more interested in whether the architecture proposed in this paper is effective; I want to try out multilingual models rather than just replicate the paper. If you have any ideas, could we connect later on a messaging app that's convenient for both of us?

arvindrajan92 commented 1 month ago

interesting idea. sure, perhaps with some refactoring, i can make the model configurable for multilingual text and point you in the right direction.

happy to connect. may i know where you are based, so that i get an indication of your timezone?

Mistsink commented 1 month ago

I am in China, which is in the UTC+8 time zone. I can use Telegram or Discord; which one works best for you?

arvindrajan92 commented 1 month ago

I am 3 hours ahead of you. How about connecting via Discord?

Can we do it on one of the weekdays next week, if that works for you?

Mistsink commented 1 month ago

I'm happy to do so! How can I add you on Discord? We can discuss the details there.

Past-Tang commented 1 month ago

hello @Mistsink, thank you for reaching out and closely following my work.

to reproduce the results from the original paper, the model needs to be pre-trained with 10b examples (according to the paper). without this step, no amount of fine-tuning is going to give us the results we desire. for example, i was able to achieve 99% accuracy on the IAM train set, but only 75% on the test set. in the paper, however, using the pre-trained model, the accuracy on IAM is in the high 90s, which shows the importance of pre-training the model.

this is a personal project i am doing at home in my spare time with an A4000 (16GB) card. while i will (hopefully soon) manage to do some amount of pre-training, it will not be at the scale described in the paper; for that, i will need to hire a cluster of A100s or H100s, and i don't have the budget for that.

out of curiosity, can i ask if you are working on scene text or printed text? or is it handwritten text?

Thank you for your efforts. I would like to ask about the current status of reproducing the results from the paper. I have GPU computing resources available on my side that could provide some assistance. Could we discuss this further?

arvindrajan92 commented 1 month ago

@Mistsink are you able to find me using the same username as my github account?

arvindrajan92 commented 1 month ago

Thank you for your efforts. I would like to ask about the current status of reproducing the results from the paper. I have GPU computing resources available on my side that could provide some assistance. Could we discuss this further?

@Past-Tang are you also keen on reproducing the DTrOCR model performance reported in the paper?

Past-Tang commented 1 month ago

Yes!

Past-Tang commented 1 month ago

@arvindrajan92 Where is the project currently at? Can we work together to complete the pre-training and attempt to reproduce the performance reported in the paper?

arvindrajan92 commented 1 month ago

@arvindrajan92 Where is the project currently at? Can we work together to complete the pre-training and attempt to reproduce the performance reported in the paper?

I am currently trying to implement key-value caching by referring to HuggingFace's GPT2 implementation. This should improve inference speed.

I am thinking of refactoring the model to be more flexible so it can accept multilingual text. Currently it is designed only for English text using the GPT2 vocabulary. If I can successfully refactor the model and its processor, do you have the resources to replicate the training done in the paper?
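For reference, the caching pattern in HuggingFace's GPT2 looks roughly like the sketch below: each forward pass returns past_key_values, which are fed back in so that only the newest token is run through the transformer on subsequent steps. This only illustrates the HuggingFace API; the implementation in this repository may end up differing.

```python
# Sketch of key-value caching with HuggingFace's GPT-2: cached keys/values from
# previous steps are reused, so each decoding step processes a single token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(10):  # greedy decoding of 10 new tokens
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values           # updated cache
        next_token = outputs.logits[:, -1:].argmax(dim=-1)  # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token                               # only the new token next step

print(tokenizer.decode(generated[0]))
```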

Past-Tang commented 1 month ago

@arvindrajan92
On my end, I have four NVIDIA A6000 48GB GPUs. If we can expect good results, there's also some additional computing power available. If you think your task can be completed with these resources, I believe we could proceed with the training.

Past-Tang commented 1 month ago

@arvindrajan92 Certainly, I'm cautiously optimistic about the reproducibility of the performance claimed in the paper, but we can give it a try.

arvindrajan92 commented 1 month ago

@Past-Tang key-value caching is still a work in progress and i anticipate finishing up that bit this week.

once that is done, i will look into making the model more flexible for multilingual OCR. currently, i am using GPT2's tokeniser, which is English-only. assuming you can 1) get the dataset ready following the procedures outlined in the paper, and 2) have a tokeniser module for encoding and decoding, you should be good to pre-train the model for the languages of your choice.
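As an illustration of point 2, a multilingual BPE tokeniser can be derived from GPT2's by retraining it on a corpus in the target languages. The corpus path below is a placeholder and the vocabulary size is an arbitrary choice, not something taken from the paper.

```python
# Sketch: retrain GPT-2's BPE tokenizer on a multilingual corpus.
# "multilingual_corpus.txt" is a placeholder path; any iterator of text works.
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def corpus_iterator(path="multilingual_corpus.txt", batch_size=1000):
    """Yield batches of raw text lines from the training corpus."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Same BPE algorithm and special tokens as GPT-2, with a new vocabulary
# learned from the multilingual text.
multilingual_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=50_000
)
multilingual_tokenizer.save_pretrained("multilingual-gpt2-tokenizer")
```

The model's token embedding and output head would also need to be resized to the new vocabulary size before pre-training.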

Past-Tang commented 1 month ago

@arvindrajan92 Okay

Past-Tang commented 1 month ago

@arvindrajan92 Hey, dear brother, how's the progress going? I've been following your repository.

arvindrajan92 commented 1 month ago

hi @Past-Tang, it's been a little slow, but i am making progress. i have added some cards for the training and benchmarking scripts. currently, i am working on notes for preparing the dataset here.

once i can get the model pre-training in a short trial run, i would be more confident in making changes to support multilingual ocr.
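In the meantime, a very simplified way to produce (image, transcription) pairs for a printed-text trial run is to render text lines with Pillow, as in the sketch below. This is only a stand-in for the data generation described in the paper, which covers scene, printed, and handwritten text; the font path is a placeholder.

```python
# Simplified sketch: render synthetic printed text lines as (image, text) pairs.
# The font file is a placeholder; point it at any TrueType font on your system.
from PIL import Image, ImageDraw, ImageFont

def render_text_line(text, font_path="DejaVuSans.ttf", font_size=32, padding=8):
    """Render a single line of text in black on a white background."""
    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(text)
    width = (right - left) + 2 * padding
    height = (bottom - top) + 2 * padding
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    draw.text((padding - left, padding - top), text, fill="black", font=font)
    return image

# Each training sample pairs the rendered image with its ground-truth text.
sample_text = "Decoder-only Transformer for OCR"
render_text_line(sample_text).save("sample_line.png")
```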