Mistsink opened this issue 1 month ago
hello @Mistsink, thank you for reaching out and closely following my work.
to reproduce the results from the original paper, the model needs to be pre-trained on 10b examples (according to the paper). without this step, no amount of fine-tuning will give us the results we want. for example, i was able to achieve 99% accuracy on the IAM training set, but only 75% on the test set. in the paper, however, with the pre-trained model, accuracy on IAM is in the high 90s, which shows the importance of pre-training.
this is a personal project i am doing at home in my spare time with an A4000 (16GB) card. while i will (hopefully soon) manage to do some amount of pre-training, it will not be at the scale described in the paper; for that, i would need to rent a cluster of A100s or H100s, and i don't have the budget for it.
out of curiosity, can i ask if you are working on scene text or printed text? or is it handwritten text?
Thank you very much for your prompt reply; I thought you would be resting at this time. I understand that this paper primarily aims to address the OCR tasks for multilingual and multi-line texts, involving scene text, printed text, and handwriting, with printed text being the most prominent. If you don't mind, we could explore some collaboration. I currently have four 3090 GPUs available for use. I hope this won't cause you any inconvenience.
this is my spare time to work on this, so i am wide awake :smile:
thank you for the offer, @Mistsink. yes, i would definitely be open to collaborate. are you thinking of reproducing the model as described in the paper with the exact same spec?
hhhh I am more interested in whether the architecture proposed in this paper is effective; I want to try out multilingual models rather than just replicate the paper. If you have any ideas, could we connect on an app where it's convenient for us to communicate?
interesting idea. sure, perhaps with some refactoring, i can make the model configurable for multilingual and point you in the right direction.
happy to connect. may i know your whereabouts so that i get an indication on your timezone?
I am in China, which is in the UTC+8 time zone. I can use Telegram or Discord; which one works best for you?
I am 3 hours ahead of you. How about connecting via Discord?
Can we do it on one of the weekdays next week, if that works for you?
I'm happy to do so! How can I add you on Discord? We can discuss the details there.
Thank you for your efforts. I would like to inquire about the current status of reproducing the results from certain papers. I have GPU computing resources available on my side which could provide some assistance. Can we discuss this further?
@Mistsink are you able to find me using the same username as my github account?
@Past-Tang are you also keen on reproducing the DTrOCR model performance reported in the paper?
Yes!
@arvindrajan92 Where is the project currently at? Can we work together to complete the pre-training tasks and attempt to reproduce the performance reported in the papers?
I am currently trying to implement key-value caching by referring to HuggingFace's GPT2 implementation. This would improve inference speed.
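For anyone following along, the idea behind key-value caching can be sketched in a few lines. This is a pure-Python toy of single-head attention with a growing cache, not the repo's actual implementation or HuggingFace's `past_key_values` code; all names here are hypothetical. The point is that each decoding step only projects the new token and reuses the stored keys/values for the prefix.

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for one query vector over cached keys/values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Stores keys/values from past steps so each new token only computes
    its own projections instead of re-encoding the whole prefix."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k, v, q):
        # Append this step's key/value, then attend over the full history.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Without the cache, step `t` recomputes keys/values for all `t` previous tokens, so greedy decoding is quadratic in sequence length; with it, each step is linear, which is where the inference speed-up comes from.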
I am thinking of refactoring the model to be more flexible so it can support multilingual text. Currently it's designed only for English text using the GPT2 vocabulary. If I can successfully refactor the model and its processor, do you have the resources to replicate the training done in the paper?
@arvindrajan92
On my end, I have four NVIDIA A6000 48GB GPUs. If we can expect good results, there's also some additional computing power available. If you think your task can be completed with these resources, I believe we could proceed with the training.
@arvindrajan92 Certainly, I'm cautiously optimistic about the reproducibility of the performance claimed in the paper, but we can give it a try.
@Past-Tang key-value caching is still a work in progress and i anticipate finishing up that bit this week.
once that is done, i will look into making the model more flexible for multilingual OCR. currently, i am using GPT2's tokeniser, which is English-only. assuming you can 1) get the dataset ready following the procedures outlined in the paper; and 2) have a tokeniser module for encoding and decoding, you should be good to pre-train the model for languages of your choice.
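To make point 2) concrete, here is a minimal sketch of the encode/decode interface such a tokeniser module could expose. A character-level tokeniser stands in for a real multilingual one (in practice you would likely train a SentencePiece/BPE model instead); the class and attribute names are illustrative, not from this repo.

```python
class CharTokenizer:
    """Toy character-level tokeniser with the encode/decode interface a
    swapped-in multilingual tokeniser would need to provide."""

    def __init__(self, alphabet, pad_token="<pad>", eos_token="<eos>"):
        # Special tokens first, then a deterministic ordering of the alphabet.
        self.tokens = [pad_token, eos_token] + sorted(set(alphabet))
        self.index = {t: i for i, t in enumerate(self.tokens)}
        self.pad_id, self.eos_id = 0, 1

    @property
    def vocab_size(self):
        return len(self.tokens)

    def encode(self, text):
        # Map characters to ids and terminate with EOS.
        return [self.index[c] for c in text] + [self.eos_id]

    def decode(self, ids):
        # Drop special tokens (ids 0 and 1) when reconstructing text.
        return "".join(self.tokens[i] for i in ids if i > 1)
```

The model side would then size its embedding table and output head from `vocab_size`, which is the main thing the refactor needs to make configurable.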
@arvindrajan92 Okay
@arvindrajan92 Hey, dear brother, how's the progress going? I've been following your repository.
hi @Past-Tang, it's been a little slow but i am making progress. i have added some cards for the training and benchmarking scripts. currently, i am writing up some notes on preparing the dataset here.
once i can get the model pre-training in a short trial run, i would be more confident in making changes to support multilingual ocr.
Thank you very much for your work; I have been following it since the beginning. I would really like to know if you have reproduced the results from the original paper. I also created a multilingual OCR version, using LoRA to fine-tune all layers of the model, but the performance was quite poor. I'm looking forward to your response, thank you very much.
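For context on what "LoRA on all layers" means mechanically, here is a minimal sketch of a LoRA-adapted linear layer in pure Python. This is illustrative only; real fine-tuning would go through a library such as PEFT, and the names here are hypothetical.

```python
import random

class LoRALinear:
    """y = W x + (alpha / r) * B(A x); W is frozen, only A and B are trained."""

    def __init__(self, weight, r=4, alpha=8):
        d_out, d_in = len(weight), len(weight[0])
        self.W = weight  # frozen pretrained weight matrix (d_out x d_in)
        # A is small-random, B is zero, so the adapter starts as a no-op.
        self.A = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def __call__(self, x):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        delta = [sum(b * h for b, h in zip(row, ax)) for row in self.B]
        return [b + self.scale * d for b, d in zip(base, delta)]
```

One possible reason LoRA alone underperformed here: the low-rank update only perturbs the pretrained weights, so if the base model never saw the target script during pre-training, there may be too little for the adapters to build on.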