training compatibility with RTL languages

NormXU / nougat-latex-ocr

Codebase for fine-tuning / evaluating nougat-based image2latex generation models

https://arxiv.org/abs/2308.13418

Apache License 2.0


training compatibility with RTL languages #4

Closed: AhmadHakami closed this issue 6 months ago

AhmadHakami commented 6 months ago

Thanks for your efforts. I would like to ask whether it is possible to train Nougat on right-to-left languages such as Hebrew, Arabic, and Persian.

Additionally, I believe we would benefit if you could add more information on data preparation, particularly the ideal image sizes and labeling formats :)

NormXU commented 6 months ago

@AhmadHakami I am afraid that training Nougat on right-to-left languages such as Hebrew, Arabic, and Persian will give limited performance compared with fine-tuning the latest language model. Fine-tuning Nougat-LaTeX gives satisfying results on the img2latex task only because of the strong base model, nougat-base, which was pre-trained on millions of tokens from scientific papers and had already seen many LaTeX formulas at that stage. All I am doing here with the LaTeX dataset is "activating" that existing ability.

As for data preparation, the most effective method in my experience is to adjust the input resolution and use adaptive padding, so that equation image segments in the wild are resized to closely match the resolution of the training data. Please check the processor here.
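To make the resize-and-pad idea concrete, here is a minimal sketch of the size arithmetic only. It is not the processor from this repo; the function name `fit_with_padding`, the centred padding, and the target sizes in the example are assumptions for illustration.

```python
def fit_with_padding(width, height, target_width, target_height):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).

    Hypothetical helper: scales an image to fit inside the target size
    without changing its aspect ratio, then pads the leftover space.
    """
    if min(width, height, target_width, target_height) <= 0:
        raise ValueError("all sizes must be positive")

    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(target_width / width, target_height / height)
    new_w = max(1, min(target_width, round(width * scale)))
    new_h = max(1, min(target_height, round(height * scale)))

    # Split the leftover space between the two sides (assumption: centred).
    pad_w = target_width - new_w
    pad_h = target_height - new_h
    left = pad_w // 2
    top = pad_h // 2
    return new_w, new_h, left, top, pad_w - left, pad_h - top


# A wide 400x100 crop placed on a 200x100 canvas.
print(fit_with_padding(400, 100, 200, 100))  # (200, 50, 0, 25, 0, 25)
```

Note that this sketch also upscales small crops; whether that is desirable depends on how the training images were produced.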

What is your task BTW?

AhmadHakami commented 6 months ago

I'm trying to recognize text in both Arabic and English within a single image or document, similar to the surya model. However, there are issues with surya: the training source code is not available, it does not recognize numbers, regulars, and dates in both languages (Arabic and English), and it does not recognize some fonts.

This example contains Arabic content with some English and Arabic numbers, dates, and regulars: تمديد المدة الخاصة بتطوير وبناء المنصة لمدة (١٢٠) يوماً إضافياً ابتداء من ١٧/٢/١٤٤٥هـ. وبعد الاطلاع على ضوابط قرار رقم (٥٦٣) وتاريخ 15/8/1444هـ. وبعد الاطلاع على المذكرتين رقم (٨٤٩) وتاريخ ١١/٣/١٤٤٥هـ ، ورقم (١٠٥٣) وتاريخ ٢٧/٣/١٤٤٥هـ ، المعدتين في الهيئة، وبعد الاطلاع على المحضرين المعدين برقم (٥٤٢/٤٥/م) وتاريخ [ ١٥/٣/١٤٤٥هـ ]، ورقم (٦٥٣/٤٥/م)

NormXU commented 6 months ago

@AhmadHakami Looks like an OCR task. It is hard to find a good open-source text recognition model for Arabic, so my suggestion is to train one on Arabic data.

A simple roadmap.

Text Detection:

Check out DBNet. It is good and easy to train in my experience.
Data Preparation: You can start with open checkpoints of DBNet and fine-tune on some Arabic documents. Fine-tuning doesn't need much data; ~1k is enough, I guess.

Text Recognition:

You may want to start with CRNN, a classic text recognition model.
Data Preparation: You'll need about 50k image segments cropped from document text lines, each paired with its actual text, to train your recognition model. There are some open-source datasets; please check this repo. However, finding Arabic datasets or open-source CRNN checkpoints might be a bit hard. I suggest making sure more than 50% of your training dataset is in Arabic.
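One rough way to check the "more than 50% Arabic" suggestion is to count how many labels are mostly Arabic script. The sketch below is only an assumption about how one might measure it: the helper names `is_mostly_arabic` and `arabic_share` are hypothetical, and it only looks at the basic Arabic Unicode block (U+0600 to U+06FF), so presentation forms and digits are not counted as Arabic letters.

```python
def is_mostly_arabic(text, threshold=0.5):
    """True if more than `threshold` of the letters in `text` are Arabic script."""
    # Only letters are considered; digits, spaces, and punctuation are skipped.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters) > threshold


def arabic_share(labels):
    """Fraction of labels (ground-truth strings) that are mostly Arabic."""
    if not labels:
        return 0.0
    return sum(is_mostly_arabic(text) for text in labels) / len(labels)


# Hypothetical labels: two Arabic lines, one English line, one digits-only line.
labels = ["قرار رقم", "تاريخ", "invoice", "2024"]
print(arabic_share(labels))  # 0.5
```

Digits-only lines count as non-Arabic here, which is a design choice of this sketch rather than a rule from the thread.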

Hope this helps. Feel free to ask if you have any questions.

AhmadHakami commented 6 months ago

After finishing the fine-tuning, I found all the weights in the directory nougat-latex-ocr/workspace/nougat_latex, all of them in zip format. When I extracted one of the zip files, I found a folder with a long name in it containing: 1. three files: byteorder, data.pkl, version; 2. a data folder containing many files named with numbers: 1, 2, 3, 4, ...

How can I run prediction with the run_latex_ocr Python file using the fine-tuned weights?

NormXU commented 6 months ago

@AhmadHakami there should be a *.pth file under the workspace folder
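For context, the folder contents described above (byteorder, data.pkl, version, and a data folder) match the internal layout of a file written by torch.save, which is itself a zip archive and is meant to be loaded as-is rather than extracted. A small stdlib-only sketch for checking this, assuming the archives that were extracted are the checkpoints themselves; the helper name `looks_like_torch_archive` is hypothetical:

```python
import zipfile


def looks_like_torch_archive(path):
    """True if `path` is a zip archive that contains a '<name>/data.pkl' entry."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # torch.save stores the pickled object under a single top-level folder.
        return any(name.endswith("/data.pkl") for name in archive.namelist())
```

If this returns True for a file, it is probably a checkpoint that torch.load can read directly, so there is no need to unzip it.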

AhmadHakami commented 6 months ago

the same problem: OSError: Incorrect path_or_model_id: 'workspace/nougat-base_epoch3_step260000_lr1.870306e-05_avg_loss0.06036_token_acc0.51288_edit_dis0.10794.pth'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

this is the .pth file if you could try it on your machine

NormXU commented 6 months ago

@AhmadHakami Please try the following script

import torch
from transformers import VisionEncoderDecoderModel

model_name = "Norm/nougat-latex-base"
checkpoint_path = "./workspace/nougat-base_epoch3_step260000_lr1.870306e-05_avg_loss0.06036_token_acc0.51288_edit_dis0.10794.pth"  # this is your fine-tuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize the model architecture from the base model on the Hub
model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)

# Load the fine-tuned weights on CPU, then copy them into the model
state_dict = torch.load(checkpoint_path, map_location=torch.device("cpu"))
model.load_state_dict(state_dict, strict=True)

AhmadHakami commented 6 months ago

It works with model.load_state_dict(state_dict, strict=True). Thank you, life saver :)