NormXU / nougat-latex-ocr

Codebase for fine-tuning / evaluating nougat-based image2latex generation models
https://arxiv.org/abs/2308.13418
Apache License 2.0

Fine-tuning for custom dataset of multiline equations #10

Open Shobhit1201 opened 1 week ago

Shobhit1201 commented 1 week ago

I am new to this, so I am unable to understand how to train/fine-tune the original model. I tried to use the original model checkpoints in your version for fine-tuning, but it gives an error that weights are missing or unexpected; for inference, however, it works fine. Could you please help me figure out how to move forward?

Moreover, the original model uses a very different type of PDF dataset, whereas mine is similar to yours, so I am stuck. Also, I have a variety of pictures with varied dimensions; only these big equations are not recognized correctly, otherwise the model recognizes most of the multiline equations. @NormXU

Shobhit1201 commented 1 week ago

@NormXU
I started with nougat-latex-base. I grouped images by aspect ratio as you suggested, and the model seemed to work fine on wide or ultra-wide images, but it struggles with very complex sets of large equations even after fine-tuning for 100 epochs.

Now, if I start with the original model, I do not understand how to provide my dataset as input, since it accepts a PDF dataset in a different format. Could you please help?

NormXU commented 1 week ago

@Shobhit1201 As for the first question:

> Also I have a variety of pictures with varied dimensions; only these big equations are not recognized correctly, otherwise it recognizes most of the multiline equations.

This sounds reasonable because large equations are out-of-distribution (OOD) for the model. However, the aspect ratios of most multi-line equations are close to those in my training dataset, which enables the model to recognize them correctly.

To address the OOD problem, I believe the most effective solution is to scale up the dataset or fine-tune the model from the original checkpoint.

Fine-tuning the nougat-base checkpoint in this codebase is quite simple. You can begin with this configuration file.

facebook/nougat-base is a VisionEncoderDecoder model, so there's no need to modify the model architecture or the distributed training code. You can check this paper for more details about the architecture. Your main task is creating a compatible dataset and updating this line with the path to your dataset.
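As a quick sanity check that no architecture changes are needed, the checkpoint loads directly with the generic VisionEncoderDecoderModel class from Hugging Face transformers (this is only an illustration, not the repo's own initialization code, which goes through its config file):

```python
from transformers import NougatProcessor, VisionEncoderDecoderModel

# nougat-base is a stock VisionEncoderDecoder (Swin-style encoder + mBART decoder),
# so the generic class loads it without any custom modeling code.
processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
print(type(model.encoder).__name__, type(model.decoder).__name__)
```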

For instance, if your dataset consists of PDF files, preprocess them so your dataset is structured as follows:
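A plausible layout (a hedged sketch: the directory names, file names, and the train.jsonl fields below are assumptions; the exact format is whatever the dataset code referenced below expects) could be:

```
dataset/
├── images/
│   ├── 000001.png
│   ├── 000002.png
│   └── ...
└── train.jsonl   # one record per line, e.g. {"image": "images/000001.png", "latex": "\\frac{a}{b}=c"}
```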

Refer to the NougatDataset class in this script for creating a DataLoader that provides (image, text) pairs for each iteration. Each batch of (image, text) pairs is then fed into the model, as shown in https://github.com/NormXU/nougat-latex-ocr/blob/d735d3a31bfd0cd48a020e01c5233a9154c6d4c2/experiment/donut_experiment.py#L112-L118
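As a rough end-to-end sketch of that loop (EquationDataset, train.jsonl, and the field names are hypothetical and mirror the layout above; the repo's NougatDataset and the training step in donut_experiment.py are the authoritative versions):

```python
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from transformers import NougatProcessor, VisionEncoderDecoderModel


class EquationDataset(Dataset):
    """Hypothetical (image, LaTeX) pair dataset; see NougatDataset in the repo for the real one."""

    def __init__(self, root, processor, max_length=512):
        self.root = Path(root)
        self.processor = processor
        self.max_length = max_length
        # Assumes the train.jsonl layout sketched above.
        with open(self.root / "train.jsonl") as f:
            self.samples = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        image = Image.open(self.root / sample["image"]).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values.squeeze(0)
        labels = self.processor.tokenizer(
            sample["latex"],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        ).input_ids.squeeze(0)
        # Ignore padding positions in the cross-entropy loss.
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        return pixel_values, labels


processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
# Make sure the ids used to build decoder inputs are set (the checkpoint config may already define them).
model.config.decoder_start_token_id = processor.tokenizer.bos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

loader = DataLoader(EquationDataset("path/to/dataset", processor), batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for pixel_values, labels in loader:
    # The model shifts the labels internally to build decoder inputs and returns the LM loss.
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```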

Shobhit1201 commented 1 week ago

My dataset consists of 6000 images of multiline equations, in the form of a folder named "Images" and the corresponding LaTeX file. I get your point about scaling up the dataset. The image dimensions are:

Width range: min = 159, max = 1562
Height range: min = 24, max = 1168
Mean width: 781.24, mean height: 320.87
Median width: 758.0, median height: 295.0

When I try to pass the original model checkpoint file "pytorch_model.bin" as the model path, it shows an error that weights are missing or unexpected, but during inference the same model file works. Can you please help me figure out how to tackle this?

NormXU commented 1 week ago

@Shobhit1201 Model initialization is here: https://github.com/NormXU/nougat-latex-ocr/blob/d735d3a31bfd0cd48a020e01c5233a9154c6d4c2/experiment/donut_experiment.py#L135-L145

You need to pass a model directory rather than the path to the pytorch_model.bin file, i.e. the path to the folder that contains pytorch_model.bin.
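Assuming the loading ultimately goes through a from_pretrained-style call (see the initialization code linked above), the configured path should look like this (the path itself is a placeholder):

```python
from transformers import VisionEncoderDecoderModel

ckpt_dir = "path/to/your/checkpoint"  # the directory that holds pytorch_model.bin and config.json
# not "path/to/your/checkpoint/pytorch_model.bin"
model = VisionEncoderDecoderModel.from_pretrained(ckpt_dir)
```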

Shobhit1201 commented 12 hours ago

Where is the evaluation script for token_acc?