facebookresearch / nougat

Implementation of Nougat: Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

How to make it work for any image #52

Open XFastDataLab opened 1 year ago

XFastDataLab commented 1 year ago

I wrote an equation on a white sheet of paper, took a photo with my phone, and then used Nougat to extract the equation, but I got nothing. What's the problem?

marwinsteiner commented 1 year ago

Nougat is designed to work with PDF files -- if you're feeding it an image when it expects a PDF, that may be where the issue lies. If you use a scanning app like Adobe Scan, it will scan the page for you so that it looks like a printer scan, and it saves the result as a PDF. Feed that to Nougat.

If you still do not get a result, or not a good one, it's because your image is not similar enough to the training data. E.g. if you get something like [Missing page: {number}], then Nougat either didn't detect any content there that is similar enough to its training data, or it skipped the page because it thinks it's empty. That's my understanding of it as of right now.

If you only need to detect equations, you might be better off with Lukas Blecher's LaTeX-OCR library, available here: https://github.com/lukas-blecher/LaTeX-OCR. It works like the Mathpix Snip tool for equations.

gk966988 commented 1 year ago

It is feasible to process images directly. During data processing, the PDF is first converted into images before being input to the model. Take a look at the processing flow of "LazyDataset" and make the corresponding modifications.

marwinsteiner commented 1 year ago

Actually, FWIW, why did you choose to use arXiv papers as a dataset, as opposed to generalizing certain features found in academic papers -- text (in different languages), block and inline LaTeX equations, tables, etc. -- and then generating a dataset with a separate generator script, rather than pulling down PDFs from arXiv which may or may not contain those generalized characteristics?

Generating the dataset from a few generalized characteristics could perhaps improve the quality of output across a) different languages and b) different subject domains, e.g. better performance on a pure-text PDF in the humanities rather than (only) a STEM paper.

Would this be an argument for fine-tuning?

XFastDataLab commented 1 year ago

I ran two experiments. In the first, I saved a page from a PDF paper as an image, and Nougat identified all the words and equations well. In the second, I took a photo and converted it into an image resembling a page of a paper, and this time Nougat failed.

Yashar78 commented 1 year ago

Did you do all the necessary preprocessing before feeding it to Nougat?

Yashar78 commented 1 year ago

I'd guess there is little chance of it working well for images that don't look like the training data, in terms of quality and appearance.

NickDatLe commented 1 year ago

Did you do all the necessary preprocessing before feeding it to Nougat?

What preprocessing are you performing to increase the accuracy of Nougat?