facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License
8.9k stars 565 forks

The markdown file is empty #7

Open parkLGW opened 1 year ago

parkLGW commented 1 year ago

After running the command nougat xxx.pdf, I got the mmd file, but there is no content in it. What's the reason?

ralph-bot commented 1 year ago

What is the language of the PDF's content? It seems only English is supported so far.

parkLGW commented 1 year ago

Well, the PDF is in Chinese. Thanks for your reply.

lucasjinreal commented 1 year ago

How to view the mmd file

rishabh10gpt commented 1 year ago

How to view the mmd file

Hey @lucasjinreal, you can use the Mathpix Markdown extension in VS Code.

viviayi commented 1 year ago

I got [MISSING_PAGE_EMPTY:1] too, and there was no Chinese in my document, only mathematical equations.

SuperMaxine commented 1 year ago

In my output, I also see [MISSING_PAGE_FAIL:xxx], but not consistently; it appears sporadically. Some PDFs yield only a few of these errors, while others lose more than half of their pages to MISSING_PAGE_FAIL. On the command line, I also noticed that the number of WARNING:root:Found repetitions in sample xxx and WARNING:root:Skipping page xxx due to repetitions. messages seems to correlate with the number of MISSING_PAGE_FAIL instances in the results. I'm curious which characteristics of a PDF cause this, as I haven't found any pattern so far.
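To compare the warning count with the markers in the output, one option is to tally the markers per .mmd file. The following is a minimal sketch, not part of Nougat: the function names are hypothetical, and the marker shapes are taken from the messages quoted in this thread, with the page-number suffix treated as optional as an assumption.

```python
import re
from collections import Counter
from typing import List

# Marker shapes are copied from the messages quoted in this thread.
# Treating the ":<page>" suffix as optional is an assumption.
MARKER_RE = re.compile(r"\[MISSING_PAGE_(EMPTY|FAIL|POST)(?::(\d+))?\]")


def count_missing_pages(mmd_text: str) -> Counter:
    """Count MISSING_PAGE markers in .mmd text, grouped by kind."""
    return Counter(m.group(1) for m in MARKER_RE.finditer(mmd_text))


def missing_page_numbers(mmd_text: str, kind: str) -> List[int]:
    """Return the page numbers attached to markers of the given kind."""
    return [
        int(m.group(2))
        for m in MARKER_RE.finditer(mmd_text)
        if m.group(1) == kind and m.group(2) is not None
    ]
```

Reading a file with `pathlib.Path("xxx.mmd").read_text(encoding="utf-8")` and passing the text to `count_missing_pages` gives one tally per document, which can then be set against the number of repetition warnings logged for the same PDF.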

timdingman-scale commented 1 year ago

Seeing this error a lot. I've attached two examples that consistently produce this error. solar.pdf units.pdf

lukas-blecher commented 1 year ago

Seeing this error a lot. I've attached two examples that consistently produce this error. solar.pdf units.pdf

None of these images resemble an academic document. Nougat was trained mostly on arXiv papers (which are predominantly in English). There is some generalization to other document types, e.g. older papers, but input images that differ too much from the training domain are not expected to be recognized.

jchopap commented 1 year ago

I have been using this on some PDFs and am primarily seeing the issues below. Could you help me with the way forward? I had seen the MISSING_PAGE_FAIL error in many places, so I added the no-skipping argument when running inference, but with it I am seeing:

  1. Many repetitive text loops (entire pages repeated many times)
  2. MISSING_PAGE_POST errors (even when the page is not empty)
  3. Often, a lot of random text is extracted that is not present in the PDF
  4. Content that cannot be extracted includes plain paragraphs, tables, images, tables of contents, etc.

Could you please suggest ways to move forward here?

Note: My inference data resembles the structure of academic documents.
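To get a rough measure of the repetition loops in issue 1, the output can be checked for paragraphs that occur more than once. This is only a sketch under stated assumptions: it is not Nougat's own repetition heuristic, the function name and thresholds are hypothetical, and it only catches exact repeats after whitespace normalization.

```python
import re
from collections import Counter
from typing import Dict


def repeated_paragraphs(
    mmd_text: str, min_chars: int = 40, min_count: int = 2
) -> Dict[str, int]:
    """Map each repeated paragraph to the number of times it occurs.

    Paragraphs are split on blank lines and whitespace-normalized.
    Paragraphs shorter than min_chars are ignored so that short
    headings or equation labels do not count as loops.
    """
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", mmd_text)]
    counts = Counter(p for p in paragraphs if len(p) >= min_chars)
    return {p: n for p, n in counts.items() if n >= min_count}
```

A non-empty result for a document points to pages worth inspecting by hand; an empty result does not rule out near-duplicate text, which this exact-match check will miss.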

atheeralattar commented 1 year ago

Here is my file, I am getting [MISSING_PAGE_EMPTY:1] formula.pdf

Here is the output:



where \(\tau\) is the delay time.

Cao's method [64] computes \(E_{1}\) and \(E_{2}\) for the data set of dimension 1 up to a dimension of \(D\), which is the largest embedding dimension, used for calculate. \(E_{1}\) and \(E_{2}\) defined as follows:

\[E_{1}(d)=\frac{1}{N-d\tau}\left|\sum_{i=1}^{N-d\tau}\left|x_{i+ dt}-x_{n(i,d)+dt}\right|\right| \tag{5.90}\] \[E_{2}(d)=E_{1}(d+1)/E_{1}(d) \tag{5.91}\]

wherein \(d\) is the embedding dimension, \(N\) is the number of data points, \(\tau\) is the embedding delay, \(x_{i+dt}\) and \(x_{n(i,d)+dt}\) is the \(i\)-\(th\) vector in the data sets and its nearest neighbors of d-dimensional phase space.

##### 5.6.1.2 Largest Lyapunov Exponent (LLE)

The basic characteristics of chaotic motion are that the movement is extremely sensitive to initial conditions, two very close initial values resulting in orbit over time by separating exponentially, Lyapunov exponent [66, 67] that describes the amount of this phenomenon.

We use the algorithm of Rosenstein et al. [67] to calculate the LLE. The results were carried out with Tisean package [68], version 3.01. Consider the representation of the time series data as a trajectory in the embedding space, and assume that observe a very close return \(s_{n^{\prime}}\) to a previously visited point \(s_{n}\). Then consider the distance \(\Delta_{0}=s_{n}-s_{n^{\prime}}\) as a small perturbation, \(\Delta l=s_{n+l}-s_{n^{\prime}+l}\). If one finds that \(\left|\Delta_{l}\right.\mid\approx\Delta_{0}e^{\Delta l}\) then \(\lambda\) is the largest Lyapunov exponent.

Assuming \(S(\varepsilon,m,t)\) exhibits a linear increase with identical slope for all \(m\) larger than some \(m_{0}\) and for a reasonable range of \(\varepsilon\), and then this slope can be taken as an estimate of the largest exponent.

\[S(\varepsilon,m,t)=\left\{\,\ln\left(\frac{1}{u_{n}}\sum_{s_{n^{\prime}}\in u _{n}}\left|s_{n+t}-s_{n^{\prime}+l}\right|\right)\right\}_{n} \tag{5.92}\]

##### 5.6.1.3 Correlation Dimension

The correlation dimension method is used for detecting the presence possibility of chaos. An algorithm proposed by Grassberger and Procaccia [65] is the most