How can one separate the Markdown output of Nougat from the LaTeX output (which mostly contains the tabular data)? #111

bp-high commented 1 year ago

While using Nougat (the stable 0.1.8 version, https://pypi.org/project/nougat-ocr/0.1.8/), I have seen that the final output comes as two components: Markdown text, and tables/tabular data in LaTeX format, like the one below:

\begin{table}
\begin{tabular}{l|l c c c c c}
\hline \hline
Method & Modality & Edit distance $\downarrow$ & BLEU $\uparrow$ & METEOR $\uparrow$ & Precision $\uparrow$ & Recall $\uparrow$ & F1 $\uparrow$ \\ \hline
PDF & All & 0.255 & 65.8 & 82.1 & 77.1 & 81.4 & 79.2 \\ \hline
GROBID & All & 0.312 & 55.6 & 71.9 & 74.0 & 72.1 & 73.0 \\ \cline{2-7}
& Tables & 0.626 & 25.1 & 64.5 & 61.4 & 80.7 & 69.7 \\
+ LaTeX OCR & Plain text & 0.363 & 57.4 & 69.2 & 82.1 & 70.5 & 75.9 \\
& Math & 0.727 & 0.3 & 5.0 & 11.0 & 8.6 & 9.7 \\ \hline
\multirow{4}{}{Nougat small (250M${}^{}$)} & All & 0.073 & 88.9 & 92.8 & 93.6 & 92.2 & 92.9 \\ \cline{2-7}
& Tables & 0.220 & 68.5 & 78.6 & 75.0 & 79.8 & 77.3 \\ \cline{1-1}
& Plain text & 0.058 & 91.0 & 94.3 & 96.1 & 95.3 & 95.7 \\ \cline{1-1}
& Math & 0.117 & 56.0 & 74.7 & 77.1 & 76.8 & 76.9 \\ \hline
\multirow{4}{}{Nougat base (350M${}^{}$)} & All & 0.071 & 89.1 & 93.0 & 93.5 & 92.8 & 93.1 \\ \cline{1-1}
& Tables & 0.211 & 69.7 & 79.1 & 75.4 & 80.7 & 78.0 \\ \cline{1-1}
& Plain text & 0.058 & 91.2 & 94.6 & 96.2 & 95.3 & 95.7 \\ \cline{1-1}
& Math & 0.128 & 56.9 & 75.4 & 76.5 & 76.6 & 76.5 \\ \hline \hline
\end{tabular}
\end{table}

Is there a simple way to detect the LaTeX components in a string and separate them from the rest of the Markdown?

lukas-blecher commented 1 year ago

Since the tables are always wrapped in \begin{table} ... \end{table} tags (as long as the output doesn't grow beyond the max token length), you can just split at these substrings, or use a regex to extract the tables.
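The regex approach suggested above could look roughly like the sketch below. It is only an illustration, not part of Nougat: the function name `split_tables` and the `sample` string are made up for this example, and it assumes each table is a complete, non-nested \begin{table} ... \end{table} block.

```python
import re

# Matches one complete table environment. DOTALL lets "." span newlines,
# and the non-greedy ".*?" stops at the first \end{table}.
TABLE_RE = re.compile(r"\\begin\{table\}.*?\\end\{table\}", re.DOTALL)


def split_tables(text):
    """Return (markdown_without_tables, list_of_latex_tables).

    Hypothetical helper for illustration only.
    """
    tables = TABLE_RE.findall(text)
    markdown = TABLE_RE.sub("", text)
    # Collapse the runs of blank lines left where tables were removed.
    markdown = re.sub(r"\n{3,}", "\n\n", markdown).strip()
    return markdown, tables


# Made-up sample input standing in for Nougat's output text.
sample = (
    "# Results\n\n"
    "Some text before.\n\n"
    "\\begin{table}\n\\begin{tabular}{l c}\nA & 1 \\\\\n\\end{tabular}\n\\end{table}\n\n"
    "Some text after.\n"
)

markdown, tables = split_tables(sample)
```

Note that a table cut off by the token limit has no closing \end{table}, so this pattern will not match it and the fragment stays in the Markdown part; that case would need separate handling.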

KartavyaBagga commented 1 year ago

> since there are always \begin{table} \end{table} tags (as long as the output doesn't grow beyond the max token length) you can just split at these substrings. or use a regex to extract the tables

@lukas-blecher How can I increase the max token length beyond 4096?

Settings used: --full-precision, base model, --no-skipping

For a 100-page PDF, it skipped part of a page and instead repeated that part many times.
