facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License
8.83k stars, 561 forks

Detail about generating dataset #10

Open akadoubleone opened 1 year ago

akadoubleone commented 1 year ago

I am trying to generate the dataset, which involves converting .tex to .html with LaTeXML and running nougat.dataset.split_htmls_to_pages, but I ran into some problems.

Env: Ubuntu 16.04 and 22.04; LaTeXML 0.8.6 and 0.8.7; Python 3.10
TeX and PDF file source: https://arxiv.org/abs/adap-org/9912004

Question:

lukas-blecher commented 1 year ago

We used engrafo to call LaTeXML and did not use latexmlpost at all. Try it again that way.

Yes, latexml does replace user-defined macros and the like. Tables are handled later, in the conversion from HTML to mmd with our parser.

We used LaTeXML v0.8.4, but I think newer versions shouldn't be a problem.

1398listener commented 1 year ago

Thank you for your contributions to this research field.

I followed your configuration: I used engrafo to get index.html (all settings at their defaults) and pdffigures2.jar to get the figure JSON files. The directory structure is as follows:

├── figures
│   ├── 1706.03762.json
│   ├── 2308.13418.json
│   ├── 2309.00916.json
│   └── 2309.07900.json
├── htmls
│   ├── 1706.03762.html
│   ├── 2308.13418.html
│   ├── 2309.00916.html
│   └── 2309.07900.html
├── pdfs
│   ├── 1706.03762.pdf
│   ├── 2308.13418.pdf
│   ├── 2309.00916.pdf
│   └── 2309.07900.pdf
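
Editor's note: the layout above pairs files by arXiv ID across three folders. A minimal sketch for checking that every ID has all three files is shown here. The function name `find_incomplete_papers` is hypothetical and not part of nougat; the folder names and extensions are taken from the tree above, and whether the splitting script requires exactly this pairing is an assumption.

```python
from pathlib import Path


def find_incomplete_papers(root):
    """Return {paper_id: [folders missing that paper]} for the layout shown above.

    Hypothetical helper, not part of nougat.
    """
    root = Path(root)
    # Folder name -> expected file extension, as in the directory tree above.
    layout = {"figures": ".json", "htmls": ".html", "pdfs": ".pdf"}
    ids_per_dir = {
        # Path.stem drops only the last suffix, so "1706.03762.pdf" -> "1706.03762".
        name: {p.stem for p in (root / name).glob(f"*{ext}")}
        for name, ext in layout.items()
    }
    all_ids = set().union(*ids_per_dir.values())
    incomplete = {}
    for paper_id in sorted(all_ids):
        missing = sorted(n for n, ids in ids_per_dir.items() if paper_id not in ids)
        if missing:
            incomplete[paper_id] = missing
    return incomplete
```

An empty result means every paper has a JSON, an HTML, and a PDF file; it says nothing about whether the contents of those files match each other.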

Then I ran this command:

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

I can only get 4 recognized pages:

ERROR:root:unusable reference "structuredAttentionNetworks, "
ERROR:root:unusable reference "2020a"
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
 25%|██████████████████████████████████████████████▌                                                                                                                                           | 1/4 [00:03<00:09,  3.01s/it]
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.33it/s]
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%

I wonder why the recognition rate is so low. Thanks again.
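
Editor's note: to see which documents drag the rate down, the per-paper INFO lines in a log like the one above can be tabulated. This is a minimal sketch that assumes the log format is exactly as quoted in this comment; `recognition_rates` and `worst_first` are hypothetical names, not nougat functions.

```python
import re

# Matches per-paper lines such as
# "INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%".
# The "In total" summary line is skipped because \S+ cannot span its space.
PAGE_LINE = re.compile(
    r"INFO:root:(?P<paper>\S+): (?P<ok>\d+)/(?P<total>\d+) pages recognized"
)


def recognition_rates(log_text):
    """Return {paper_id: (recognized_pages, total_pages)} parsed from the log."""
    return {
        m["paper"]: (int(m["ok"]), int(m["total"]))
        for m in PAGE_LINE.finditer(log_text)
    }


def worst_first(rates):
    """Return paper IDs sorted by recognition ratio, lowest first (ties by ID)."""
    def ratio(item):
        paper, (ok, total) = item
        return (ok / total if total else 0.0, paper)
    return [paper for paper, _ in sorted(rates.items(), key=ratio)]
```

Sorting lowest first puts the papers with no recognized pages at the top, which are the natural ones to inspect by hand.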

rtz19970824 commented 10 months ago

Hi there,

I'm wondering whether you handle the mismatch between the references in the LaTeXML output and those in the original PDF. For example, in 2308.00002 the original PDF cites references by author name and year, while the LaTeXML output cites them by number. In 2308.00082 the original PDF uses numbers, while the LaTeXML output seems to use the bibitem identifier. Any information would be helpful.

Thanks in advance!
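
Editor's note: one way to spot the kind of mismatch described above is to classify the citation style of the text extracted from each side and compare the two labels. The following is a rough heuristic sketch, not how nougat handles references; the function name and both regular expressions are assumptions made for illustration and will miss many real citation formats.

```python
import re

# Bracketed numeric citations such as "[3]" or "[1, 2]" or "[4-6]".
NUMERIC = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
# Author-year citations such as "Vaswani et al. (2017)" or "(Kim, 2020a)".
AUTHOR_YEAR = re.compile(
    r"[A-Z][A-Za-z'-]+(?: et al\.| and [A-Z][A-Za-z'-]+)?,? \(?(?:19|20)\d{2}[a-z]?\)?"
)


def citation_style(text):
    """Label text as 'numeric', 'author-year', 'mixed', or 'none' (heuristic)."""
    numeric = len(NUMERIC.findall(text))
    author_year = len(AUTHOR_YEAR.findall(text))
    if numeric == 0 and author_year == 0:
        return "none"
    if numeric > author_year:
        return "numeric"
    if author_year > numeric:
        return "author-year"
    return "mixed"
```

If the label for the PDF text differs from the label for the HTML text of the same paper, the citation markers are unlikely to line up between the two.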

crescent73 commented 10 months ago

Hello, I've received your letter!

CHENG-EMMA1 commented 6 months ago

Quoting 1398listener's comment above (4/56 pages recognized, 7.14%):

Have you finished processing by now? Does the Markdown you get after following the steps omit the images and tables, so that they do not appear even at the end of the Markdown?

crescent73 commented 6 months ago

Hello, I've received your letter!