facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

Some questions about generating dataset #93

Closed by 1398listener 1 year ago

1398listener commented 1 year ago

Thank you for your contributions to this research field.

I followed your setup:

  1. Used engrafo to generate index.html for each paper (all settings left at their defaults).
  2. Used pdffigures2.jar to generate the figure JSON files.

The directory structure is as follows:

├── figures
│   ├── 1706.03762.json
│   ├── 2308.13418.json
│   ├── 2309.00916.json
│   └── 2309.07900.json
├── htmls
│   ├── 1706.03762.html
│   ├── 2308.13418.html
│   ├── 2309.00916.html
│   └── 2309.07900.html
├── pdfs
│   ├── 1706.03762.pdf
│   ├── 2308.13418.pdf
│   ├── 2309.00916.pdf
│   └── 2309.07900.pdf
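Before running the split, it may be worth confirming that every paper has all three inputs. The sketch below is not part of nougat: the helper name `missing_inputs` and the root path are hypothetical, and it assumes files are paired purely by their stem (e.g. `1706.03762`), as the listing above suggests.

```python
from pathlib import Path

# Sub-directory name -> expected file extension, taken from the layout above.
KINDS = {"pdfs": ".pdf", "htmls": ".html", "figures": ".json"}


def missing_inputs(root):
    """Return {paper_id: [sub-directories lacking a file for that paper]}.

    Assumption: files are matched by stem only. This is a sanity check,
    not a description of how nougat pairs files internally.
    """
    root = Path(root)
    stems = {
        sub: {p.stem for p in (root / sub).glob(f"*{ext}")}
        for sub, ext in KINDS.items()
    }
    all_ids = set().union(*stems.values())
    report = {}
    for paper_id in sorted(all_ids):
        absent = [sub for sub in KINDS if paper_id not in stems[sub]]
        if absent:
            report[paper_id] = absent
    return report
```

Calling `missing_inputs("path/to/dataset/root")` returns an empty dict when every paper has a PDF, an HTML file, and a figure JSON.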

And by using this command:

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

only 4 of 56 pages are recognized:

ERROR:root:unusable reference "structuredAttentionNetworks, "
ERROR:root:unusable reference "2020a"
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
 25%|██████████████████████████████████████████████▌                                                                                                                                           | 1/4 [00:03<00:09,  3.01s/it]
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.33it/s]
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%
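To compare runs more easily, the per-paper lines in this log can be tallied with a few lines of Python. This is only a sketch: the regular expression is inferred from the log lines shown above, not from the nougat source, and the names `PAGE_LINE` and `summarize` are hypothetical.

```python
import re

# Matches lines such as "INFO:root:2308.13418: 3/17 pages recognized."
# The "In total" line is skipped because its label contains a space.
PAGE_LINE = re.compile(
    r"INFO:root:(?P<paper>\S+): (?P<hit>\d+)/(?P<total>\d+) pages recognized"
)

# Per-paper lines copied from the log above.
LOG = """\
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%
"""


def summarize(log_text):
    """Return ({paper: (recognized, total)}, recognized_sum, total_sum)."""
    per_paper = {
        m["paper"]: (int(m["hit"]), int(m["total"]))
        for m in PAGE_LINE.finditer(log_text)
    }
    hits = sum(hit for hit, _ in per_paper.values())
    total = sum(count for _, count in per_paper.values())
    return per_paper, hits, total


per_paper, hits, total = summarize(LOG)
# List the papers with the fewest recognized pages first.
for paper, (hit, count) in sorted(per_paper.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{paper}: {hit}/{count}")
print(f"total: {hits}/{total} ({100 * hits / total:.2f}%)")
```

The totals recomputed from the per-paper lines (4 of 56) agree with the "In total" line, so the two papers at 0% account for half of the 56 pages.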

Questions:

  1. Do you use engrafo to process the whole source dir or just the main tex file?
  2. Do you use engrafo's '--no-post-processing' setting? (I have tried both 'post-processing' and 'no-post-processing', but the generated HTML files are the same.)
  3. My recognition rate is very low. Is this a special case, or is it normal? What recognition rate did you get?

Thanks again.

lukas-blecher commented 1 year ago

Processing the whole directory with default settings sounds fine. If I remember correctly, I got a recognition rate of ~45%.

Here are my suggestions:

1398listener commented 1 year ago

Thanks a lot, I will give it a try.