Some questions about generating dataset

1398listener commented 1 year ago

Thank you for your contributions to this research field.

I followed your configuration:

Using engrafo to get index.html (all settings are default)
Using pdffigures2.jar to get the figure JSON files

The directory structure is as follows:

├── figures
│   ├── 1706.03762.json
│   ├── 2308.13418.json
│   ├── 2309.00916.json
│   └── 2309.07900.json
├── htmls
│   ├── 1706.03762.html
│   ├── 2308.13418.html
│   ├── 2309.00916.html
│   └── 2309.07900.html
├── pdfs
│   ├── 1706.03762.pdf
│   ├── 2308.13418.pdf
│   ├── 2309.00916.pdf
│   └── 2309.07900.pdf
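
Before running the split, it may be worth confirming that the three folders line up by arXiv ID. The snippet below is only a sketch: the folder names and extensions are the ones from the tree above, and `unpaired_ids` is a hypothetical helper, not something provided by nougat.

```python
from pathlib import Path


def unpaired_ids(root):
    """For each folder, list the IDs that exist elsewhere but are missing there."""
    root = Path(root)
    # Folder names and extensions are assumed from the directory tree above.
    layout = {"pdfs": ".pdf", "htmls": ".html", "figures": ".json"}
    ids = {
        folder: {p.stem for p in (root / folder).glob(f"*{ext}")}
        for folder, ext in layout.items()
    }
    everything = set().union(*ids.values())
    return {folder: sorted(everything - found) for folder, found in ids.items()}
```

An empty list for every folder means each PDF has a matching HTML file and figure JSON file.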

And by using this command:

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

I can only get 4 recognized pages:

ERROR:root:unusable reference "structuredAttentionNetworks, "
ERROR:root:unusable reference "2020a"
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
 25%|██████████████████████████████████████████████▌                                                                                                                                           | 1/4 [00:03<00:09,  3.01s/it]
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.33it/s]
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%

Questions:

Do you use engrafo to process the whole source dir or just the main tex file?
Do you use engrafo's '--no-post-processing' setting? (I have tried both with and without post-processing, and the generated HTML files are the same.)
My recognition percentage is very low. Is this a special case, or is it normal? What percentage did you get?

Thanks again.

lukas-blecher commented 1 year ago

Processing the whole dir with default settings sounds fine. If I remember correctly, I got a percentage of ~45%

Here are my suggestions:

Try looking at the generated markdown files (before splitting) and see if you can find any issues there
Try with more examples. It might be that everything works fine and you got unlucky.
You can experiment with the min_score parameter here
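
When trying more examples or different min_score values, it can help to tally the per-document lines of the log so that runs are easy to compare. This is a minimal sketch that assumes the log keeps the format shown in the first comment; `tally` is a hypothetical helper and not part of nougat.

```python
import re

# Matches per-document lines such as
# "INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%".
# The "In total" summary line is skipped because its label contains a space.
LINE = re.compile(
    r"INFO:root:(?P<doc>\S+): (?P<ok>\d+)/(?P<total>\d+) pages recognized"
)


def tally(log_text):
    """Return ({doc: (recognized, total)}, recognized_sum, total_sum)."""
    per_doc = {}
    for m in LINE.finditer(log_text):
        per_doc[m["doc"]] = (int(m["ok"]), int(m["total"]))
    ok = sum(o for o, _ in per_doc.values())
    total = sum(t for _, t in per_doc.values())
    return per_doc, ok, total
```

Sorting the `per_doc` result by recognition ratio gives a quick list of which papers to inspect first in the generated markdown.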

1398listener commented 1 year ago

Thanks a lot, I will give it a try.

facebookresearch / nougat

Some questions about generating dataset #93