Detail about generating dataset

akadoubleone commented 1 year ago

I am trying to generate the dataset: I convert .tex to .html with LaTeXML and then run nougat.dataset.split_htmls_to_pages, but I ran into some problems:

If the .tex file contains a \begin{figure} block, no pages are recognized.
No pages are recognized unless I pass the --nocrossref parameter to latexmlpost; but with --nocrossref, the content of the references, such as the number in square brackets, disappears, as in the image below.

Env: Ubuntu 16.04 & 22.04; LaTeXML 0.8.6 & 0.8.7; Python 3.10. TeX and PDF file source: https://arxiv.org/abs/adap-org/9912004

Question:

How can I solve the problems mentioned above?
How do we reproduce the features described in the paper, including replacing user-defined macros, standardizing whitespace, adding optional brackets, normalizing tables, and replacing references and citations with their correct numbers? Does LaTeXML handle some of these, and if so, how?
Is there a recommended version of LaTeXML? Could you please share the LaTeXML commands you used, including latexml and latexmlpost?

lukas-blecher commented 1 year ago

We used engrafo to call LaTeXML and did not use latexmlpost at all. Please try it again that way.

Yes, LaTeXML does replace user-defined macros and the like. Tables are handled later, in the conversion from HTML to mmd with our parser.

We used LaTeXML v0.8.4, but I think newer versions shouldn't be a problem.
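
For readers who want to try a single-step conversion without a separate latexmlpost call, here is a minimal sketch. It only builds a latexmlc command line; it does not reproduce what engrafo actually runs, which passes its own set of options. The helper name build_latexmlc_command and the example paths are made up for illustration.

```python
import shlex
from pathlib import Path


def build_latexmlc_command(tex_file: str, out_dir: str) -> list[str]:
    """Build (but do not run) a latexmlc call that writes one HTML file.

    Hypothetical helper: this is NOT the command engrafo uses internally.
    """
    tex_path = Path(tex_file)
    dest = Path(out_dir) / f"{tex_path.stem}.html"
    return [
        "latexmlc",
        f"--dest={dest.as_posix()}",  # output file
        "--format=html5",             # ask for HTML instead of XML
        tex_path.as_posix(),          # main .tex source
    ]


cmd = build_latexmlc_command("paper/main.tex", "htmls")
print(shlex.join(cmd))
# latexmlc --dest=htmls/main.html --format=html5 paper/main.tex
```

To actually run it, pass the list to subprocess.run(cmd, check=True) on a machine where LaTeXML is installed.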

1398listener commented 1 year ago

Thank you for your contributions to this research field.

I followed your configuration:

Used engrafo to get index.html (all settings are default).
Used pdffigures2.jar to get the figures JSON files.

The directory structure is as follows:

├── figures
│   ├── 1706.03762.json
│   ├── 2308.13418.json
│   ├── 2309.00916.json
│   └── 2309.07900.json
├── htmls
│   ├── 1706.03762.html
│   ├── 2308.13418.html
│   ├── 2309.00916.html
│   └── 2309.07900.html
├── pdfs
│   ├── 1706.03762.pdf
│   ├── 2308.13418.pdf
│   ├── 2309.00916.pdf
│   └── 2309.07900.pdf
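
Before running the split, it may help to confirm that every paper has all three inputs under the same ID. The sketch below assumes the layout shown above (htmls/, pdfs/ and figures/ under one root); find_incomplete_papers is a hypothetical helper and is not part of nougat.

```python
from pathlib import Path


def find_incomplete_papers(root: str) -> dict[str, list[str]]:
    """Map each paper ID to the sub-directories where its file is missing."""
    base = Path(root)
    expected = {"htmls": ".html", "pdfs": ".pdf", "figures": ".json"}

    # Collect the paper IDs (file stems) present in each sub-directory.
    stems: dict[str, set[str]] = {}
    for name, ext in expected.items():
        folder = base / name
        if folder.is_dir():
            stems[name] = {p.stem for p in folder.glob(f"*{ext}")}
        else:
            stems[name] = set()

    # Report every ID that is absent from at least one sub-directory.
    missing: dict[str, list[str]] = {}
    for paper_id in sorted(set().union(*stems.values())):
        absent = [name for name in expected if paper_id not in stems[name]]
        if absent:
            missing[paper_id] = absent
    return missing
```

An empty result means each ID has an HTML file, a PDF and a figures JSON file. It says nothing about whether their contents match.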

And by using this command:

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

I can only get 4 recognized pages:

ERROR:root:unusable reference "structuredAttentionNetworks, "
ERROR:root:unusable reference "2020a"
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
 25%|██████████████████████████████████████████████▌                                                                                                                                           | 1/4 [00:03<00:09,  3.01s/it]
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.33it/s]
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%

I wonder why the recognition rate is so low. Thanks again.
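
When comparing runs, the per-paper log lines above can be tallied with a short script. This is a sketch that assumes the log format stays exactly as pasted; the summarize function and the regular expression are illustrative and are not part of nougat.

```python
import re

# Matches per-paper lines such as:
# INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
# The "In total" line is skipped because its label contains a space.
PAGE_LINE = re.compile(
    r"^INFO:root:(?P<paper>\S+): (?P<ok>\d+)/(?P<total>\d+) pages recognized"
)


def summarize(log_text: str) -> dict[str, tuple[int, int]]:
    """Return {paper_id: (recognized_pages, total_pages)} from the log text."""
    results = {}
    for line in log_text.splitlines():
        match = PAGE_LINE.match(line.strip())
        if match:
            results[match["paper"]] = (int(match["ok"]), int(match["total"]))
    return results


log = """\
INFO:root:2308.13418: 3/17 pages recognized. Percentage: 17.65%
INFO:root:1706.03762: 0/15 pages recognized. Percentage: 0.00%
INFO:root:2309.07900: 0/13 pages recognized. Percentage: 0.00%
INFO:root:2309.00916: 1/11 pages recognized. Percentage: 9.09%
INFO:root:In total: 4/56 pages recognized. Percentage: 7.14%
"""

per_paper = summarize(log)
# Papers where no page could be matched at all.
failed = [paper for paper, (ok, _) in per_paper.items() if ok == 0]
print(failed)  # ['1706.03762', '2309.07900']
```

Listing the papers with zero recognized pages makes it easier to inspect their HTML and PDF pairs one at a time.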

rtz19970824 commented 10 months ago

Hi there,

I'm wondering whether you deal with mismatches between the references in the LaTeXML output and those in the original PDF. For example, in 2308.00002, the original PDF cites references by author name and year, while the LaTeXML output cites them by number. In 2308.00082, the original PDF uses numbers, while the LaTeXML output seems to use the bibitem identifier. Any information would be helpful.

Thanks in advance!
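
One way to spot this kind of mismatch automatically is to guess the citation style of the PDF text and of the LaTeXML output separately, then compare the two. The sketch below is a rough heuristic of my own, not something nougat does; guess_citation_style and both patterns are hypothetical and will miss many real citation formats.

```python
import re

# Numeric citations such as [1], [2, 3] or [4-6].
NUMERIC = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
# Author-year citations such as "Vaswani et al. (2017)" or "Kim, 2020a".
AUTHOR_YEAR = re.compile(
    r"[A-Z][A-Za-z'-]+(?: et al\.| and [A-Z][A-Za-z'-]+)?,? "
    r"\(?(?:19|20)\d{2}[a-z]?\)?"
)


def guess_citation_style(text: str) -> str:
    """Return 'numeric', 'author-year' or 'unknown' by counting matches."""
    numeric = len(NUMERIC.findall(text))
    author_year = len(AUTHOR_YEAR.findall(text))
    if numeric == author_year:  # also covers the case with no matches
        return "unknown"
    return "numeric" if numeric > author_year else "author-year"


pdf_text = "Attention was introduced by Vaswani et al. (2017) and Kim, 2020a."
html_text = "Attention was introduced in [1] and extended in [2, 3]."
print(guess_citation_style(pdf_text), guess_citation_style(html_text))
```

If the two sources disagree on style, the paper is a candidate for manual inspection before it is used for page matching.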

crescent73 commented 10 months ago

Hello, I've received your letter!

CHENG-EMMA1 commented 6 months ago

@1398listener Have you finished processing by now? Does the Markdown produced by following these steps omit the images and tables? In my case they do not appear at the end of the Markdown either.

crescent73 commented 6 months ago

Hello, I've received your letter!

facebookresearch / nougat

Detail about generating dataset #10