facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

ERROR:root:missing reference detected #190

Open HGGshiwo opened 8 months ago

HGGshiwo commented 8 months ago

I get an error when using split_htmls_to_pages to process https://ar5iv.labs.arxiv.org/html/1110.5321. The error message is ERROR:root:missing reference detected.

I found that the error is caused by the references "br" and "LABEL:eq1". In latexml_parser.py, line 175, "br" and "LABEL:eq1" are not numeric and have no href, so resolved is False. I think it is common for a reference not to be a number. Could you find a way to handle this?
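To illustrate what I mean, here is a minimal sketch of the check as I understand it. This is not the actual code from the parser: the function names `is_resolved_strict` and `is_resolved_lenient` and their parameters are hypothetical, and the lenient variant is only one possible way to accept non-numeric references.

```python
from typing import Optional


def is_resolved_strict(text: str, href: Optional[str] = None) -> bool:
    """Hypothetical stand-in for the behavior described above:
    a reference counts as resolved only if it has an href or its text is numeric."""
    return bool(href) or text.strip().isnumeric()


def is_resolved_lenient(text: str, href: Optional[str] = None) -> bool:
    """Assumed alternative: also accept any non-empty, non-numeric label."""
    return bool(href) or bool(text.strip())


if __name__ == "__main__":
    # The two labels from the page above, plus a numeric one for comparison.
    for label in ["3", "br", "LABEL:eq1"]:
        print(label, is_resolved_strict(label), is_resolved_lenient(label))
```

With the strict check, "br" and "LABEL:eq1" are rejected unless they carry an href; the lenient check accepts both. Whether a label like "LABEL:eq1" should really count as resolved is a separate question that the sketch does not settle.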

By the way, my overall conversion rate is around 17% (about 500,000 pairs of PDF and HTML). Is this normal?