Open OrianeN opened 1 year ago
For me, I simply copied create_index.py out of the package, fixed the indexing issue, and ran it outside the Nougat package. The output JSONL file with this fix looks fine.
Replace create_index.py line 44 from
for item in data["pdffigures"]:
to
for item in data["pdffigures"]["figures"]:
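A tolerant variant of this loop can be sketched as follows, so that one script handles both layouts. The helper name `iter_figures` is hypothetical and not part of Nougat. The sketch assumes that `data["pdffigures"]` is either a list of figure entries (what the original loop appears to expect) or a dictionary holding them under "figures" (what was observed in this thread).

```python
from typing import Any, Dict, Iterator


def iter_figures(data: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield figure entries from a loaded meta.json dictionary.

    Hypothetical helper, not part of Nougat. Assumes "pdffigures" is either
    a list of entries or a dict holding them under "figures".
    """
    figures = data.get("pdffigures") or []
    if isinstance(figures, dict):
        # Layout reported in this thread: entries nested under "figures".
        figures = figures.get("figures") or []
    for item in figures:
        # Skip anything that is not a figure entry (e.g. a stray string).
        if isinstance(item, dict):
            yield item


if __name__ == "__main__":
    nested = {"pdffigures": {"figures": [{"page": 0}, {"page": 2}]}}
    flat = {"pdffigures": [{"page": 1}]}
    print([f["page"] for f in iter_figures(nested)])
    print([f["page"] for f in iter_figures(flat)])
```

In a local copy of create_index.py, the loop could then read `for item in iter_figures(data):`, which keeps the rest of the script unchanged.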
I didn't reply but that's what I did as well, thanks for sharing @lilingxi01.
I noticed recently that the meta file is not a result of pdffigures but it is created by the nougat function split_markdown: https://github.com/facebookresearch/nougat/blob/47c77d70727558b4a2025005491ecb26ee97f523/nougat/dataset/split_md_to_pages.py#L279
Hi, in the final generated PDF and mmd pairs, does the mmd file not contain image and table information?
I'm trying to follow the README guidelines to prepare a dataset starting with your paper as sample PDF/LaTeX input.
I've successfully run your command
python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs
which resulted in the following folder structure:
However, when I run
python3 -m nougat.dataset.create_index --dir /path/to/root/folder_paired/ --out /path/to/index.jsonl
, I get the following error:
I tried to dig into it by looking at the code and the meta.json file: https://github.com/facebookresearch/nougat/blob/47c77d70727558b4a2025005491ecb26ee97f523/nougat/dataset/create_index.py#L43-L45
My meta.json file does contain the "pdffigures" key, and its value is not empty.
Yet, the code line
for item in data["pdffigures"]
is probably iterating over the keys of the dictionary rather than its values, as in this code snippet:
Output:
Navigating further into the JSON file, I noticed that the "page" element is under data["pdffigures"]["figures"][0]["page"], and the following works:
Output:
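The behaviour described above can be reproduced with a small stand-in dictionary. This is only a sketch: the structure of `data` below is an assumption modelled on the keys mentioned in this issue, not a copy of a real meta.json file.

```python
# Stand-in for a loaded meta.json (assumed shape; the real file has more fields).
data = {"pdffigures": {"figures": [{"page": 3}, {"page": 7}]}}

# Iterating over a dict yields its keys (strings), not its values.
keys = [item for item in data["pdffigures"]]

# Iterating over the nested list yields the figure entries themselves.
pages = [item["page"] for item in data["pdffigures"]["figures"]]

print(keys)   # ['figures']
print(pages)  # [3, 7]
```

With the first loop, each `item` is the string "figures", so a lookup such as `item["page"]` cannot succeed; the second loop reaches the entries that carry the "page" field.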
This may be due to my version of pdffigures2: I compiled the JAR from the fork for PR https://github.com/allenai/pdffigures2/pull/51, as sbt assembly didn't work in the official repository.
Can you confirm that the meta.json format is different from the one you used to train the released Nougat models? Can you indicate how you compiled pdffigures2's JAR, or else provide a fix to run this script with the fixed pdffigures2 JAR?