create_index raises AttributeError

OrianeN commented 1 year ago

I'm trying to follow the README guidelines to prepare a dataset starting with your paper as sample PDF/LaTeX input.

I've successfully run your command python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs which resulted in the following folder structure:

root/
├─ folder_paired/
│  ├─ sample_paper/
│  │  ├─ 01.mmd
│  │  ├─ 01.png
│  │  ├─ 04.mmd
│  │  ├─ 04.png
│  │  ├─ 05.mmd
│  │  ├─ 05.png
│  │  ├─ meta.json

However, when I run python3 -m nougat.dataset.create_index --dir /path/to/root/folder_paired/ --out /path/to/index.jsonl, I get the following error:

0%|                                                     | 0/1 [00:00<?, ?it/s]
pebble.common.RemoteTraceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pebble/common.py", line 174, in process_execute
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nougat/dataset/create_index.py", line 70, in index_paper
    meta = read_metadata(json.load(meta_file.open("r", encoding="utf-8")))
  File "/usr/local/lib/python3.10/dist-packages/nougat/dataset/create_index.py", line 45, in read_metadata
    p = item.pop("page", None)
AttributeError: 'str' object has no attribute 'pop'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/nougat/dataset/create_index.py", line 173, in <module>
    create_index(args)
  File "/usr/local/lib/python3.10/dist-packages/nougat/dataset/create_index.py", line 129, in create_index
    res = tasks[fname].result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
AttributeError: 'str' object has no attribute 'pop'

I dug into the code and the meta.json file: https://github.com/facebookresearch/nougat/blob/47c77d70727558b4a2025005491ecb26ee97f523/nougat/dataset/create_index.py#L43-L45

{"pdffigures": {"figures": [{"caption": "Figure 5: Example of a page with many mathematical equations taken from [41]. Left: Image of a page in the document, Right: Model output converted to LaTeX and rendered to back into a PDF. Examples of scanned documents can be found in the appendix B.", "captionBoundary": {"x1": 63.63800048828125, "x2": 532.8878173828125, "y1": 498.288330078125, "y2": 526.5870361328125}, "figType": "Figure", "imageText": ["Replica", "calculation", "of", "the", "generalization", "error", etc..., "("], "name": "5", "page": 5, "regionBoundary": {"x1": 63.0, "x2": 539.0, "y1": 210.8900146484375, "y2": 490.8900146484375}} etc...

My meta.json file does contain the "pdffigures" key, and its value is not empty.

However, data["pdffigures"] is a dictionary here, so the loop for item in data["pdffigures"] iterates over its keys rather than its values. Each item is therefore a string (here "figures"), and item.pop fails, as in this snippet:

d = {"a": 1, "b": 2, "c": 3}
for item in d:
    print(item)

Output :

a
b
c
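
The same failure can be reproduced in isolation. This is a minimal sketch: the dictionary is a hypothetical, trimmed-down stand-in for meta.json, and first_error is an illustrative helper that only mimics the loop in read_metadata, not the actual nougat code.

```python
# Hypothetical, trimmed-down stand-in for the contents of meta.json.
data = {"pdffigures": {"figures": [{"name": "5", "page": 5}]}}


def first_error(meta):
    """Mimic the failing loop and return the error message, if any."""
    try:
        for item in meta["pdffigures"]:
            # item is the dictionary key "figures" (a str), not a figure dict.
            item.pop("page", None)
    except AttributeError as exc:
        return str(exc)
    return None


error_message = first_error(data)
print(error_message)
```

With the nested layout above, the loop never reaches a figure dictionary, which matches the AttributeError in the traceback.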

Looking further into the JSON file, I noticed that the "page" element is under data["pdffigures"]["figures"][0]["page"], and the following works:

for item in data["pdffigures"]["figures"]:
    print(item["page"])

Output:

This may be due to my version of pdffigures2: I compiled the JAR from the fork for PR https://github.com/allenai/pdffigures2/pull/51 because sbt assembly didn't work in the official repository.

Can you confirm that this meta.json format is different from the one you used to train the released Nougat models? Can you indicate how you compiled the pdffigures2 JAR, or else provide a fix so that this script runs with the pdffigures2 JAR built from that fork?

lilingxi01 commented 1 year ago

I simply copied create_index.py out of the package, fixed the indexing issue, and ran it outside the Nougat package. The output JSONL file with this fix looks fine.

Change line 44 of create_index.py from

        for item in data["pdffigures"]:

to

        for item in data["pdffigures"]["figures"]:
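
If both layouts need to be supported, a more tolerant variant is possible. This is only a sketch: iter_figures is a hypothetical helper, not part of nougat, and the flat layout (a list stored directly under "pdffigures") is an assumption inferred from the original loop, not something confirmed in this thread.

```python
def iter_figures(data):
    """Yield figure dicts from either assumed meta.json layout.

    Hypothetical helper, not part of nougat.
    """
    figures = data.get("pdffigures") or []
    # Nested layout seen in this issue: {"pdffigures": {"figures": [...]}}
    if isinstance(figures, dict):
        figures = figures.get("figures") or []
    for item in figures:
        # Skip anything that is not a figure dictionary.
        if isinstance(item, dict):
            yield item


# Illustrative inputs only; real meta.json files carry many more fields.
nested = {"pdffigures": {"figures": [{"name": "5", "page": 5}]}}
flat = {"pdffigures": [{"name": "5", "page": 5}]}

pages_nested = [item["page"] for item in iter_figures(nested)]
pages_flat = [item["page"] for item in iter_figures(flat)]
print(pages_nested, pages_flat)
```

The loop in create_index.py could then iterate over iter_figures(data) instead of indexing into data["pdffigures"] directly.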

OrianeN commented 1 year ago

I didn't reply but that's what I did as well, thanks for sharing @lilingxi01.

I noticed recently that the meta file is not produced by pdffigures itself but is created by the nougat function split_markdown: https://github.com/facebookresearch/nougat/blob/47c77d70727558b4a2025005491ecb26ee97f523/nougat/dataset/split_md_to_pages.py#L279

CHENG-EMMA1 commented 8 months ago

Hi, in the final generated PDF and mmd pairs, does the mmd file not contain image and table information?
