Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.55k stars, 100 forks

'dict' object has no attribute 'name' #74

Closed: qkxie closed this issue 3 weeks ago

qkxie commented 1 month ago

### Description

Running your example code raises an exception.

Traceback (most recent call last):
  File "/Users/qkxie/Project/test_open_parse/test.py", line 5, in <module>
    parsed_basic_doc = parser.parse(basic_doc_path)
  File "/Users/qkxie/.local/share/virtualenvs/test_open_parse-oL_mu9SY/lib/python3.9/site-packages/openparse/doc_parser.py", line 100, in parse
    text_elems = text.ingest(doc, parsing_method=text_engine)
  File "/Users/qkxie/.local/share/virtualenvs/test_open_parse-oL_mu9SY/lib/python3.9/site-packages/openparse/text/parse.py", line 19, in ingest
    return pdfminer.ingest(doc)
  File "/Users/qkxie/.local/share/virtualenvs/test_open_parse-oL_mu9SY/lib/python3.9/site-packages/openparse/text/pdfminer/core.py", line 181, in ingest
    mime_type = get_mime_type(e)
  File "/Users/qkxie/.local/share/virtualenvs/test_open_parse-oL_mu9SY/lib/python3.9/site-packages/openparse/text/pdfminer/core.py", line 67, in get_mime_type
    subtype = pdf_object.stream.attrs.get("Subtype", {"name": None}).name
AttributeError: 'dict' object has no attribute 'name'
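
A note on the last frame of the traceback: the fallback passed to `.get()` is a plain dict, and a dict exposes its entries by key, not as attributes, so `.name` fails whenever that dict is what comes back. The sketch below reproduces the failure with stand-in objects and shows one defensive way to read the value. It is an illustration only, not the library's actual fix: `get_subtype_name` is a hypothetical helper, and `SimpleNamespace` merely stands in for pdfminer's name objects.

```python
from types import SimpleNamespace


def get_subtype_name(attrs):
    """Return the name of the "Subtype" entry, or None if it is absent.

    Hypothetical helper: getattr with a default avoids assuming that the
    looked-up value has a .name attribute.
    """
    subtype = attrs.get("Subtype")
    return getattr(subtype, "name", None)


# Stand-ins for stream attributes (assumed shapes, for illustration only).
image_attrs = {"Subtype": SimpleNamespace(name="Image")}
empty_attrs = {}

# The pattern from the traceback: the dict default has no .name attribute.
try:
    empty_attrs.get("Subtype", {"name": None}).name
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(get_subtype_name(image_attrs), get_subtype_name(empty_attrs))
```

The same helper also copes with a "Subtype" value that is itself a dict, since `getattr` falls back to `None` in that case too.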

### Example Code

import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)

### Python, open-parse & OS Version

```Text
python_version: 3.9.6
             operating_system: Darwin
                   os_version: 23.6.0
           open-parse version: 0.6.0
                 install path: /Users/qkxie/.local/share/virtualenvs/test_open_parse-oL_mu9SY/lib/python3.9/site-packages/openparse
               python version: 3.9.6 (default, Feb  3 2024, 15:58:27)  [Clang 15.0.0 (clang-1500.3.9.4)]
                     platform: macOS-14.6.1-arm64-arm-64bit
             related packages: pydantic-2.9.2 PyMuPDF-1.24.12
```

NuiMrme commented 1 month ago

This seems to be a pdfminer issue (see here). However, PyMuPDF works well for me; to activate it, call parser.parse(file, ocr=True).
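
For anyone stuck on 0.6.0, the workaround above could be wrapped so that the default text engine is tried first and the `ocr=True` path is used only when the `AttributeError` appears. This is a rough sketch under assumptions: `parse_with_fallback` is a hypothetical helper that is not part of open-parse, and the `ocr` flag is taken from the comment above rather than verified against every release.

```python
def parse_with_fallback(parser, path):
    """Parse with the default engine, retrying with ocr=True on failure.

    Hypothetical helper: `parser` is any object with a
    parse(path, ocr=False) method, such as openparse.DocumentParser().
    """
    try:
        return parser.parse(path)
    except AttributeError:
        # The default (pdfminer) path raised the error from this issue;
        # retry with the PyMuPDF path suggested in the comment above.
        return parser.parse(path, ocr=True)


# Usage, assuming open-parse is installed and the sample PDF exists:
#   import openparse
#   doc = parse_with_fallback(
#       openparse.DocumentParser(), "./sample-docs/mobile-home-manual.pdf"
#   )
```

Catching only `AttributeError` keeps unrelated failures, such as a missing file, visible instead of silently retrying them.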

Filimoa commented 3 weeks ago

Fixed in 0.6.1