Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.34k stars 89 forks

nodes output #7

Closed atgreen closed 5 months ago

atgreen commented 5 months ago

Probably user error... but when I run your sample program in regular Python, the printed nodes don't look like the structured node output I expect. Instead I get...

elements=(TextElement(text='Applicant Information:\nIndividual in Personal Capacity\nOrganization Name (if applicable):\nFirst Name:\nJohn\nLast Name:\nSmith', lines=(LineElement(bbox=(62.08, 155.52, 192.99, 168.75), spans=(TextSpan(text='Applicant Information:', is_bold=False, is_italic=False, size=13.23),), style=None, text='Applicant Information:'), LineElement(bbox=(62.08, 136.41, 241.87, 149.64), spans=(TextSpan(text='Individual in Personal Capacity', is_bold=False, is_italic=False, size=13.23),), style=None, text='Individual in Personal Capacity'), LineElement(bbox=(62.08, 117.29, 262.4, 130.52), spans=(TextSpan(text='Organization Name (if applicable):', is_bold=False, is_italic=False, size=13.23),), style=None, text='Organization Name (if applicable):'), LineElement(bbox=(62.08, 98.18, 130.0, 111.41), spans=(TextSpan(text='First Name:', is_bold=False, is_italic=False, size=13.23),), style=None, text='First Name:'), LineElement(bbox=(62.08, 79.8, 110.75, 93.03), spans=(TextSpan(text='John', is_bold=False, is_italic=False, size=13.23),), style=None, text='John'), LineElement(bbox=(62.08, 60.69, 129.3, 73.92), spans=(TextSpan(text='Last Name:', is_bold=False, is_italic=False, size=13.23),), style=None, text='Last Name:'), LineElement(bbox=(62.08, 41.58, 96.79, 54.81), spans=(TextSpan(text='Smith', is_bold=False, is_italic=False, size=13.23),), style=None, text='Smith')), 

etc etc

I was expecting the json-looking output from the documentation. What am I looking at?

Filimoa commented 5 months ago

You can serialize them using node.model_dump() (uses pydantic under the hood), which returns a plain Python dict. I will add a section to the docs on it.

Filimoa commented 5 months ago

Added section to README + docs

atgreen commented 5 months ago

Is it really well-formed JSON?

Should... {'variant': {'text'}, ..really be.. {'variant': 'text',

?

Filimoa commented 5 months ago

To get valid JSON you can run parsed_basic_doc.json(). A node can be composed of text, table, or image elements, so the variant is a set (or a list in JSON).
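
The set-versus-list point can be illustrated with the standard library alone. This is only a sketch, not open-parse's own code: node_dict is a hypothetical stand-in for what a node's model_dump() might return, and encode_sets is an illustrative helper. Printing a dict that holds a set shows Python's repr ({'text'}), which is not JSON; a JSON encoder has to turn the set into a list.

```python
import json

# Hypothetical stand-in for the dict returned by a node's model_dump();
# only the 'variant' key is taken from the thread, the rest is illustrative.
node_dict = {"variant": {"text"}, "text": "Applicant Information:"}

# Printing the dict shows Python's repr, e.g. {'variant': {'text'}, ...},
# which is a Python literal and not well-formed JSON.
print(node_dict)


def encode_sets(obj):
    """Fallback for json.dumps: represent sets as sorted lists."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# json.dumps cannot encode a set by itself, so the fallback converts it.
node_json = json.dumps(node_dict, default=encode_sets)
print(node_json)
```

The second print shows the variant as a list, which is the shape the JSON output takes.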

Noexpert commented 3 months ago

What is the correct syntax for outputting in json? I'm trying to scrape a pdf table.

>>> parsed_basic_doc.json = parser.parse(basic_doc_path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.11/site-packages/pydantic/main.py", line 839, in __setattr__
    raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
ValueError: "ParsedDocument" object has no field "json"

Filimoa commented 3 months ago

In Python, you first assign the result of parse() to a variable and then call its json() method. json is a method to call, not an attribute to assign to.

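import os

from openparse import processing, DocumentParser

# Placeholders: supply your own API key and PDF path.
OPEN_AI_KEY = os.environ["OPENAI_API_KEY"]
basic_doc_path = "path/to/your.pdf"

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=OPEN_AI_KEY,
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)

# Assign the parse result to a variable first...
parsed_content = parser.parse(basic_doc_path)

# ...then call json() on it to get the JSON output.
print(parsed_content.json())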
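
For the table-scraping use case, the JSON output can then be filtered by variant. The following is a rough sketch using only the standard library: sample_json is made-up data, and the "nodes" and "text" field names are assumptions about the output layout (only "variant" is confirmed in this thread), so check them against your actual output. nodes_with_variant is a hypothetical helper, not part of open-parse.

```python
import json

# Made-up sample; in practice this would be the string from parsed_content.json().
# The "nodes" and "text" keys are assumed, not confirmed by the thread.
sample_json = json.dumps(
    {
        "nodes": [
            {"variant": ["text"], "text": "First Name: John"},
            {"variant": ["table"], "text": "<table>...</table>"},
        ]
    }
)


def nodes_with_variant(doc_json, variant):
    """Return the nodes whose 'variant' list contains the given variant."""
    doc = json.loads(doc_json)
    return [
        node
        for node in doc.get("nodes", [])
        if variant in node.get("variant", [])
    ]


# Keep only the nodes that contain a table element.
table_nodes = nodes_with_variant(sample_json, "table")
for node in table_nodes:
    print(node["text"])
```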