emcf / thepipe

Extract clean data from anywhere, powered by vision-language models ⚡
https://thepi.pe
MIT License
1.19k stars 77 forks source link

extract by chunk is not working #31

Open zarlicho opened 1 month ago

zarlicho commented 1 month ago

I want to extract a chunk from json with the extract_from_chunk() function but I get an error like this ({'chunk_index': 4, 'source': 'pdf', 'error': "'list' object has no attribute 'to_message'"}, 0)

print(thepipe.extract.extract_from_chunk(chunk=chunk,chunk_index=4,schema="bill_name",ai_model='openai/gpt-40',source='pdf',multiple_extractions=True,extraction_prompt=prompting,host_images=True))

emcf commented 1 month ago

extract_from_chunk expects a single chunk object, not a list. Also, be sure to pass a dictionary in as the schema, not a string. To avoid this type of issue you can actually specify the chunking method inside the extraction function:

from thepipe.chunker import chunk_by_page

results = extract_from_file(
    "example.pdf",
    schema={"section_title": "string", "content": "string"},
    chunking_method=chunk_by_page
)