Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
https://unstract.com
GNU Affero General Public License v3.0

incomplete extracted result #717

Open haluwong opened 1 month ago

haluwong commented 1 month ago

Hi all, we have a 7-page PDF delivery note and would like to extract the item information from it. There are 13 items, but Unstract extracts only 6. I can use a prompt to get the total number of items, which suggests that all the pages are being read.

But for the details, it cannot extract all the data. Here is my prompt:

Extract the following details from the text and format them into JSON:

Part Number: The value that appears immediately before "UPC:". Ensure it is not the value after "CPU:". (e.g., 960-001312, PC-LABEL, UCSC-C220-M6S)
Ship Qty
Order Qty
SKU
Description
Serial Numbers
Return the result in JSON format as an array of objects, each containing:

"part_number"
"order_qty"
"ship_qty"
"sku"
"description"
"serial_numbers"

[Screenshot: screen_20240920_03]

[Screenshot: screen_20240920_05]

Items after "007" cannot be extracted. Is there any limit on the output size?

Here is the JSON output for the above prompt: result.json

VikashPratheepan commented 1 month ago

Yes @haluwong - The gpt-4 model has an output token limit of 4096. You need to choose a model with a higher output token limit.
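
A quick way to check whether a reply was cut off by the output limit is to test that it parses as JSON and that it holds the number of items the document is known to contain (13 in this report). The following is a minimal sketch, not part of Unstract: `looks_truncated` and `is_complete` are hypothetical helper names, and the sample strings, including the `ship_qty` values, are made up for illustration.

```python
import json


def looks_truncated(raw_reply: str) -> bool:
    """Return True if the reply is not a complete JSON document."""
    try:
        json.loads(raw_reply)
    except json.JSONDecodeError:
        return True
    return False


def is_complete(raw_reply: str, expected_items: int) -> bool:
    """Return True if the reply parses as a JSON array with the expected item count."""
    if looks_truncated(raw_reply):
        return False
    items = json.loads(raw_reply)
    return isinstance(items, list) and len(items) == expected_items


# Illustrative replies only; the values are invented for this sketch.
complete = '[{"part_number": "960-001312", "ship_qty": 1}]'
cut_off = '[{"part_number": "960-001312", "ship_qty": 1}, {"part_number": "PC-LA'

print(looks_truncated(complete))  # False
print(looks_truncated(cut_off))   # True
print(is_complete(complete, 13))  # False: valid JSON, but too few items
```

The item-count check matters because a model may close the array early and still return valid JSON, in which case the parse check alone would not catch the missing items.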

ashwanthkumar commented 3 weeks ago

@VikashPratheepan -- Curious, how do we handle an extraction whose result is larger than the LLM's output token limit? Most LLMs are growing their input size far more than their output size.

shuveb commented 3 weeks ago

@ashwanthkumar we handle this by internally splitting the context, making multiple requests and responding with a concatenated result. However, this feature is only available in the enterprise version.
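
The split, request and concatenate approach described above can be sketched roughly as follows. This is not Unstract's enterprise implementation, which is not shown in this thread. It assumes the document text is already available page by page, and `extract_chunk` is a hypothetical callable standing in for one LLM request that returns a list of item objects. `fake_extract` is a stand-in used only so the sketch runs without a model.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]


def split_pages(pages: List[str], pages_per_chunk: int) -> List[List[str]]:
    """Group consecutive pages into chunks of at most pages_per_chunk pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        pages[i:i + pages_per_chunk]
        for i in range(0, len(pages), pages_per_chunk)
    ]


def extract_all(
    pages: List[str],
    extract_chunk: Callable[[str], List[Item]],
    pages_per_chunk: int = 2,
) -> List[Item]:
    """Run one extraction request per chunk and concatenate the results in order."""
    results: List[Item] = []
    for chunk in split_pages(pages, pages_per_chunk):
        results.extend(extract_chunk("\n".join(chunk)))
    return results


def fake_extract(text: str) -> List[Item]:
    """Stand-in for an LLM call: one item per line that starts with 'ITEM '."""
    return [
        {"part_number": line.split(" ", 1)[1]}
        for line in text.splitlines()
        if line.startswith("ITEM ")
    ]


# Seven made-up pages, one item each, processed two pages per request.
demo_pages = [f"ITEM part-{n}" for n in range(1, 8)]
demo_items = extract_all(demo_pages, fake_extract, pages_per_chunk=2)
print(len(demo_items))  # 7
```

Each request then only has to emit the items from a few pages, so its output stays under the model's limit. An item that straddles a chunk boundary could be split or duplicated in this naive version, so a real implementation would need overlap or deduplication.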

cwikio commented 6 days ago

I have the same problem. This should never happen, since ChatGPT simply asks whether you want to continue. Why is this feature not built into the software? Without it, the tool is nearly useless for documents of any substantial size, apart perhaps from receipts and short bank statements.