CatchTheTornado / pdf-extract-api

Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown
https://demo.doctractor.com
GNU General Public License v3.0
1.38k stars 92 forks

Challenges with LLMs Not Respecting Provided Fields in JSON Outputs #33

Closed: kreativitat closed this issue 20 hours ago

kreativitat commented 1 week ago

When utilizing Large Language Models to extract data from documents such as invoices and generate structured outputs like JSON files, a common issue arises: the LLM does not always adhere strictly to the provided fields and sometimes invents new ones. This behavior poses significant challenges for applications that require exact data formats for database integration and other automated processes.

pkarw commented 1 week ago

Hey! Definitely. I think one option is an optimized prompt, for example one that includes a JSON schema, which in my experience reduces this problem significantly.
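To make the schema-in-prompt idea concrete, here is a minimal sketch. The schema, its field names, and the `build_prompt` helper are assumptions made up for illustration; none of them come from this repository.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "ISO 8601 date"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "issue_date", "total"],
    "additionalProperties": False,
}


def build_prompt(schema: dict) -> str:
    """Return an extraction prompt that embeds the JSON schema verbatim."""
    return (
        "Extract the data from the document below.\n"
        "Return only a JSON object that conforms to this JSON schema. "
        "Do not add fields that are not listed in the schema.\n\n"
        f"{json.dumps(schema, indent=2)}"
    )


prompt = build_prompt(INVOICE_SCHEMA)
```

The resulting string could be passed through the client's `--prompt` option. A schema in the prompt only nudges the model; it does not guarantee that the output conforms.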

The second option, I guess, would be to add an output validator (e.g. using pydantic)?
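A validator along those lines might look like the sketch below. It assumes pydantic v2 (`ConfigDict`, `model_validate_json`); the `Invoice` model, its fields, and `parse_llm_output` are hypothetical names chosen for the example.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class Invoice(BaseModel):
    # extra="forbid" rejects any field the LLM invented.
    model_config = ConfigDict(extra="forbid")

    # Hypothetical fields; replace with the ones your database expects.
    invoice_number: str
    issue_date: str
    total: float


def parse_llm_output(raw: str) -> Optional[Invoice]:
    """Return a validated Invoice, or None if the output breaks the schema."""
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError:
        # Raised for invalid JSON, missing fields, wrong types, and extra fields.
        return None


good = parse_llm_output(
    '{"invoice_number": "INV-1", "issue_date": "2021-04-29", "total": 12.5}'
)
bad = parse_llm_output(
    '{"invoice_number": "INV-1", "issue_date": "2021-04-29", '
    '"total": 12.5, "currency": "USD"}'
)
```

Here `bad` is rejected because of the unexpected `currency` field. On a rejection the caller could retry the request or feed the validation errors back into the prompt.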

Do you see any actionable items you'd like to see come out of this issue? I mean, we could work on a specific example to optimize the prompt, or maybe you'd like to create a feature request (FR)?

Let's make it more actionable :)

pkarw commented 1 week ago

Context: https://www.reddit.com/r/LocalLLaMA/comments/1eqayuq/how_to_force_llama31_to_respond_with_json_only/?rdt=44352

pkarw commented 1 week ago

Another option is to add an output format parameter: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion

It supports json.
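For reference, a request using that parameter could be sketched as follows. It assumes Ollama is running locally on its default port; the helper names are hypothetical, and the model name is up to the caller.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a /api/generate request body that asks for JSON output."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",  # ask Ollama to return valid JSON
        "stream": False,   # return one response object instead of a stream
    }


def generate_json(model: str, prompt: str) -> dict:
    """Send the request and parse the model's JSON answer."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        envelope = json.load(response)
    # The generated text is in the "response" field of the envelope.
    return json.loads(envelope["response"])
```

Note that `format: "json"` only makes the output syntactically valid JSON. It does not by itself stop the model from inventing fields, so the prompt should still ask for JSON and name the expected fields.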

Could you try this out and create a PR or FR?

I think we could just add a proxy parameter, output_format, to the API and CLI, to be used when a prompt is provided.
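A rough sketch of how such a proxy parameter might be wired on the CLI side is below. This is a standalone illustration, not the project's actual `client/cli.py`: the `--output_format` flag is only the proposal from this thread, and `build_parser` and `llm_options` are hypothetical helpers.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of an OCR client CLI")
    subparsers = parser.add_subparsers(dest="command", required=True)

    ocr = subparsers.add_parser("ocr_request")
    ocr.add_argument("--file", required=True)
    ocr.add_argument("--prompt", default=None)
    # Hypothetical flag proposed in this thread; not part of the real CLI.
    ocr.add_argument("--output_format", choices=["json"], default=None)
    return parser


def llm_options(args: argparse.Namespace) -> dict:
    """Map CLI arguments to the options forwarded to the LLM backend."""
    options = {}
    # The format only matters when a prompt is sent to the LLM.
    if args.prompt and args.output_format:
        options["format"] = args.output_format
    return options


args = build_parser().parse_args(
    ["ocr_request", "--file", "examples/example-mri.pdf",
     "--prompt", "Return only JSON format", "--output_format", "json"]
)
options = llm_options(args)
```

The resulting `options` dictionary would then be forwarded to the backend request, where `format` maps onto the Ollama parameter of the same name.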

pkarw commented 1 week ago

I've just tried the following one and it worked pretty well:

(.venv) piotrkarwatka@Piotrs-MacBook-Pro-2 pdf-extract-api % python client/cli.py ocr_request --file examples/example-mri.pdf --ocr_cache --prompt "Return only JSON format"
/Users/piotrkarwatka/Projects/pdf-extract-api/client/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
File uploaded successfully. Task Id: ac037626-021a-4d48-a34c-89c6fe4b3168 Waiting for the result...
{'state': 'PENDING', 'status': 'Task is pending...'}
{'state': 'PROGRESS', 'status': 'Processing LLM', 'info': {'progress': 75, 'status': 'Processing LLM', 'elapsed_time': 2.1235079765319824}}
{'state': 'PROGRESS', 'status': 'LLM Processing chunk no: 35', 'info': {'progress': 34, 'status': 'LLM Processing chunk no: 34', 'elapsed_time': 4.134450912475586}}
{'state': 'PROGRESS', 'status': 'LLM Processing chunk no: 125', 'info': {'progress': 125, 'status': 'LLM Processing chunk no: 125', 'elapsed_time': 6.150297164916992}}
{'state': 'PROGRESS', 'status': 'LLM Processing chunk no: 213', 'info': {'progress': 213, 'status': 'LLM Processing chunk no: 213', 'elapsed_time': 8.164186954498291}}
```json
{
  "address": {
    "street1": "0 Maywood Ave.",
    "city": "Maywood",
    "state": "NJ",
    "zip": "0000"
  },
  "practice": {
    "name": "Ikengil Radiology Associa",
    "website": "DikengilRadiologyAssociates.com",
    "phone": "201-725-0913"
  },
  "patient": {
    "names": "Jane, Mary",
    "dob": "1966-00-00",
    "age": 55,
    "sex": "F",
    "accountNumber": "00002"
  },
  "study": {
    "type": "Brain MRI",
    "dateOfService": "2021-04-29"
  },
  "diagnosis": {
    "condition": "Chiari I malformation with 10 mm descent of cerebellar tonsils."
  },
  "imagingTechnique": {
    "description": "Noncontrast MRI of the brain was performed in the three orthogonal planes utilizing T1/T2/T2 FLAIR/T2* GRE/Diffusion-ADC sequences."
  }
}