Context
An important goal of Demokratis is to make it easy for people and organisations to provide feedback (statements, Stellungnahmen) on consultations. To facilitate writing comments or suggesting edits on long legal documents, we need to break them apart into sections, paragraphs, lists, footnotes, etc. All the consultation documents we can currently access are PDFs, which makes it surprisingly hard to extract machine-readable structure from them!
Our first experiment
For shorter documents, the most workable approach seems to be prompting GPT-4o to analyse a whole uploaded PDF file and emit the extracted structure as JSON. With careful chunking, this may also work for longer documents. In initial tests, GPT-4o performed better at this task than Gemini 1.5 Pro. See our starting prompt for GPT-4o here, along with sample input and output.
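As a rough illustration, the sketch below shows the kind of single-prompt call we have in mind, using the OpenAI Python SDK. It is not the exact setup from the experiment: for simplicity it extracts the PDF text locally with pypdf instead of uploading the file, and the schema described in the system prompt is only a placeholder, not our final format.

```python
# Hedged sketch: local text extraction + one GPT-4o call that must return JSON.
# The prompt wording and the element schema are illustrative assumptions.
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_structure(pdf_path: str) -> str:
    """Ask GPT-4o to turn a (short) consultation PDF into structured JSON."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # forces syntactically valid JSON
        messages=[
            {
                "role": "system",
                "content": (
                    "You segment Swiss legal consultation documents. "
                    "Return JSON with a top-level 'elements' list; each element has "
                    "'type' (section|paragraph|list_item|footnote) and 'text'."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```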
The issue
We need to improve this prompt, or use a sequence of prompts, to get the correct structure out of the LLM. The current results look promising, but parts of the documents are missing and not all requirements are followed. The right solution may involve giving the LLM examples, chaining several prompts, validating the output with Python, or perhaps discovering that o1, Claude, Mistral, or yet another LLM does this better than GPT-4o.
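For the "validating the output with Python" part, one option is to define the expected structure as Pydantic models and reject (or retry) responses that do not conform. A minimal sketch follows; the element types and field names are assumptions for illustration, not a finalised Demokratis schema.

```python
# Sketch of output validation with Pydantic (v2). Field names are assumed.
from typing import Literal, Optional
from pydantic import BaseModel, ValidationError


class Element(BaseModel):
    type: Literal["section", "paragraph", "list_item", "footnote"]
    text: str


class DocumentStructure(BaseModel):
    elements: list[Element]


def validate_llm_output(raw_json: str) -> Optional[DocumentStructure]:
    """Return the parsed structure, or None if the LLM output does not conform."""
    try:
        return DocumentStructure.model_validate_json(raw_json)
    except ValidationError as err:
        # In a multi-prompt setup, the validation error could be fed back to the
        # model with a request to fix its output.
        print(err)
        return None
```

A validation failure like this could also drive the "more than one prompt" idea: the error message goes back to the model in a follow-up turn until the output parses cleanly.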