NationalLibraryOfNorway / meteor

A Python module and REST API for automatic extraction of metadata from PDF files
Apache License 2.0

WIP: Metadata extraction using LLM API service #29

Open osma opened 3 months ago

osma commented 3 months ago

This PR contains an initial, rough implementation of the LLM-based metadata extraction described in #21. Because some functionality is still missing and there are uncertainties in the implementation, I'm leaving it as a draft.

The initial prototyping was done in this Jupyter notebook, which has essentially the same functionality; in this PR, the code from the notebook has been retrofitted into the Meteor codebase.

How it works

This code adds a new LLMExtractor class that performs the main work of metadata extraction by calling an LLM API service such as llama.cpp running locally.
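As a rough illustration of the idea (not the exact code in this PR; the endpoint, prompt and class interface here are assumptions), an extractor along these lines posts the text extracted from the PDF to the server's OpenAI-compatible chat endpoint and parses the JSON metadata returned by the fine-tuned model:

import json
import os

import requests


class LLMExtractor:
    """Sketch: extract metadata by calling an LLM API service such as a local llama.cpp server."""

    def __init__(self, api_url=None):
        # LLM_API_URL is the environment variable used in this PR, e.g. http://localhost:8080
        self.api_url = api_url or os.environ["LLM_API_URL"]

    def extract(self, pdf_text):
        # llama.cpp's server offers an OpenAI-compatible chat endpoint; the actual
        # endpoint, prompt and parameters used by Meteor may differ from this sketch.
        response = requests.post(
            f"{self.api_url}/v1/chat/completions",
            json={
                "messages": [
                    {"role": "system", "content": "Extract bibliographic metadata as JSON."},
                    {"role": "user", "content": pdf_text},
                ],
                "temperature": 0.0,
            },
            timeout=120,
        )
        response.raise_for_status()
        content = response.json()["choices"][0]["message"]["content"]
        # The fine-tuned model is expected to answer with a JSON object of metadata fields.
        return json.loads(content)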

How it looks

There is a new select element for choosing the backend:

[Screenshot: the new backend selection element in the Meteor web UI]

How to test it

  1. Install llama.cpp on your computer (on Linux, using CPU only: git clone the repository and run make to compile it)
  2. Download a fine-tuned model such as NatLibFi/Qwen2-0.5B-Instruct-FinGreyLit-GGUF in GGUF format, i.e. Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf
  3. Start the llama.cpp server using the GGUF model: ./llama-server -m Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf and leave it running
  4. Set the environment variable LLM_API_URL to point to the llama.cpp server API endpoint: export LLM_API_URL=http://localhost:8080 (or edit the .env file)
  5. Start up Meteor from this branch and run it normally. In the UI, select "LLMExtractor" as the extraction method. If calling the API instead, set the parameter backend=LLMExtractor. (Steps 1-4 are consolidated into the shell sketch below.)
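
For convenience, steps 1-4 roughly amount to the following shell session (a sketch; the clone URL and the model download URL are assumed from the project names above):

# 1. build llama.cpp (CPU only)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# 2. download the fine-tuned GGUF model from Hugging Face (URL assumed from the repo name)
wget https://huggingface.co/NatLibFi/Qwen2-0.5B-Instruct-FinGreyLit-GGUF/resolve/main/Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf
# 3. start the llama.cpp server and leave it running
./llama-server -m Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf
# 4. in another shell, point Meteor at the server
export LLM_API_URL=http://localhost:8080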

Example using a Norwegian document written in English:

$ time curl -d fileUrl=https://www.ssb.no/forside/_attachment/453137 -d backend=LLMExtractor http://127.0.0.1:5000/json
{"year":{"origin":{"type":"LLM"},"value":"2021"},"language":{"origin":{"type":"LLM"},"value":"eng"},"title":{"origin":{"type":"LLM"},"value":"Family composition and transitions into long-term care services among the elderly"},"publisher":{"origin":{"type":"LLM"},"value":"Statistics Norway"},"publicationType":null,"authors":[{"origin":{"type":"LLM"},"firstname":"Astri","lastname":"Syse"},{"origin":{"type":"LLM"},"firstname":"Alyona","lastname":"Artamonova"},{"origin":{"type":"LLM"},"firstname":"Michael","lastname":"Thomas"},{"origin":{"type":"LLM"},"firstname":"Marijke","lastname":"Veenstra"}],"isbn":null,"issn":null}
real    0m19.678s
user    0m0.000s
sys 0m0.015s

Here is the same metadata as shown in the Meteor web UI:

[Screenshot: the extracted metadata shown in the Meteor web UI]

As far as I can tell, this metadata is correct, except that the LLM for some reason didn't pick up the ISSN on page 3. But this was using the relatively stupid small Qwen2-0.5B-based model, not the larger Mistral-7B-based model, which gives much better quality responses.

Here is the same document again, but this time the LLM is the larger Mistral-7B-based model, quantized to Q6_K GGUF format and running on a V100 GPU using llama.cpp, with all 33 layers offloaded to the GPU, requiring around 12.5 GB of VRAM.
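
The server invocation for such a setup would look roughly like this (a sketch; the model filename is hypothetical, and -ngl is llama.cpp's option for the number of layers to offload to the GPU):

# hypothetical filename for the Q6_K-quantized Mistral-7B FinGreyLit model
./llama-server -m Mistral-7B-Instruct-FinGreyLit-Q6_K.gguf -ngl 33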

$ time curl -d fileUrl=https://www.ssb.no/forside/_attachment/453137 -d backend=LLMExtractor http://127.0.0.1:5000/json
{"year":{"origin":{"type":"LLM"},"value":"2021"},"language":{"origin":{"type":"LLM"},"value":"eng"},"title":{"origin":{"type":"LLM"},"value":"Family composition and transitions into long-term care services among the elderly"},"publisher":{"origin":{"type":"LLM"},"value":"Statistics Norway"},"publicationType":null,"authors":[{"origin":{"type":"LLM"},"firstname":"Astri","lastname":"Syse"},{"origin":{"type":"LLM"},"firstname":"Alyona","lastname":"Artamonova"},{"origin":{"type":"LLM"},"firstname":"Michael","lastname":"Thomas"},{"origin":{"type":"LLM"},"firstname":"Marijke","lastname":"Veenstra"}],"isbn":null,"issn":{"origin":{"type":"LLM"},"value":"1892-753X"}}
real    0m3.358s
user    0m0.002s
sys     0m0.003s

Note that the request now completed in 3.4 seconds (including downloading the PDF), and this time the ISSN was successfully extracted as well.

Missing functionality

Code/implementation issues

Other potential problems

Fixes #21

osma commented 3 months ago

I've added a few more features to the original draft implementation (better configuration handling and the ability to select the backend in the web UI and in API calls). I've updated the OP accordingly.