B-urb / doclytics

A document analyzer for paperless-ngx using ollama
MIT License

Using doclytics to update the "Content" of a document #97

Open hermanmak opened 2 months ago

hermanmak commented 2 months ago

Paperless-ngx uses an OCR engine that is not particularly good with languages like Chinese and Korean, and it seems to perform especially badly when multiple languages are present in the same document.

Multiple languages in the same document are extremely common in Hong Kong.

Could doclytics act as a bridge that applies LLMs to do the OCR instead of the built-in engine (or to overwrite its output)? For example, the newly available model minicpm-v is capable of OCR in multiple languages.

Thanks!

B-urb commented 1 month ago

Hi @hermanmak, thanks for bringing this to my attention. I wasn't aware that mixed-language documents are an issue for the paperless-ngx OCR engine. I did a bit of research and it seems like llava could be a good candidate for that. I will have a look into it.

hermanmak commented 1 month ago

I'm wondering, does the processing in doclytics do OCR on the raw image, or does it infer based on the text layer of the scanned document?

thiswillbeyourgithub commented 1 month ago

Hi, just discovered doclytics and I have to say I'm very interested.

I'm pretty dissatisfied with paperless's OCR quality (not their fault), especially because I tend to take pictures with my phone instead of using a proper scanner.

I wanted to add here that the llava models are pretty old now and there are much better models available. The most recent that comes to mind is the Qwen2-VL family. They exist in both large and small (2B) sizes, seem to be SOTA for OCR/vision (including handwriting!) and are multilingual.

Here's an OCR example: https://simonwillison.net/2024/Sep/4/qwen2-vl/

Here's their page: https://github.com/QwenLM/Qwen2-VL

The only issue is that they are not yet supported by ollama, because llama.cpp itself does not yet support them (there is an open llama.cpp issue tracking this).

Allow me to show an example using Qwen2-VL 72B:

Input image: [attached screenshot of an OpenRouter model page]

My prompt: Tell me the exact text in this image while preserving the formatting. Use a single markdown block for the entire document.

This resulted in the following markdown:

# Qwen2-VL 7B Instruct (free) 📦

| qwen/qwen-2-vl-7b-instruct:free 📦 |
|----------------------------------|
| Updated Aug 28                   | 32,768 context              | $0/M input tokens       | $0/M output tokens    | $0/K input imgs     |

### Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements:...

- Free 📦

### Model weights 📦

### Overview | Providers | Apps | Activity | Parameters | Uptime | API

### Providers for Qwen2-VL 7B Instruct (free)

OpenRouter [load-balances requests](https://docs.openrouter.ai/docs/developers/providers-and-pools) across providers
weighted by price unless you use [dynamic routing](https://docs.openrouter.ai/docs/developers/routing).

| **Hyperbolic** | Max Output | Input | Output | Latency | Throughput |
|----------------|------------|-------|--------|---------|------------|
| bf16 🚹 ⚙️   | 2,048      | $0    | $0     | 0.95s   | 98.49t/s   |

When passing it through rich: [attached screenshot of the markdown rendered with rich]

So yeah, this is much, much more readable than the regular OCR output, be it for humans or for LLMs, making it even easier to do an embedding search.

If you want to give it a try, I suggest getting an OpenRouter API key.
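
For anyone who wants to reproduce this without local hardware, here is a minimal sketch of what such a call could look like against OpenRouter's OpenAI-compatible chat completions endpoint. The model slug is the free 7B variant from the screenshot above; the API key and file name are placeholders, and this is just an illustration, not anything doclytics does today.

```python
import base64
import requests

OPENROUTER_API_KEY = "sk-or-..."  # placeholder: your own OpenRouter key
IMAGE_PATH = "scan.png"           # placeholder: path to the document photo

# Encode the image as a data URL, as expected by the OpenAI-style message format.
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen/qwen-2-vl-7b-instruct:free",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Tell me the exact text in this image while preserving the "
                     "formatting. Use a single markdown block for the entire document."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```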

thiswillbeyourgithub commented 1 month ago

Edit: I forgot to mention the minicpm-v model. It's super good at OCR even at 7B and seems to work on at least French and English.

B-urb commented 1 month ago

> I'm wondering, does the processing in doclytics do OCR on the raw image, or does it infer based on the text layer of the scanned document?

Right now it uses the content field of the document from the paperless-ngx API, so it works on the text extracted by the paperless OCR. I will evaluate the input of @thiswillbeyourgithub. Very interesting. I think this could improve the quality.
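
For context, a minimal sketch of what reading that field looks like (assuming a paperless-ngx instance at a placeholder URL with a placeholder API token; this is not doclytics code, just an illustration of the idea):

```python
import requests

PAPERLESS_URL = "http://localhost:8000"  # placeholder paperless-ngx instance
PAPERLESS_TOKEN = "..."                  # placeholder API token
DOC_ID = 123                             # placeholder document id

# Fetch the document and read the OCR text that paperless stored in "content".
resp = requests.get(
    f"{PAPERLESS_URL}/api/documents/{DOC_ID}/",
    headers={"Authorization": f"Token {PAPERLESS_TOKEN}"},
)
resp.raise_for_status()
print(resp.json()["content"][:500])  # first part of the extracted text
```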

thiswillbeyourgithub commented 4 weeks ago

Hi,

Here are some recent tests using ollama and minicpm-v:

Here's a random image from the internet: [attached scan of a chemical company invoice]

Here's the prompt I used: Just transcribe the text in this image and preserve the formatting and layout (high quality OCR). Do that for ALL the text in the image. Be thorough and pay attention. This is very important. The image is from a text document so be sure to continue until the bottom of the page. Thanks a lot! You tend to forget about some text in the image so please focus! Use markdown format. [PATH TO IMAGE]
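
For reference, here's a minimal sketch of how such a call can be made against a local ollama instance (this assumes ollama's /api/generate endpoint with base64-encoded images; the image path is a placeholder and the prompt is a shortened version of the one above):

```python
import base64
import requests

IMAGE_PATH = "invoice.jpg"  # placeholder: path to the scanned page

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Non-streaming generate call; vision models take base64 images in "images".
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "minicpm-v",
        "prompt": "Just transcribe the text in this image and preserve the "
                  "formatting and layout (high quality OCR). Use markdown format.",
        "images": [image_b64],
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```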

Here's the output:

**INVOICE**

---

### **CAYLEY CHEMICAL CORPORATION**
10 MANOR PARK, MACKENZIE WAY, CHELTENHAM, GLOS. GL51 9TX, ENGLAND  
TEL: 44 (0) 1242 222791, FAX: 44 (0) 1242 227634, E-MAIL: cayley@btinternet.com

---

### VAT No. 533 4091 66  
**Invoice Reference:** 1234  

---

#### Consignee:
ABC LTD  
FIRST STREET  
NAIROBI  
KENYA  

#### Purchaser (if not Consignee):
MOMBASA  

#### Remittance:  
TELEGRAPHIC REMITTANCE TO :  
BARCLAYS BANK PLC, CHELTENHAM, UK.  
Tax Point & Date: 01 December 2017 Country of Origin: KOREA  
C/A NO: 75993 SWEE: BARCBGB28  

#### A/C NO.: GBAN NO. CBAS4 BCARO 20120157 MY-5S99  

#### Place & Conditions Consignment from:   
**Port of Loading:**  
MSC ABIGAIL  
**Date:** (25.11.17)  
**Country to which consigned:**  
KOREA  

---

#### Port of Discharge:
MOMBASA

---

#### Marked and Numbers : Number kind of Packages Description of Goods:  

- **MT**
  - LABSA, NAIROBI
    - MARKS :
      LINEAR ALKYD BENZENE SULPHONIC ACID, LABSA 96%
        - Quantity:
          1 FCL, 80 X250 KG DRUMS NETT  
        - Selling Price & Value: 
          $/MT Amount  
          23,500.00

- **FOB SUSANAI TO MOMBASA**
    - MARKS :
      LINEAR ALKYD BENZENE SULPHONIC ACID, LABSA 96%
        - Quantity:
          FOB SUZUKI TO MOMBASA  
        - Selling Price & Value: 
          $/MT Amount  
          USD @0.5%  

---

#### TOTAL EXCL VAT

TOTAL : 25,000.00  

---

**DIRECTOR:** P J MILLS BSc  
**COMPANY SECRETARY:** MICHELLE JAMES

It even seems to work fine on vertical images: [attached screenshot of a Wikipedia-style infobox about Napoleon]

**Title:**
Napoleon

---

**Subtitle:**
The Emperor Napoleon in His Study at the Tuileries, 1812  
Emperor of the French  

---

**Body Text:**

- **1st successor:** Louis XVI [Y][4] (born 6 August 1754 – died 20 April 1793)
- **2nd successor:** Louis XVIII[5] (born 20 March 1755 – died 18 June 1816)  
First Consul of the French Republic, Second Consul
- **3rd successor:** Louis XVIII [Y][4]
- **4th successor:** First Consult of the French Republic

---

**Timeline:**
- Born: 20 March 1755 (born April–August), in Corsica  
Aguja, Corsica, Kingdom of France
- Died: 
  - May 8, 1821 (aged [49])
    Longwood Island,
    Saint Helena[6] (Longwood House)  
Carcavelos, Portugal
- Appointments:
  - First Consul of the French Republic, Second Consul

---

**Family Information:**
Spouses:
Josephine de Beauharnais (m. [1796; deceased]; born in Aix-en-Provence on December 23)
Marie Louise de Bourbon[4]  
Marriage to Marie Louise was dissolved after her death.

Issue:
Napoleon II

---

**Footer:**
More info... Napoleon

It can hallucinate things. For example, it says Napoleon was married to "Marie Louise de Bourbon" but the picture says "Marie Louise of Austria".

Here's my takeaway:

  1. This model allows very fast (depending on your setup), high-quality, layout-preserving OCR, at the cost of sometimes hallucinating things and sometimes not reading to the end of the image.
  2. This model is far from the best open-weight model, but it's the best we have on ollama. Llama 3 vision models are not yet supported. Qwen has, IIRC, state-of-the-art models, but they are not yet in ollama either.
  3. These tests should be seen as a lower bound of what we can achieve. I have no experience with minicpm in particular and there's probably a better way to prompt it. I also haven't used any chaining of prompts. We could for example ask it 3 times and keep the longest answer (this makes it more likely that it read to the end of the document).
  4. Or we could ask it to judge its own performance and then take the highest-rated result. I briefly tried "Rate how well you did on a scale 0-9. 0 means you failed and 9 means it's perfect and no text is missing. Reply only the number." and it's a promising way to squeeze out even better results: on the Napoleon mistake it rated itself at 8, and at 9 on an attempt without mistakes that I could find. (See the first sketch after this list.)
  5. On dense text it can hallucinate summaries, which is not necessarily a bad thing. Say you give it a long, dense text and it summarizes it with 25% of the lines missing, but it still gets the gist and rephrases some things. This could help the user find a document by searching for words that are not in the original document but are still part of the idea.
  6. In my mind, it would be best (until we get newer, better vision LLMs) not to overwrite the 'content' field with that output but instead to prepend it. Something like <ocr_content model=ollama/minicpm-v backend=doclytics date=[date] backend_version=[version]> [MINICPM OUTPUT] </ocr_content> [PREVIOUS OCR CONTENT]. This way we could update the OCR content as new models roll out or better prompts are found, without having to redo the original OCR. It would also allow switching the vision backend, should the need arise to use something other than doclytics. (A rough sketch follows at the end of this comment.)
  7. Doing that would also make it easier for other, text-only LLMs to provide tags, fill custom fields, etc.
  8. And it would probably vastly enhance embedding-search quality too. I don't yet know of any good apps to search paperless via embeddings, but I think it would work perfectly with that.
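
To make points 3 and 4 concrete, here is a rough sketch of that kind of chaining against a local ollama instance running minicpm-v (the /api/chat endpoint, the retry count, and the tie-breaking by length are my own assumptions, not anything doclytics implements):

```python
import base64
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
OCR_PROMPT = ("Just transcribe the text in this image and preserve the "
              "formatting and layout (high quality OCR). Use markdown format.")
RATING_PROMPT = ("Rate how well you did on a scale 0-9. 0 means you failed and 9 "
                 "means it's perfect and no text is missing. Reply only the number.")


def chat(messages: list[dict]) -> str:
    """One non-streaming round trip to a local ollama instance running minicpm-v."""
    resp = requests.post(
        OLLAMA_CHAT_URL,
        json={"model": "minicpm-v", "messages": messages, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]


def ocr_best_of(image_path: str, attempts: int = 3) -> str:
    """Run the OCR prompt several times, ask the model to rate each attempt,
    and keep the highest-rated one (ties broken by length, since a short answer
    often means the model stopped before the bottom of the page)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    scored = []
    for _ in range(attempts):
        ocr_msg = {"role": "user", "content": OCR_PROMPT, "images": [image_b64]}
        transcription = chat([ocr_msg])
        rating_reply = chat([
            ocr_msg,
            {"role": "assistant", "content": transcription},
            {"role": "user", "content": RATING_PROMPT},
        ])
        digits = [c for c in rating_reply if c.isdigit()]
        score = int(digits[0]) if digits else 0
        scored.append((score, len(transcription), transcription))

    return max(scored)[2]
```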

What do you think? Any ETA on when we could test this in doclytics?
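
And to make point 6 concrete, here is a rough sketch of prepending the wrapped transcription instead of overwriting the original OCR (this assumes the content field can be updated via a PATCH on the paperless-ngx documents endpoint; URL, token, and version string are placeholders):

```python
import datetime
import requests

PAPERLESS_URL = "http://localhost:8000"  # placeholder paperless-ngx instance
PAPERLESS_TOKEN = "..."                  # placeholder API token


def prepend_ocr(doc_id: int, llm_text: str, model: str = "ollama/minicpm-v",
                backend_version: str = "0.0.0") -> None:
    """Wrap the LLM transcription in a tagged block and prepend it to the
    existing OCR content, so the original text is never lost."""
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    doc_url = f"{PAPERLESS_URL}/api/documents/{doc_id}/"

    current = requests.get(doc_url, headers=headers)
    current.raise_for_status()
    old_content = current.json()["content"]

    block = (
        f"<ocr_content model={model} backend=doclytics "
        f"date={datetime.date.today().isoformat()} "
        f"backend_version={backend_version}>\n{llm_text}\n</ocr_content>\n\n"
    )

    update = requests.patch(doc_url, headers=headers,
                            json={"content": block + old_content})
    update.raise_for_status()
```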

thiswillbeyourgithub commented 4 weeks ago

Btw, here's an update on qwen2 VL 3B support in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/9246#issuecomment-2425169859. It should come to ollama rather quickly once it's merged.

hermanmak commented 4 weeks ago

I think Ollama also supports more streamlined use of Hugging Face GGUF models now. Basically, GGUF models on Hugging Face can come prepackaged with their model templates; previously the hard part was finding or writing the model template before you could use a GGUF model.

-herman


icereed commented 2 weeks ago

Hello 👋

For interested folks who want to play around with LLM OCR and paperless: an experimental feature is available in paperless-gpt. Please follow this issue to learn more: https://github.com/icereed/paperless-gpt/issues/20