ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)
https://papermerge.com
Apache License 2.0
2.54k stars, 266 forks

Quick question about OCR support #635

Open deajan opened 2 days ago

deajan commented 2 days ago

Hello,

I recently tried paperless-ngx and found that it does not fit my use cases. Mostly, I've spent some time developing EasyOCR support for paperless-ngx, only to find out that the developers aren't fond of supporting alternative OCR engines.

As far as I understand, Papermerge uses Tesseract? Is there any plugin system or something else where I could plug in another OCR engine, given that it handles hOCR?

Thanks. Side question: does Papermerge handle user/group permissions? If so, can they be assigned automatically to new documents, according to tags or something similar?

ciur commented 1 day ago

Yes, Papermerge uses Tesseract.

> Is there any plugin system or something else where I could plug in another OCR engine, given that it handles hOCR?

Well, not really. There is no official "plug-in system". But the coupling with Tesseract is very thin, and it is easy to add support for almost any OCR engine.

Basically, the OCR part is a separate application, called OCR-worker, which is connected to the main app only via Celery messages.

The whole dependency on the OCR engine is just this module: https://github.com/papermerge/ocr-worker/blob/main/ocrworker/ocr.py (of course, I don't count system dependencies, which are assumed to be present in the Docker image). The OCR entry points are in the tasks.py module.
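To illustrate how thin such a coupling can be, here is a minimal sketch of an engine-agnostic wrapper. It is not the actual ocr-worker code: `register_engine`, `ocr_page`, and the `fake` engine are hypothetical names invented for this example, and the only assumption taken from the thread is that an engine produces hOCR for a page image.

```python
from typing import Callable, Dict

# Hypothetical contract: an engine is any callable that takes an image path
# and a language code and returns hOCR markup as a string.
OcrEngine = Callable[[str, str], str]

_ENGINES: Dict[str, OcrEngine] = {}


def register_engine(name: str) -> Callable[[OcrEngine], OcrEngine]:
    """Decorator that makes an engine selectable by name."""
    def decorator(func: OcrEngine) -> OcrEngine:
        _ENGINES[name] = func
        return func
    return decorator


@register_engine("fake")
def fake_engine(image_path: str, lang: str) -> str:
    """Stand-in engine; a real one would shell out to Tesseract or call EasyOCR."""
    return (
        f'<div class="ocr_page" title="image {image_path}">'
        f'<span class="ocrx_word" lang="{lang}">hello</span></div>'
    )


def ocr_page(image_path: str, lang: str, engine: str = "fake") -> str:
    """Single place where the rest of the worker touches an OCR engine."""
    if engine not in _ENGINES:
        raise ValueError(f"unknown OCR engine: {engine!r}")
    hocr = _ENGINES[engine](image_path, lang)
    # Cheap sanity check that the engine really returned an hOCR page.
    if "ocr_page" not in hocr:
        raise ValueError("engine did not return hOCR output")
    return hocr
```

With a layout like this, a task function would only call `ocr_page(...)`, and adding another engine would mean registering one more callable.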

> Side question: does Papermerge handle user/group permissions? If so, can they be assigned automatically to new documents, according to tags or something similar?

Well, yes and no.

Yes, Papermerge handles users, groups, and permissions, but not in the sense you probably mean.

Your question, I guess, is about permissions per object/resource (in this sense, a specific document or folder). No. Per-object (resource, folder, document) permissions are not there yet. I will add them at the beginning of 2025.

deajan commented 1 day ago

Thank you for your quick reply. I've worked with OCRmyPDF to make EasyOCR work under Celery and headless. I guess this work would make it compatible with Papermerge.

Would you mind briefly explaining the permission system in Papermerge? My use case is sharing documents with my family:

Is that something I can achieve with Papermerge easily?

ciur commented 1 day ago

> My use case is sharing documents with my family: .... Is that something I can achieve with Papermerge easily?

No. Not now. Currently, permissions exist to limit users to specific URLs (the technical term is "endpoints"). In other words, you can currently say: "user coco does not have permission to access GET /groups/, POST /groups/, GET /groups/", while coco still has access to "GET /nodes/, GET /documents/", and so on. When you define permissions, there is no concept of a specific document. You can grant a user access either to all documents or to none, to all groups or to none, to all folders or to none.

As I mentioned above, per-object permissions (your case, where you want to grant access to a specific folder or document) will come soon, at the beginning of 2025 (I think it will be February 2025).
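The all-or-nothing behaviour described above can be sketched roughly as follows. This is only an illustration, not Papermerge's implementation: the `User` class, the `"METHOD /path/"` scope strings, and `can_access` are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    username: str
    # Hypothetical scope strings of the form "METHOD /endpoint/".
    scopes: Set[str] = field(default_factory=set)


def can_access(user: User, method: str, path: str) -> bool:
    """Endpoint-level check: the decision never depends on which object is requested."""
    # Reduce e.g. "/documents/42/" to its collection endpoint "/documents/".
    parts = [p for p in path.split("/") if p]
    endpoint = f"/{parts[0]}/" if parts else "/"
    return f"{method.upper()} {endpoint}" in user.scopes


coco = User("coco", {"GET /nodes/", "GET /documents/"})

# coco can read every document, because the check ignores the document id.
print(can_access(coco, "GET", "/documents/1/"))
print(can_access(coco, "GET", "/documents/2/"))
# coco has no access to groups at all.
print(can_access(coco, "GET", "/groups/"))
```

Sharing only one folder with one family member would need a check that also looks at the object id, which is exactly the per-object permission that is not available yet.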

deajan commented 1 day ago

Thank you for the insight :) I'll see if I can chip in a bit of time to integrate EasyOCR into Papermerge, since its results are generally superior to Tesseract's.