ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)

https://papermerge.com

Apache License 2.0

2.55k stars, 267 forks

Edit OCRed text #466

Open testingsw1 opened 2 years ago

testingsw1 commented 2 years ago

For me, the OCR results are not that good, and sadly sometimes really bad. I would like to have at least a few important keywords OCRed correctly. Is it possible to add an "Edit OCRed text" option (a button in the GUI, under "View OCRed text" and above Tags) so we can fix the text manually?

ciur commented 2 years ago

This feature makes sense. However, editing the OCR text steps on the toes of the versioning concept. Say you have a document X.pdf which you OCRed. You then have version 0 (the original) and version 1 with the OCRed text (of poor quality, i.e. some words were not detected properly). If you decide to edit the OCRed text, any later OCR run will discard your text corrections. If you choose to manually (re-)run OCR, the newer version (version 3, created when you clicked "run OCR") will again have bad-quality OCR with missing keywords. Similarly, if you rotate one page within the document, the entire document will be OCRed again, which will result in a newer version (version 3) without the corrected text.

On the other hand, moving pages around, deleting pages, reordering pages, and merging documents will increment the document's version but will preserve the corrected text.
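To make the trade-off concrete, here is a toy model in plain Python. It is only an illustration and not Papermerge code: the operation names are made up, and it assumes that every operation, including a text edit, creates a new version.

```python
# Toy model of the trade-off described above. NOT Papermerge code:
# the operation names below are hypothetical labels for this sketch.
RERUNS_OCR = {"run_ocr", "rotate_page"}  # these re-OCR the whole document
KEEPS_TEXT = {"move_pages", "delete_pages", "reorder_pages", "merge_documents"}


def apply_ops(ops, ocr_text="bad OCR"):
    """Return the text of each version; the list index is the version number."""
    # Version 0 is the original (no text), version 1 is the first OCR run.
    versions = [None, ocr_text]
    for op in ops:
        if op.startswith("edit_text:"):
            # Assumption: a manual edit is stored as a new version.
            versions.append(op.split(":", 1)[1])
        elif op in RERUNS_OCR:
            # OCR runs again, so manual corrections are lost.
            versions.append(ocr_text)
        elif op in KEEPS_TEXT:
            # New version, but the text is carried over unchanged.
            versions.append(versions[-1])
        else:
            raise ValueError(f"unknown operation: {op}")
    return versions


history = apply_ops(["edit_text:fixed keywords", "run_ocr"])
for number, text in enumerate(history):
    print(number, text)
```

In this run, version 2 holds the corrected text and version 3 (after "run_ocr") is back to the poor OCR output, which is the situation described above.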

@testingsw1 If the trade-off described above sounds OK to you, then I am perfectly fine with going forward and implementing this feature.

testingsw1 commented 2 years ago

Thank you! Of course I will be fine with this. I think most of us keep these documents as archives, so there is not a lot of versioning. This feature will really help! I have a scanner with Abbyy software that does a great job on some documents; with this feature I can easily replace wrong text in Papermerge (just copy the correct text from Abbyy). Can't wait for the new version! Thank you again!

PS. Is it possible to delete unneeded versions via the console? I can get to the correct document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.

ciur commented 2 years ago

> PS. Is it possible to delete unneeded versions via the console? I can get to the correct document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.

Yes, just keep in mind that I've never tested what happens when you delete a document version, and there is no such "official feature" :)

In the shell:

In [2]: from papermerge.core.models import Document
In [3]: doc = Document.objects.first()
In [4]: doc_version = doc.versions.get(number=2)
In [5]: doc_version
Out[5]: <DocumentVersion: id=f6298a48-a991-4b8a-a75c-5fde6d899c4b number=2>
In [6]: doc_version.delete()
Out[6]: (3, {'core.Page': 2, 'core.DocumentVersion': 1})

Here number=2 is the version number within the document instance. The last line of the output above means that 3 objects were deleted from the database: one DocumentVersion and two associated Page model instances.
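Since deleting a version is untested and unofficial, a small wrapper with a dry-run default may be handy. The sketch below is not part of Papermerge; `delete_version` and `summarize_delete` are hypothetical helpers that rely only on the calls shown in the session above (`doc.versions.get(number=...)` and `.delete()`) and on the `(total, {label: count})` tuple that `.delete()` returns.

```python
def summarize_delete(result):
    """Turn a (total, {label: count}) delete() result into one readable line."""
    total, per_model = result
    parts = ", ".join(
        f"{count} x {label}" for label, count in sorted(per_model.items())
    )
    return f"{total} objects deleted ({parts})"


def delete_version(doc, number, confirm=False):
    """Delete one version of `doc`; nothing is deleted unless confirm=True.

    Hypothetical helper: `doc` is expected to behave like the Document
    instance from the session above.
    """
    version = doc.versions.get(number=number)  # same lookup as In [4] above
    if not confirm:
        return f"dry run: would delete {version!r}"
    return summarize_delete(version.delete())


# The result tuple from the session above, summarized:
print(summarize_delete((3, {"core.Page": 2, "core.DocumentVersion": 1})))
```

In the shell, `delete_version(doc, 2)` would only report which version it found, and `delete_version(doc, 2, confirm=True)` would actually delete it. Taking a database backup first is a sensible precaution.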