Open testingsw1 opened 2 years ago
This feature makes sense. However, "Editing the OCR text" steps on the toes of the versioning concept. Say you have a document X.pdf which you OCRed. You then have version 0 (the original) and version 1 with the OCRed text (of poor quality, i.e. some words were not detected properly). If you decide to edit the OCRed text, any later OCR run will discard your text "corrections". If you choose to "manually (re-)run OCR", the newer version (version 3, created when you clicked "run OCR") will again have "bad quality OCR with missing keywords". Similarly, if you choose to rotate one page within the document, the entire document will be OCRed again, which will result in a newer version (version 3) without the corrected text.
On the other hand, moving pages around, deleting pages, reordering pages, and merging documents will increase the document's version but will preserve the corrected text.
@testingsw1 If the trade-off described above sounds OK to you, then I am perfectly fine going forward and implementing this feature.
Thank you! Of course I will be fine with this. I think most of us keep these documents as archives and there is not a lot of versioning. This feature will really help! I have a scanner with Abbyy software that does a great job on some documents - with this feature I can easily replace wrong text in papermerge (just copy the correct text from Abbyy). Can't wait for the new version! Thank you again!
PS. Is it possible to delete unneeded versions via the console? I can get to the correct document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.
Yes - just keep in mind that I've never tested what happens when you delete a document version, and there is no such "official feature" :)
In shell>
In [2]: from papermerge.core.models import Document
In [3]: doc = Document.objects.first()
In [4]: doc_version = doc.versions.get(number=2)
In [5]: doc_version
Out[5]: <DocumentVersion: id=f6298a48-a991-4b8a-a75c-5fde6d899c4b number=2>
In [6]: doc_version.delete()
Out[6]: (3, {'core.Page': 2, 'core.DocumentVersion': 1})
Where number=2 is the version number of the document instance. The last line of the output above means that 3 objects were deleted from the database - one DocumentVersion and two associated Page model instances.
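If you want to check what you are about to delete first, the session above can be wrapped in a small helper. This is only a sketch: delete_version and dry_run are made-up names, not part of papermerge, and it relies only on the doc.versions.get(number=...) and .delete() calls shown above. Like the manual steps, it has not been tested against a real papermerge install.

```python
def delete_version(doc, number, dry_run=True):
    """Look up one version of `doc` by number and optionally delete it.

    Assumption: `doc` is a papermerge Document whose `versions` manager
    supports `get(number=...)`, as in the shell session above.
    `delete_version` and `dry_run` are hypothetical names.
    """
    # Raises the manager's "does not exist" error if there is no such version.
    version = doc.versions.get(number=number)
    if dry_run:
        # Report only: nothing is removed from the database.
        return version, None
    # Django's delete() returns (total_deleted, {model_label: count}).
    return version, version.delete()
```

In the shell, call delete_version(doc, 2) first to see which version would be removed, then delete_version(doc, 2, dry_run=False) to actually delete it.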
For me OCR results are not that good, and sadly sometimes really bad. I wish to have at least a few important keywords OCRed. Is it possible to add an option "Edit OCRed text" (a button in the GUI, under "View OCRed text", above Tags) so we can manually fix the text?