eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Rotate / flip PDFs? #554

Open bjeanes opened 3 years ago

bjeanes commented 3 years ago

I have a bunch of documents which are upside down (but which I didn't realise until Docspell slurped them in and previewed them).

Have you thought about adding an option to rotate a PDF? Eventually, you might even be able to use some heuristics to propose flips/rotations automatically (e.g. if you can't OCR much text but can after rotating, or by training a model to detect upside-down text).
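To make the "OCR after rotating" heuristic concrete, here is a minimal sketch in Python. It is not how Docspell works: the page is modelled as a plain 2D grid, and `ocr` and `score` are hypothetical callables that the caller must supply (for example a wrapper around whatever OCR engine is in use). The idea is to try all four orientations and propose the one whose OCR result scores highest.

```python
from typing import Callable, List, Tuple

Page = List[List[int]]  # hypothetical page model: rows of pixel values


def rotate_cw(page: Page) -> Page:
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*page[::-1])]


def default_score(text: str) -> float:
    """Assumed scoring: the number of alphanumeric characters recognised."""
    return float(sum(ch.isalnum() for ch in text))


def best_rotation(
    page: Page,
    ocr: Callable[[Page], str],
    score: Callable[[str], float] = default_score,
) -> Tuple[int, str]:
    """Return (clockwise degrees, text) for the best-scoring orientation.

    Ties are resolved in favour of the smaller rotation, so an already
    readable page stays at 0 degrees.
    """
    candidates = []
    current = [list(row) for row in page]
    for degrees in (0, 90, 180, 270):
        text = ocr(current)
        candidates.append((score(text), degrees, text))
        current = rotate_cw(current)
    best = max(candidates, key=lambda c: c[0])
    return best[1], best[2]
```

A character count is a weak signal, since upside-down pages can still yield plenty of garbage characters; a dictionary-based score would probably be more robust, which is why `score` is pluggable.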

eikek commented 3 years ago

Thanks for the suggestion! Yes, I thought about it and it is something I would like to add. But tbh it is not at the top of the list currently, which may change if people want this more than other things. I would like to be able to remove blank pages and rotate the PDF pages manually. This was one reason for converting everything to PDF, so it can be manipulated later without changing the original. And whatever can be done automatically, should be :-) I would also think that it should be possible to automate rotation.

eresturo commented 3 years ago

Another idea could be automatic page trimming.

Or automatic blank page detection: I experimented with different methods in my scanner script. In the end I just used a simple threshold on the standard deviation of the image pixels: https://github.com/eresturo/scanadf2docspell/blob/5aa3d05c3669c4715db3b24400226d9db42d1c4f/src/preprocessor.py#L29 It works quite well on my documents. The threshold could be configurable, and empty pages could be just "hidden" instead of removed.
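For readers who do not want to open the link, the idea can be sketched roughly as follows. This is an illustration only, not a copy of the linked script: the flat list of grayscale values as input and the default threshold of 5.0 are assumptions made for the example.

```python
from statistics import pstdev
from typing import Iterable, List, Sequence


def is_blank_page(pixels: Iterable[int], threshold: float = 5.0) -> bool:
    """Treat a page as blank if its grayscale values barely vary.

    `pixels` is a flat iterable of grayscale values (0..255); the default
    `threshold` is a made-up value that would need tuning per scanner.
    """
    values = list(pixels)
    if not values:
        return True
    return pstdev(values) < threshold


def flag_blank_pages(
    pages: Sequence[Iterable[int]], threshold: float = 5.0
) -> List[bool]:
    """Flag blank pages instead of dropping them, so they can be hidden."""
    return [is_blank_page(page, threshold) for page in pages]
```

Returning flags rather than a filtered list matches the "hidden instead of removed" idea: nothing is deleted, and a wrong guess can be undone.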

bjeanes commented 3 years ago

Yeah that would be nice too. Fortunately my scanner does empty page removal for me so that isn't something I thought about.

eikek commented 3 years ago

Yes, that would be nice indeed. A step after pdf conversion could do all this. What I think would be nice too, is to be able to split pdfs based on some stamp or sign that indicates the last page (or a separator page).
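The separator-page variant of that splitting step could be sketched like this in Python. It is a hedged illustration, not Docspell code: pages are opaque items, and `is_separator` is a hypothetical predicate (for example a check for a known stamp or barcode) supplied by the caller.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def split_on_separators(
    pages: Sequence[P], is_separator: Callable[[P], bool]
) -> List[List[P]]:
    """Split a page sequence into documents at separator pages.

    Separator pages are dropped, and consecutive separators do not
    produce empty documents.
    """
    documents: List[List[P]] = []
    current: List[P] = []
    for page in pages:
        if is_separator(page):
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents
```

For the "stamp on the last page" variant, the marked page would be appended to `current` before the document is closed instead of being dropped.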

vakilando commented 3 years ago

this is a really good idea, would be cool!

> split pdfs based on some stamp or sign that indicates the last page (or a separator page).

eikek commented 2 years ago

Just a small update: parts of this can now be achieved by using the rotate-pdf-addon (https://github.com/docspell/rotate-pdf-addon). It is still a feature I would like to have "first class" in docspell, but until then the addon is an alternative that can be used right now.

dariuszszyc commented 2 years ago

Either I am doing something wrong or this alternative doesn't solve the problem. My issue is that my original document has the wrong rotation, so OCR couldn't really understand the text.

I did use the addon you mentioned, but it rotates only the processed/result PDF, not the original. When I use re-processing (to get the correct OCR text after rotating it properly), it still uses the incorrectly rotated original.

What I'd like to achieve is rotating the original, then re-processing it.

eikek commented 2 years ago

@dariuszszyc hm, the addon should also overwrite the extracted text in docspell so that you can use fulltext search etc. Does this not work (without an additional reprocess)? The original file will never be touched, though. But the "converted" file should be rotated and the extracted text should be updated as well.

dariuszszyc commented 2 years ago

@eikek it didn't work for me. I did a few more tests and here are the results:

  1. First I uploaded original document (jpg file with incorrect rotation) - OCR couldn't recognize the text properly

  2. I used the rotate addon - the converted PDF got rotated, but the extracted text didn't change

  3. Also, I made a copy of the jpg file, rotated it with the Windows Photos app (then, to be sure, I checked with Paint that it was rotated properly) and uploaded it. As a result, the "original" jpg file is rotated properly, but the converted PDF is rotated incorrectly (as it was originally in point 1).

  4. However, when I took the properly rotated jpg file from point 3, opened it in Paint, added a single dot anywhere and uploaded it, both the original file and the converted PDF were rotated properly (the rotation wasn't changed as happened in point 3) and OCR recognized the text properly.

eikek commented 2 years ago

Thank you for these details, @dariuszszyc . I think point 2 is a bug then, I need to look into it.

Points 3 and 4: With JPG, the orientation is often stored as metadata (kind of), and viewers will either interpret it or not. Some tools won't really rotate the image, but only change the orientation setting. When you edit the image data somehow (as when you added a single dot), the tool is required to store it anew. Could you maybe send me an example jpg file so I can reproduce this?
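To illustrate the metadata point: the EXIF `Orientation` tag takes values 1 to 8, and a viewer that honours it applies the matching transform before display, while the stored pixels stay untouched. The sketch below models an image as a plain 2D grid and shows what each value means. It is only an illustration of the tag's semantics, not of what Docspell or any particular viewer does internally.

```python
from typing import List

Grid = List[List[int]]  # hypothetical image model: rows of pixel values


def rotate_cw(grid: Grid) -> Grid:
    """Rotate a 2D pixel grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def transpose(grid: Grid) -> Grid:
    """Mirror a grid along its main diagonal."""
    return [list(row) for row in zip(*grid)]


def apply_exif_orientation(grid: Grid, orientation: int) -> Grid:
    """Return the pixels as they should be displayed for an EXIF orientation."""
    transforms = {
        1: lambda g: [list(row) for row in g],        # already upright
        2: lambda g: [list(row[::-1]) for row in g],  # mirrored left-right
        3: lambda g: rotate_cw(rotate_cw(g)),         # rotated 180 degrees
        4: lambda g: [list(row) for row in g[::-1]],  # mirrored top-bottom
        5: transpose,                                 # main diagonal flip
        6: rotate_cw,                                 # needs 90 degrees clockwise
        7: lambda g: rotate_cw(rotate_cw(transpose(g))),  # anti-diagonal flip
        8: lambda g: rotate_cw(rotate_cw(rotate_cw(g))),  # needs 90 degrees counterclockwise
    }
    if orientation not in transforms:
        raise ValueError(f"invalid EXIF orientation: {orientation}")
    return transforms[orientation](grid)
```

A tool that only rewrites the tag (say from 1 to 3) leaves the pixel grid as it was, so any consumer that ignores the tag still sees the old orientation. That would be consistent with the behaviour described in point 3.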

eikek commented 2 years ago

Also maybe we can use a new ticket for this problem here - I just created one https://github.com/docspell/rotate-pdf-addon/issues/1 copying your notes.

dariuszszyc commented 2 years ago

Forgive my stupid question (I'm not an advanced user): how can I share the jpg with you so that it's available only to you (and not visible here)?

eikek commented 2 years ago

No worries! You can send me an e-mail or chat with me privately on Matrix (see readme). Of course, if you can just create a new file with some sample content, then you could also post it here.

dariuszszyc commented 2 years ago

Sample files:

  1. Original (incorrect rotation)

  2. Rotated with Windows Photos app (probably changed orientation only in metadata)

  3. Screenshot of a properly-rotated file (point 2) - it's basically a new file - not just changed metadata.

Results from Docspell below. Please keep in mind my OCR is set to Polish, so you might see some Polish characters in the extracted content.

1. Original (incorrect rotation)

Extracted content

"7noś Jojsnf
JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)
![1 orig_processed](https://user-images.githubusercontent.com/41972270/185057908-96981b6e-91c6-4271-b2df-fb6361b8f552.jpg)

2. Rotated with Windows Photos app (probably changed orientation only in metadata)
All apps display this picture in a correct orientation, but in docspell I see the original one (doesn't follow the metadata orientation?).

Extracted content

"7noś Jojsnf JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)
![2  rotated_processed](https://user-images.githubusercontent.com/41972270/185058681-31387349-8e5c-473b-8de4-4dd4c701fa43.jpg)

3. Screenshot of a properly-rotated file (point 2) - it's basically a new file - not just changed metadata.
Extracted content

The Den

A cozy corner ofthe Web filled with content justforyou.


Processed PDF (picture of it)
![3  properly rotated_processed](https://user-images.githubusercontent.com/41972270/185058315-e38acefd-f1d0-44ae-b619-b92069df511a.jpg)

eikek commented 2 years ago

Thank you @dariuszszyc - I'll test with these.

madduck commented 1 year ago

Linking #1437