R0Wi-DEV / workflow_ocr

This is a Nextcloud Workflow App which enables you to process files via OCR on serverside.
GNU Affero General Public License v3.0

OCR Overwrites digitally signed files #220

Open farhills opened 1 year ago

farhills commented 1 year ago

Describe the bug

Files with a digital signature are being overwritten, deleting the digital seal (leaving just the image of the signature)

System

How to reproduce

Steps to reproduce the behavior:

  1. create a pdf and apply a digital signature
  2. allow cron to run
  3. signature is deleted from document

Screenshots

Additional context

I've deleted the OCR rule for 'file modified', but in my typical workflow I print to PDF and immediately sign, so the files are captured in the queue and often don't get processed until after they've been signed.

It would be great if we could detect if a file is signed and skip it.

I've also commented on ocrmypdf #1040 as I recognize this issue may be more appropriately directed toward that project.

https://github.com/ocrmypdf/OCRmyPDF/issues/1040

R0Wi commented 1 year ago

Hi @farhills, and thanks for reporting this. I'm afraid you're right: this issue seems to be related to ocrmypdf rather than to this app. The app itself doesn't touch the contents of the converted files; it only creates a new file version in NC with the result of the ocrmypdf conversion.

As far as I understand, the tool technically cannot preserve a valid digital PDF signature, since it changes the document's content, which invalidates any signature.

One way would be to tell ocrmypdf to again sign the document after the process (which is currently not possible AFAIK). If it's possible to check if a pdf is signed or not, we could also add an option "Skip signed pdf" to the app itself.
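A minimal sketch of what such a "is this PDF signed?" check could look like, using only the Python standard library. It is a heuristic, not a validator: it assumes the signature dictionary is stored uncompressed and carries a `/ByteRange` array of four integers, and the function names `looks_signed` and `file_looks_signed` are hypothetical, not part of this app or of ocrmypdf.

```python
import re

# Assumption: a signed PDF contains an uncompressed signature dictionary with
# a /ByteRange array of four integers. This only detects that marker; it does
# not verify that the signature is valid.
_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*\d+\s+\d+\s+\d+\s+\d+\s*\]")


def looks_signed(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes appear to contain a digital signature."""
    return _BYTE_RANGE.search(pdf_bytes) is not None


def file_looks_signed(path: str) -> bool:
    """Read a file from disk and apply the same heuristic (hypothetical helper)."""
    with open(path, "rb") as fh:
        return looks_signed(fh.read())
```

A "Skip signed pdf" option could call a check like this before handing the file to ocrmypdf. A PDF parsing library would be more robust than scanning raw bytes.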

If you're able to sign your documents via CLI, you could also try to chain the OCR workflow with the external command workflow.

farhills commented 1 year ago

Thanks. As I wrote the issue I realized it would be the underlying library that has to deal with this. My professional organization has teamed up with a very closed-source certificate authority, and there's no CLI option for signing. The process is heavily locked down.

I'll mark the issue as closed. If ocrmypdf adds a new switch '--skip-signed' or similar I'll open a new feature request here to tap into that functionality. Thanks!

farhills commented 1 year ago

And just like that it's been fixed! OCRmyPDF v14.4.0 and later preserve digital signatures by default. Earlier versions clobber the signature without warning.

OCRmyPDF cannot preserve digital signatures in PDFs and also add OCR to them.
By default, it will refuse to modify a signed PDF regardless of other settings. You can
override this behavior with ``--invalidate-digital-signatures``; as the name suggests,
any digital signatures will be invalidated.

OCRmyPDF cannot open documents that are encrypted with a digital certificate.

Versions of OCRmyPDF prior to 14.4.0 would invalidate existing digital signatures
without warning.

https://github.com/ocrmypdf/OCRmyPDF/commit/a371655052a488c59b82ae659642bc76f57c1399

R0Wi commented 1 year ago

Thanks for letting us know! Sounds like we might want to introduce an additional switch for the digital signature behaviour.

farhills commented 1 year ago

In my use case, digitally signed documents should never be changed, even if the document OCR is imperfect or incomplete. These files represent final outputs, and need to be retained unmodified.

When OCR is complete, a new file is saved, so the digital signature is lost (as opposed to editing a signed file, where the signature is retained but made invalid by the edit).

I would, at most, add the --invalidate-digital-signatures flag only for the 'Force OCR' option. Safer for the user, but a bit more work for you, would be an opt-in UI checkbox 'include digitally signed files'. Either way, there needs to be a warning to the user that the signed file will be replaced by the OCR output, and the signature will be permanently lost.
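The opt-in idea above can be sketched as a small argument builder. This is an illustration under stated assumptions, not the app's actual code: `build_ocrmypdf_args` and its `include_signed` parameter are hypothetical names. The only ocrmypdf flags used are `--force-ocr` and `--invalidate-digital-signatures`, the latter as quoted from the OCRmyPDF documentation earlier in this thread.

```python
from typing import List


def build_ocrmypdf_args(src: str, dst: str, *, force_ocr: bool = False,
                        include_signed: bool = False) -> List[str]:
    """Build an ocrmypdf command line (hypothetical helper).

    The signature-destroying flag is added only when the user explicitly
    opted in, so the default keeps signed files untouched.
    """
    args = ["ocrmypdf"]
    if force_ocr:
        args.append("--force-ocr")
    if include_signed:
        # Opt-in only: the OCR output will no longer carry a valid signature.
        args.append("--invalidate-digital-signatures")
    args += [src, dst]
    return args
```

Keeping the flag behind an explicit boolean means the default path never passes it, which matches the request that signed files stay unmodified unless the user has been warned and agreed.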

farhills commented 1 year ago

Some additional feedback - the app notifications need to be updated to catch and handle the no-output condition when processing a digitally signed file. IMO this can be done silently. Currently it throws an error in the browser and desktop client.

CLI output for the same file:

root@5ea6340167e7:/data/xxxxxxxxxxxxxx/files/Misc-JD/OCR-Testing# ocrmypdf 'Digital Signature Sample.pdf' sigoutput.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,         _sync.py:432
invalidating the signature.

R0Wi commented 1 year ago

Good catch, thanks for the hint. I think we need to properly recognize this situation and, instead of throwing an error, log an informational message, for example.
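A rough sketch of that idea, assuming the tool's captured output contains the `DigitalSignatureError` text shown in the CLI output above. The function name `handle_ocr_failure`, the logger name, and the returned status strings are all hypothetical; the app itself is not written in Python, so this only illustrates the control flow.

```python
import logging

logger = logging.getLogger("ocr_sketch")  # hypothetical logger name


def handle_ocr_failure(output: str) -> str:
    """Classify captured ocrmypdf output after a failed run.

    Assumption: a refused signed PDF can be recognized by the
    'DigitalSignatureError' marker in the captured output.
    """
    if "DigitalSignatureError" in output:
        # Expected condition: log quietly and do not notify the user.
        logger.info("Skipped signed PDF: OCR would invalidate its signature")
        return "skipped_signed"
    # Anything else is still treated as a real error.
    logger.error("OCR failed: %s", output.strip())
    return "error"
```

Matching on the message text is brittle; if ocrmypdf exposes a dedicated exit code for this case, checking that would be the sturdier choice.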

yeupou commented 10 months ago

Hello,

In my use case, most of the time I would not care about the original digital signature, but I do care about proper OCR. I understand that an altered file cannot retain its original signature, and I want OCR nonetheless.

But I would not use force OCR, because I do care about not destroying the original (probably best) OCR.

It would be great if this were an option like the Remove background option, because it makes perfect sense to accept the possible deletion of a digital signature in modes like skip text.

R0Wi commented 10 months ago

Current implementation plan would be like the following:

ferdiga commented 1 month ago

Please see my comments here: https://github.com/ocrmypdf/OCRmyPDF/issues/1003#issuecomment-2216803297. If the process encounters a digitally signed PDF, it could just make a copy and process the copy, marking the file with a meaningful tag such as "OCR-no-signature". I think I do not need to emphasize how important it is to OCR-scan digitally signed documents for search purposes.
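The copy-and-process suggestion could look roughly like the sketch below. It is only an illustration: `copy_for_ocr` is a hypothetical helper, and it marks the copy through a filename suffix, whereas a real integration would more likely assign a Nextcloud tag.

```python
import shutil
from pathlib import Path


def copy_for_ocr(src: str, marker: str = "OCR-no-signature") -> Path:
    """Copy a signed PDF next to the original and return the copy's path.

    The original file is never modified; only the copy would be handed to OCR.
    Assumption: the marker is embedded in the filename for illustration.
    """
    source = Path(src)
    target = source.with_name(f"{source.stem}.{marker}{source.suffix}")
    shutil.copy2(source, target)  # copies the content and file metadata
    return target
```

The copy still contains the signature, so per the OCRmyPDF documentation quoted earlier in this thread, version 14.4.0 and later would only process it when `--invalidate-digital-signatures` is passed.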