Add support for `--output-type pdf`

SKB-CGN commented 1 year ago

Hi, I am not sure how this error is connected to an upgrade of Nextcloud from 25 to 26.

I uploaded a new PDF, which was processed but now displays the wrong character set.

This is the original text:

This is the text of the converted one:

But when I select the text with the mouse and copy it, I get the correct text, which is: anbei erhaltenSiedieBetriebskostenabrech nungfürdasJah r2022. Bei derBerech nungwirdIhrNutzungszeitraum vom 01.03.2022-31.12.2022berücksic

It would be great if you knew what kind of issue this could be.

Thank you!

R0Wi commented 1 year ago

Well since the app itself doesn't create the new PDF content I would assume there is a problem with ocrMyPdf itself. If possible please post the problematic file here or try what happens if you invoke ocrMyPdf directly from the CLI with the problematic PDF as input.

SKB-CGN commented 1 year ago

I ran these tests:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
OCR: 0%| | 0.0/4.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr

Here, the file is neither touched nor modified.

After running it with:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text
[00:00<00:00, 55.94page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
2 skipping all processing on this page
3 skipping all processing on this page
1 skipping all processing on this page
4 skipping all processing on this page
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

the file gets "corrupted".

bahnwaerter commented 1 year ago

Can you repeat your test with the -v command line option to get more verbose output? Maybe this will reveal more about the problem.

There may be a problem with the metadata of the PDF, as observed in this issue. Metadata can be preserved if the output is a regular PDF file rather than an archival PDF/A file, which is what the additional command-line option --output-type pdf produces.

R0Wi commented 1 year ago

Thanks for checking this, @bahnwaerter! If the --output-type pdf flag fixes this issue, we should think about adding it to the command invoked by this app. What do you think? Is there any particular reason why this is not the default? 😄

SKB-CGN commented 1 year ago

@bahnwaerter Sure. Here is the output:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text -v
ocrmypdf 10.3.1+dfsg
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap enabled
os.symlink(Abrechnung.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/origin)
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/origin, /tmp/com.github.ocrmypdf.u18nzsj8/origin.pdf)
pikepdf mmap enabled
pikepdf mmap enabled
Scanning contents: 100%|████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 83.58page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
Rotations for page 0: [text, auto, misalign, content] = 0, 0, 0, 0
    1 skipping all processing on this page
Rotations for page 1: [text, auto, misalign, content] = 0, 0, 0, 0
    2 skipping all processing on this page
Rotations for page 3: [text, auto, misalign, content] = 0, 0, 0, 0
    4 skipping all processing on this page
Rotations for page 2: [text, auto, misalign, content] = 0, 0, 0, 0
    3 skipping all processing on this page
OCR: 100%|█████████████████████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 207.91page/s]
Running: ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/com.github.ocrmypdf.u18nzsj8/fix_docinfo.pdf', '/tmp/com.github.ocrmypdf.u18nzsj8/pdfa.ps']
stderr = GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: PDFA doesn't allow images with Interpolate true.

Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
XrefExt(xref=23, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/optimize.opt.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf)
/tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf -> test_out.pdf
Output file is a PDF/A-2B (as expected)

bahnwaerter commented 1 year ago

Thanks for the verbose output @SKB-CGN. Now we can see that there are two problems in the input PDF file:

UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
PDFA doesn't allow images with Interpolate true

The first problem indeed concerns the PDF metadata: the tool that generated the PDF embedded characters in the metadata using an encoding that is not permitted by the PDF/A standard. The second problem is more of a warning: the file apparently contains an image with interpolation enabled, which the PDF/A standard does not allow either.

@SKB-CGN: What tool was used to create the PDF?
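
For readers who want to see what the first Ghostscript message is about: in a PDF file, a text string that starts with the byte-order mark FE FF is encoded as UTF-16BE. The following Python sketch shows how such a DOCINFO value could be recognized. The function name and the sample title are invented for illustration, and Latin-1 is used only as an approximation of PDFDocEncoding.

```python
import codecs
from typing import Tuple


def describe_pdf_text_string(raw: bytes) -> Tuple[str, str]:
    """Return (encoding label, decoded text) for a raw PDF text string.

    Hypothetical helper for illustration; not part of ocrmypdf or Ghostscript.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        # A leading FE FF byte-order mark marks the string as UTF-16BE.
        return "UTF16BE", raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding, which differs
    # from it in a handful of code points.
    return "PDFDocEncoding", raw.decode("latin-1")


if __name__ == "__main__":
    # Invented sample value, not taken from the reported file.
    title = codecs.BOM_UTF16_BE + "Abrechnung für 2022".encode("utf-16-be")
    print(describe_pdf_text_string(title))
    print(describe_pdf_text_string(b"Abrechnung"))
```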

bahnwaerter commented 1 year ago

> If the --output-type pdf flag fixes this issue, we should think about adding it to the command invoked by this app. What do you think? Is there any particular reason why this is not the default?

Sure, we could add this flag to the command-line call performed by this app. However, it should never be set by default, only when necessary (especially for PDF files that were not generated in accordance with the PDF/A standard).

Can we create an optional configuration in the workflow settings for this?

SKB-CGN commented 1 year ago

@bahnwaerter The tool is 'WISO Vermieter', a German tool from Buhl Data for creating invoices.

R0Wi commented 1 year ago

> Can we create an optional configuration in the workflow settings for this?

Sure, sounds like the best and most flexible solution 👍

> @bahnwaerter The tool is 'WISO Vermieter', a German tool from Buhl Data for creating invoices.

@SKB-CGN maybe that's one for the Buhl Data support team. I'd suggest they should produce PDF/A compliant documents 😄

SKB-CGN commented 1 year ago

@R0Wi Perhaps they should. But you know - big company with their own rules 😁

R0Wi commented 1 year ago

So to summarize: the problem mentioned here is mainly related to PDF/A compliance issues in the input file which ocrmypdf cannot handle.

For this app, this means releasing a new feature:

Introduce a new per-workflow settings switch "Output type pdf" which (if set) sets --output-type pdf. If not set, --output-type is omitted.
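
The proposed switch boils down to conditionally appending one option to the argument list. Here is a minimal Python sketch of that logic. It is not taken from the app's code, and the function and parameter names are hypothetical.

```python
from typing import List


def build_ocrmypdf_args(
    input_pdf: str,
    output_pdf: str,
    output_type_pdf: bool = False,
) -> List[str]:
    """Build an ocrmypdf argument list for one workflow.

    Hypothetical helper; `output_type_pdf` mirrors the proposed
    per-workflow switch "Output type pdf".
    """
    args = ["ocrmypdf"]
    if output_type_pdf:
        # Switch set: request a regular PDF instead of PDF/A.
        args += ["--output-type", "pdf"]
    # Switch not set: --output-type is omitted and ocrmypdf uses its default.
    args += [input_pdf, output_pdf]
    return args


if __name__ == "__main__":
    print(build_ocrmypdf_args("Abrechnung.pdf", "test_out.pdf", output_type_pdf=True))
```

Keeping the option out of the list entirely when the switch is off means existing workflows keep their current behaviour.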

bahnwaerter commented 1 year ago

Thanks @SKB-CGN for sharing the tool's name.

It is most likely that this tool does not create PDF/A-compatible documents. However, the tool may also implement the latest version of the PDF/A standard, which Ghostscript may not support yet. Feel free to check the PDF/A version of your document. If that version is supported by Ghostscript, then the tool is faulty. In that case, we would appreciate it if you contacted the Buhl Data support team and reported the error.
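
One way to check which PDF/A version a document claims is to look for the `pdfaid:part` and `pdfaid:conformance` entries in its XMP metadata. The Python sketch below does this with a plain byte search. It assumes the XMP packet is stored uncompressed in the file, the helper name is hypothetical, and a declared version is only a claim, not proof of conformance.

```python
import re
from typing import Optional

# The XMP entries can appear as elements (<pdfaid:part>2</pdfaid:part>)
# or as attributes (pdfaid:part="2"); both forms are matched.
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def pdfa_claim(pdf_bytes: bytes) -> Optional[str]:
    """Return the declared PDF/A level (e.g. "PDF/A-2B"), or None if absent.

    Hypothetical helper; it reads the declaration and validates nothing.
    """
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    conformance = _CONFORMANCE.search(pdf_bytes)
    level = conformance.group(1).decode().upper() if conformance else ""
    return f"PDF/A-{part.group(1).decode()}{level}"
```

It could be called as `pdfa_claim(open("Abrechnung.pdf", "rb").read())`. A result of `None` means that no declaration was found by this simple search.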

@R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all flags for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.

R0Wi commented 1 year ago

> @R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all flags for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.

@bahnwaerter In general I agree. But still I'm not sure if we really need a dropdown for this (I think we would need both a switch and a dropdown then...) since according to the docs:

By default, OCRmyPDF produces archival PDFs – PDF/A, which are a stricter subset of PDF features designed for long term archives. If regular PDFs are desired, this can be disabled with --output-type pdf.

So my understanding is that omitting the --output-type flag is basically the same as setting --output-type pdfa?

bahnwaerter commented 1 year ago

Yes, the documentation clearly states that OCRmyPDF's default behaviour is the same as explicitly setting --output-type pdfa. Because of this, we don't actually need a dropdown list of flags, so I totally agree with you.

If at some point the two-valued configuration logic is no longer sufficient, we can always introduce a dropdown list of flags.

SKB-CGN commented 11 months ago

Hi, because the pages showed the wrong characters after conversion, I changed the workflow to use:

skip file completely

which now produces a lot of warnings/errors inside Nextcloud.

Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200

Should this normally be suppressed as the application should not produce any output?

R0Wi commented 11 months ago

> Hi, because the pages showed the wrong characters after conversion, I changed the workflow to use:
>
> skip file completely
>
> which now produces a lot of warnings/errors inside Nextcloud.
>
> Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200
>
> Should this normally be suppressed as the application should not produce any output?

This is a known problem which we're currently working on (see https://github.com/R0Wi-DEV/workflow_ocr/issues/232)

R0Wi commented 11 months ago

@SKB-CGN FYI: while implementing https://github.com/R0Wi-DEV/workflow_ocr/pull/233, I reviewed your problem as well. I came to the conclusion that logging a warning is mandatory whenever ocrmypdf writes something to stderr, but I removed the notification that was sent in that case. Unfortunately (in my opinion), we cannot reliably tell whether a stderr message can be ignored. For example, parsing the message, searching for "PriorOcrFoundError", and not logging anything if the OCR mode is set to "skip file" seems quite error-prone to me and depends heavily on the ocrmypdf version in use. We could even miss other warnings printed by ocrmypdf if we suppressed warnings in general. So to me, that would just be bad design.

As a workaround, please raise your log level so that, for example, only errors are logged. You can also use log rotation to control the size of your logs.
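
The logging policy described in this comment (log whatever the tool writes to stderr, without inspecting the message) can be sketched in Python as follows. This is an illustration only and not the app's implementation. The logger name and the function are hypothetical.

```python
import logging
import subprocess
import sys
from typing import List, Optional

logger = logging.getLogger("workflow_ocr_sketch")  # hypothetical logger name


def run_and_log(command: List[str]) -> Optional[str]:
    """Run a command and log anything it wrote to stderr as a warning.

    The message is deliberately not parsed: every non-empty stderr output
    is logged, whatever its wording. Returns the logged text, or None.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    stderr_text = result.stderr.strip()
    if not stderr_text:
        return None
    logger.warning("command wrote to stderr: %s", stderr_text)
    return stderr_text


if __name__ == "__main__":
    # Stand-in child process that only writes to stderr; no ocrmypdf needed.
    demo = [sys.executable, "-c", "import sys; sys.stderr.write('demo warning')"]
    run_and_log(demo)
```

Because nothing depends on the wording of the message, this approach is unaffected by changes in the tool's output between versions. The cost is the extra log entries discussed above.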

R0Wi-DEV / workflow_ocr

Add support for `--output-type pdf` #222