eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

FR merge multiple items into a single multipage attachment #1105

Open whysthatso opened 2 years ago

whysthatso commented 2 years ago

tldr: if technically possible, could merging items get an option to also merge their attachments into a single multipage document?

long: i have previously had a hard time importing multipage scans into docspell. my scanner (a simple canon flatbed controlled with scanservjs) spits out multiple jpgs in a zip container. when i let consumedir import this, the joex log shows that processing succeeds in creating an item with two individual attachments, but for some reason the tesseract job fails:

2021-10-06T14:03:06: ===== Start reprocessing ======
2021-10-06T14:03:06: Loaded item and 2 attachments to reprocess
2021-10-06T14:03:06: Converting file Some(scan_2021-01-29 07.36.04 1.jpg) (image/jpeg) into a PDF
2021-10-06T14:03:07: Storing input to file /tmp/docspell-convert/docspell-tesseract1005841414723466335/infile for running tesseract
2021-10-06T14:03:08: Running external command: tesseract /tmp/docspell-convert/docspell-tesseract1005841414723466335/infile out -l deu pdf txt
2021-10-06T14:03:09: Command `tesseract /tmp/docspell-convert/docspell-tesseract1005841414723466335/infile out -l deu pdf txt` finished: 1
2021-10-06T14:03:09: tesseract stdout:
2021-10-06T14:03:09: tesseract stderr: Tesseract Open Source OCR Engine v4.1.1 with Leptonica Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2624; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.
2021-10-06T14:03:09: PDF conversion failed: Command result=1. No output file found.. Go without PDF file
2021-10-06T14:03:09: Closing process: `tesseract /tmp/docspell-convert/docspell-tesseract1005841414723466335/infile out -l deu pdf txt`
2021-10-06T14:03:09: Converting file Some(scan_2021-01-29 07.36.04 2.jpg) (image/jpeg) into a PDF
2021-10-06T14:03:09: Storing input to file /tmp/docspell-convert/docspell-tesseract11874634666246798395/infile for running tesseract
2021-10-06T14:03:09: Running external command: tesseract /tmp/docspell-convert/docspell-tesseract11874634666246798395/infile out -l deu pdf txt
2021-10-06T14:03:10: Command `tesseract /tmp/docspell-convert/docspell-tesseract11874634666246798395/infile out -l deu pdf txt` finished: 1
2021-10-06T14:03:10: tesseract stdout:
2021-10-06T14:03:10: tesseract stderr: Tesseract Open Source OCR Engine v4.1.1 with Leptonica Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2800; nwarn = 2 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.
2021-10-06T14:03:10: PDF conversion failed: Command result=1. No output file found.. Go without PDF file

so the item is there, but no previews/pdfs are generated. downloading the original archive or the individual extracted original files works fine, and the files are good.

when i put these images individually into the consumedir, all is well: they get imported, items get created, previews and pdfs get generated.

it dawned on me when i saw the created item, however, that docspell will always just add the individual jpgs as individual attachments to an item. hence my question: is it possible to add an option to merge them? or what is the way to import them so that they become multipage attachments rather than per-page attachments?

whysthatso commented 2 years ago

just to be clear: the processing problem only came up while experimenting with different imports; it's not really related to this feature request.

eikek commented 2 years ago

Hi @whysthatso, thanks for the suggestion. It is possible to do this. I'd like to have some basic pdf tools in the future, like rotating and maybe deleting pages, so merging fits in quite well. But tbh it's not high on the list currently. It is something I would consider, though.

The processing problem sounds like a bug - are you saying that when you upload a zip file containing 2 jpegs, it doesn't work, but when you upload these two jpegs separately, it does? If so, and if it's possible, it would be great if you could create a sample file for me and open an issue - then I could look into it.
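
When preparing such a sample, it may help to check whether a JPEG is cut off, since the tesseract output above complains about a "premature end of data segment". The following is only a rough sketch, not part of docspell: it tests whether a file starts with the JPEG start-of-image marker and ends with the end-of-image marker. The function name `jpeg_looks_complete` is made up for this example. Passing the check does not prove a file is valid, and a file with padding after the end marker would be flagged even though it may be fine.

```python
import sys
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_looks_complete(data: bytes) -> bool:
    """Heuristic only: the data starts with SOI and ends with EOI."""
    return data.startswith(SOI) and data.endswith(EOI)


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_jpeg.py page1.jpg page2.jpg
    for name in sys.argv[1:]:
        ok = jpeg_looks_complete(Path(name).read_bytes())
        print(f"{name}: {'looks complete' if ok else 'possibly truncated'}")
```

If the files extracted from the zip pass this check while processing still fails, that would be worth mentioning in the issue.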

Currently, there is no option to import multiple files into a single document. You'd need some script that combines these files into one before moving it into the consumption directory. A workaround would be to create an item with multiple files, though - the way you tried with a zip file. With the dsc tool and the web upload form you can also upload multiple files and tell it to be one item.
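
A script that combines the files before they reach the consumption directory could look roughly like the sketch below. It is not part of docspell and rests on several assumptions: the scanner produces a zip of JPEGs, the external `img2pdf` command line tool is installed, and the function names (`extract_pages`, `img2pdf_command`, `merge_zip_to_pdf`) are made up for this example. Pages are ordered by a natural sort of their file names so that page 2 comes before page 10.

```python
import re
import shutil
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path


def natural_key(name: str):
    """Sort key so that 'scan 2.jpg' sorts before 'scan 10.jpg'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def extract_pages(archive: Path, workdir: Path) -> list[Path]:
    """Extract the JPEG members of the zip into workdir, in page order."""
    pages = []
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist()
                 if n.lower().endswith((".jpg", ".jpeg"))]
        for name in sorted(names, key=natural_key):
            # Flatten folders inside the zip; assumes file names are unique.
            target = workdir / Path(name).name
            target.write_bytes(zf.read(name))
            pages.append(target)
    return pages


def img2pdf_command(pages, output) -> list[str]:
    """Build the img2pdf call: img2pdf page1.jpg page2.jpg -o out.pdf"""
    return ["img2pdf", *map(str, pages), "-o", str(output)]


def merge_zip_to_pdf(archive: Path, consume_dir: Path) -> Path:
    """Turn a zip of JPEG pages into one PDF inside consume_dir."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        pages = extract_pages(archive, workdir)
        if not pages:
            raise ValueError(f"no JPEG pages found in {archive}")
        merged = workdir / (archive.stem + ".pdf")
        subprocess.run(img2pdf_command(pages, merged), check=True)
        # Move the PDF only after img2pdf has finished writing it.
        final = consume_dir / merged.name
        shutil.move(str(merged), str(final))
        return final


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python merge_scan.py scan.zip /path/to/consumedir
    print(merge_zip_to_pdf(Path(sys.argv[1]), Path(sys.argv[2])))
```

The PDF is built in a temporary directory first so that the consumption directory only ever receives the finished file, not the individual pages.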

gandy92 commented 2 years ago

Basic pdf tools would be great, indeed (no pressure on the eta!) :+1: - imho, a much appreciated feature in that regard would be splitting up documents: more often than not I receive a PDF containing the cover note, some contract addition and some more or less unrelated information, all in one file. Currently, I open that with pdfarranger to split it up into individual files. Thankfully I can export the pieces directly to the docspell watch folder. Retrieving the already submitted multidocument file from docspell for postprocessing is a bit cumbersome, which currently keeps me from using an IMAP source.
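
For files where the page ranges are known up front, the manual split described above could also be scripted outside docspell. The sketch below is only an illustration under assumptions: it relies on the external `qpdf` command line tool being installed, and the function names, part names and page ranges are invented for the example. In qpdf's range syntax, `z` stands for the last page.

```python
import subprocess
from pathlib import Path


def qpdf_split_commands(source: Path, parts: dict[str, str],
                        out_dir: Path) -> list[list[str]]:
    """Build one qpdf call per part; parts maps a name to a page range."""
    commands = []
    for name, page_range in parts.items():
        target = out_dir / f"{name}.pdf"
        commands.append(["qpdf", "--empty", "--pages", str(source),
                         page_range, "--", str(target)])
    return commands


def split_pdf(source: Path, parts: dict[str, str], out_dir: Path) -> None:
    """Run the qpdf calls, writing each part into out_dir."""
    for command in qpdf_split_commands(source, parts, out_dir):
        subprocess.run(command, check=True)


# Example call (hypothetical file and folder names):
# split_pdf(Path("letter.pdf"),
#           {"cover-note": "1", "contract-addition": "2-4", "info": "5-z"},
#           Path("/path/to/watch-folder"))
```

This does not replace a visual tool like pdfarranger when the split points have to be found by looking at the pages.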

No expectations on an ETA of any kind, this merely is a nice-to-have in case those pdf tools make it into docspell one day. I'm glad docspell is available to me, thank you for all your efforts!