Open pschichtel opened 8 months ago
Text extraction and auto tagging seem to work fine; it's just the PDF with selectable text that is missing, which I like to use.
Thanks for reporting! Is this the docker setup? or how do you run docspell?
Hi @pschichtel - it looks like according to your output that the PDF conversion failed, and you said that you've scanned many documents successfully before.
It sounds like this is an issue specifically with converting this document. I encountered a similar issue.
For scanning this PDF, let's try editing your configuration a bit. In the /etc/docspell-joex/docspell-joex.conf config, try adding "--output-type", "pdf", to the options (this should come after --skip-text) and then go ahead and restart docspell-joex.
# The `--skip-text` option is necessary to not fail on "text" pdfs
# (where ocr is not necessary). In this case, the pdf will be
# converted to PDF/A.
ocrmypdf = {
  enabled = true
  command = {
    program = "ocrmypdf"
    args = [
      "-l", "{{lang}}",
      "--skip-text",
      "--deskew",
      "--output-type", "pdf",
      "-j", "1",
      "{{infile}}",
      "{{outfile}}"
    ]
After editing so it appears similar to the excerpt above, restart docspell-joex.
sudo systemctl restart docspell-joex
or use an equivalent command if on docker.
Try rescanning the document and see if this line from your failed job disappears:
Tue, February 20th, 2024, 21:03: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
Let us know if that worked. It would be good to know if using "--output-type", "pdf", was a better default than PDF/A. Similar to #2486
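To make the suggested edit concrete, here is a minimal Python sketch that puts the two new arguments right after --skip-text in an argument list like the one in the excerpt. The function name add_output_type is hypothetical and only illustrates where the arguments go; Docspell itself reads the list from its config file.

```python
def add_output_type(args, output_type="pdf"):
    """Return a copy of args with --output-type inserted after --skip-text."""
    if "--output-type" in args:
        return list(args)  # already configured, leave untouched
    result = list(args)
    # Insert right after --skip-text if present, otherwise at the front
    pos = result.index("--skip-text") + 1 if "--skip-text" in result else 0
    result[pos:pos] = ["--output-type", output_type]
    return result


# The default argument list from the config excerpt (placeholders kept as-is)
default_args = ["-l", "{{lang}}", "--skip-text", "--deskew",
                "-j", "1", "{{infile}}", "{{outfile}}"]
patched_args = add_output_type(default_args)
print(patched_args)
```

The exact position among the other options should not matter much to ocrmypdf, as long as the options come before the input and output files.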
> Thanks for reporting! Is this the docker setup? or how do you run docspell?
I'm running it in kubernetes with my own helm chart.
@tenpai-git I can check that later today. Just one thing I want to add: I don't think it's an issue with this specific document for 2 reasons:
Can I safely downgrade 0.41.0 to 0.40.0 ?
Thanks for getting back to me @pschichtel - given those reasons, it may well be that another part of docspell is involved. I'm not sure about the downgrade, but why don't you install ocrmypdf locally and give it a try, to see if we can isolate whether the issue is related to it?
ocrmypdf -l deu ./input_pdf.pdf ./output.pdf --output-type pdf
Depending on the file you might need to add the --skip-text flag to the above command as well.
Try on both a known working previous document and the new document and see if there's any difference.
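The comparison could be scripted roughly like this. It is only a sketch: the helper names and the two file names are made up, and it assumes ocrmypdf is on the PATH when run_ocr is actually called.

```python
import subprocess


def build_ocrmypdf_cmd(infile, outfile, lang="deu", skip_text=False):
    """Build the ocrmypdf command line used in the manual test above."""
    cmd = ["ocrmypdf", "-l", lang, "--output-type", "pdf"]
    if skip_text:
        cmd.append("--skip-text")
    return cmd + [infile, outfile]


def run_ocr(infile, outfile, **options):
    """Run ocrmypdf and return its exit code (0 means success)."""
    cmd = build_ocrmypdf_cmd(infile, outfile, **options)
    return subprocess.run(cmd).returncode


# Hypothetical file names: one known-good scan and the failing one.
# Only the commands are printed here; call run_ocr() to execute them.
for name in ("known_good.pdf", "failing.pdf"):
    print(" ".join(build_ocrmypdf_cmd(name, "out_" + name)))
```

If both files give the same exit code locally but one fails inside Docspell, that points away from the document and towards the environment.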
@pschichtel Rather than downgrading by using a previous database backup and the previous version, maybe try upgrading to the nightly 0.42.0 version? I am using PostgreSQL and was actually noticing a similar problem with a couple of PDFs as I was testing, but then upgrading resolved it for me.
Please let me know if the other test had any different results.
@tenpai-git I tried upgrading to nightly, that didn't change anything sadly. I tried the command from the job log on my system and that worked without issues. The version of ocrmypdf from the arch AUR is 16.1.1 while the version in the joex container is 15.4.2. I tried the same with docker.io/jbarlow83/ocrmypdf:v15.4.2 which also worked without issue, so it really seems like something is off with the joex image. I'll play around with the joex images.
Sadly the joex container doesn't build for commits older than bb181f1819fedbc495b664c3448a9fbb318b00c6 and that commit is already broken for me.
Ok I worked around the build issue and bisected the problem to 90972a0cc01517150c70e27f2f776d37bc783c00, which is the alpine image update. When I build the image from master with the base image changed to the previous alpine:3 (or the more specific alpine:3.19.1) it also works again.
So I assume some dependencies are somehow incompatible in alpine:edge. It doesn't seem like any of the directly installed alpine packages have any major releases/changes between 3.19.1 and edge.
Hi @pschichtel and @tenpai-git thanks for taking a deep look here. I can tell that maintaining the docker images is a real pain for me. One mistake was to have alpine edge as the base image. Don't remember why that is, actually. I had many problems with ocrmypdf on alpine in general. I want now to pull that docker image stuff outside the repo, because I just don't have the time to hunt down these things so often. Another option I was thinking about is to provide docker images based on the nix setup. Anyways, perhaps as mentioned in https://github.com/eikek/docspell/issues/2502#issuecomment-1962860204, it might be good to move kubernetes + docker to a separate repo, where people with better knowledge in that space can operate.
Ok I can ACK that the tesseract issue is happening within the current docker build because of error messages written to stdout starting with Error (see https://github.com/ocrmypdf/OCRmyPDF/blob/16ab4a8b4ec82175880f235953d99e9c5265b634/src/ocrmypdf/_exec/tesseract.py#L130).
Inside the joex container you get something like this calling the tesseract binary:
3486937fbd9c:~# tesseract --list-langs
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
[DS] Device: "(null)" (Native) evaluation...
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
Error in pixSubtract: pixs1 not defined
Error in pixOpenBrick: pixs not defined
Error in pixOpenBrick: pixs not defined
[DS] Device: "(null)" (Native) evaluated
[DS] composeRGBPixel: 0.019891 (w=1.2)
[DS] HistogramRect: 0.093792 (w=2.4)
[DS] ThresholdRectToPix: 0.048494 (w=4.5)
[DS] getLineMasksMorph: 0.000075 (w=5.0)
[DS] Score: 0.467569
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.467569
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces
dan
deu
est
Unfortunately there are lines starting with "Error", so ocrmypdf thinks there is serious trouble getting the installed languages and aborts the OCR process.
If you run the ocrmypdf CLI manually inside the container a second time, it finds the tesseract_opencl_profile_devices.dat written during the first execution (the one that printed the Error lines to stdout) and finally does the processing.
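The effect can be illustrated with a few lines of Python. This is not OCRmyPDF's actual code (see the link above for that), just a sketch of a check on lines starting with "Error". The second sample is hypothetical and assumes the profiling lines are absent once the profile file exists.

```python
def find_error_lines(tesseract_output):
    """Return the output lines that start with 'Error'."""
    return [line for line in tesseract_output.splitlines()
            if line.startswith("Error")]


# Shortened version of the first-run output shown above
first_run = """[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
List of available languages in "/usr/share/tessdata/" (23):
ces
dan"""

# Hypothetical second-run output without the profiling noise
second_run = """List of available languages in "/usr/share/tessdata/" (23):
ces
dan"""

print(len(find_error_lines(first_run)))   # 2
print(len(find_error_lines(second_run)))  # 0
```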
Somehow the recent Alpine version of tesseract was compiled with opencl support. Now that feature first tries to locate GPU drivers and does some profiling. The result is written into the *.dat file (see https://github.com/tesseract-ocr/tesseract/blob/94bd98b7ef8e05319301a4879fbc10d11d68ebc7/src/opencl/openclwrapper.cpp#L2357).
However on a fresh Docspell run that profile is missing... tesseract starts looking for OpenCL... it finds some errors... but eventually writes that damn *.dat file, aborting the processing... that leads to ocrmypdf returning exit code 3 and aborting the mess it created. Docspell receives exit code 3 and aborts PDF/A processing.
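For reading such job logs, a small lookup of ocrmypdf exit codes can help. The mapping below is a partial list as I understand it from the OCRmyPDF documentation, so treat it as an assumption and check the docs for your version; the helper name is made up.

```python
# Subset of OCRmyPDF exit codes (assumed from its documentation)
OCRMYPDF_EXIT_CODES = {
    0: "ok",
    1: "bad arguments",
    2: "input file problem",
    3: "missing dependency",
    6: "page already has text",
}


def explain_exit_code(code):
    """Return a short description for an ocrmypdf exit code."""
    return OCRMYPDF_EXIT_CODES.get(code, "unknown exit code %d" % code)


print(explain_exit_code(3))
```

Under that reading, exit code 3 fits the picture: ocrmypdf treats the failed language listing as a problem with its tesseract dependency, not with the PDF.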
To solve this issue:
- Change to the /tmp folder before executing the ocrmypdf CLI (see my temporary workaround down below) and add or create the required *.dat file inside the Dockerfile when building the image, OR
- [...] tesseract binary

Patch ocrmypdf to change the working directory to /tmp, where the *.dat file will be stored (i.e. outside the volatile convert directory Docspell is removing automatically).
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from os import chdir

from ocrmypdf.__main__ import run

if __name__ == "__main__":
    # Run from /tmp so tesseract's profile *.dat file ends up (and is found) there
    chdir("/tmp")
    sys.argv[0] = re.sub(r"(-script\.pyw|\.exe)?$", "", sys.argv[0])
    sys.exit(run())
Replace /usr/bin/ocrmypdf with the above patched version. I added the chdir() call to change to the /tmp folder.
The first execution of ocrmypdf from the Docspell process will fail, but you'll find the *.dat file inside the /tmp folder. Alternatively you can run cd /tmp && tesseract --list-langs inside the container to create the required profile file before Docspell processes your scans, which prevents the first-time failure.
Any further Docspell PDF/A calls will now find that .dat file and processing will work as expected. Beware that the /tmp folder is volatile: if you re-create the vanilla container, the .dat file is lost and needs to be re-created on the first PDF/A run (which will fail). So better patch, add the *.dat file and commit your changes as an updated local container image.
a script executed from the entrypoint might be a reasonable place to put the initial tesseract --list-langs. On the other hand: wouldn't it be easier to just install the necessary driver components in the container? I have a GPU available that I could use with this, might be interesting to try.
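A warm-up step along those lines could look like the following sketch. The function name and the injectable runner parameter are my own; the profile file name and the tesseract --list-langs call are taken from the output above. It assumes tesseract writes the profile into its current working directory, as described in this thread.

```python
import os
import subprocess

PROFILE_NAME = "tesseract_opencl_profile_devices.dat"


def warm_up_tesseract(workdir="/tmp", runner=subprocess.run):
    """Create tesseract's OpenCL profile file in workdir if it is missing.

    Returns True if the profile file exists afterwards.
    """
    profile = os.path.join(workdir, PROFILE_NAME)
    if not os.path.exists(profile):
        # tesseract writes the profile into its current working directory
        runner(["tesseract", "--list-langs"], cwd=workdir, check=False)
    return os.path.exists(profile)
```

Calling warm_up_tesseract() once before joex starts would only help together with the patched ocrmypdf above, since ocrmypdf still has to run from the same directory to find the file.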
Yea maybe that would speed up the OCR stuff.. however, as I understand the tesseract code, the profile .dat file is loaded at startup and, if not found, a fresh one is created... this on the other hand is a problem for Docspell, as the converter temp folders (containing the initial .dat file created) will be removed, so the tesseract process will never start OCRing your file on the next run. In the end the .dat file must be created beforehand (at container startup is a good idea) and kept outside the volatile folders removed by Docspell. The tesseract CLI simply looks inside the current working directory for the .dat file; that's why the patched ocrmypdf changes it to /tmp. Docspell, I guess, is CWDing into the random volatile /tmp/docspell-converter/... folder where the scan and its OCRed output will be located.
setting ENV TESSERACT_OPENCL_DEVICE=1 could also fix the issue. tesseract will still do its device profiling every time, but since an explicit choice is given by env it will not fail.
opencl packages seem to be a mess on alpine, there is only really rusticl which doesn't seem to implement what tesseract requires.
@eikek If we don't know a reason for going with alpine:edge (given how old that change is, I assume whatever dependency update was desired is probably already released in alpine:3), can we just revert this commit until the whole community-managed docker idea is implemented and "better" images are provided?
> setting ENV TESSERACT_OPENCL_DEVICE=1 could also fix the issue. tesseract will still do its device profiling every time, but since an explicit choice is given by env it will not fail.
I tried that and still the profile *.dat is required and aborts processing on first run...
114caa7b8fe6:/tmp# ocrmypdf -l deu --skip-text --deskew -j 1 test.pdf hello.pdf
Tesseract failed to report available languages. __main__.py:69
Output from Tesseract:
-----------
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
[DS] Device: "(null)" (Native) evaluation...
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
Error in pixSubtract: pixs1 not defined
Error in pixOpenBrick: pixs not defined
Error in pixOpenBrick: pixs not defined
[DS] Device: "(null)" (Native) evaluated
[DS] composeRGBPixel: 0.017175 (w=1.2)
[DS] HistogramRect: 0.088656 (w=2.4)
[DS] ThresholdRectToPix: 0.033539 (w=4.5)
[DS] getLineMasksMorph: 0.000053 (w=5.0)
[DS] Score: 0.384571
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.384571
[DS] Selected Device[1]: "(null)" (Native)
[DS] Overriding Device Selection (TESSERACT_OPENCL_DEVICE=1, 1)
[DS] Overridden Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces
> @eikek If we don't know a reason for going with alpine:edge (given how old that change is, I assume whatever dependency update was desired is probably already released in alpine:3), can we just revert this commit until the whole community-managed docker idea is implemented and "better" images are provided?
Sure! Whatever makes this part easier is a plus for me. I can't remember why it is set to alpine:edge. I can't think of a reason why I would do it. Perhaps it was some missing dependency/newer version.
If I understand your analysis, tesseract (now?) needs a separate file that it will create on a first run? If that is so, I think docspell needs to provide some kind of non-volatile cache place for such things. Of course, this is a bit unfortunate from docspell's point of view, because it is now more tricky to maintain. Perhaps tesseract could be configured to use a specific directory.
> Sure! Whatever makes this part easier is a plus for me. I can't remember why it is set to alpine:edge. I can't think of a reason why I would do it. Perhaps it was some missing dependency/newer version.
The commit that switched to edge is from may last year. Alpine 3.19.1 is from end of last month, so I don't see any risk here. A patch release with the revert would be very appreciated.
> If I understand your analysis, tesseract (now?) needs a separate file that it will create on a first run? If that is so, I think docspell needs to provide some kind of non-volatile cache place for such things. Of course, this is a bit unfortunate from docspell's point of view, because it is now more tricky to maintain. Perhaps tesseract could be configured to use a specific directory.
It's weird. The alpine edge version of tesseract is 5.3.4 compared to 5.3.3 in stable, so just a patch version difference. Nothing in the diff seems related. Also the package build between stable and edge is basically identical (compare stable and edge). I took the 3.19.1 based container I built locally and just upgraded tesseract-ocr to edge and the problem started. I assume something in the build environment of tesseract changed with edge.
I think the culprit is here: https://github.com/eikek/docspell/pull/2066 so it's not your fault @eikek :) Renovate bot somehow managed to automerge a major release change from 3 to 20230329 at the time of the PR. This change, be it a Renovate bug or image retagging whatsoever, switched from the 3 branch to edge branch with further major edge branch updates.
The Renovate docs say that major release changes for Dockerfiles must be explicitly activated, but the PR summary created by Renovate shows a different picture.
As @pschichtel suggested, I think the best solution without any hacks would be to tag the docker base image using a major.minor.patch semantic version. In that case, when you use FROM alpine:3.19.1, Renovate should only upgrade minor and patchlevel versions, if we do believe the docs.
I compared some build logs for the alpine 3 branch vs. the edge branch related to the tesseract package, and it seems that the stable branch didn't use the --enable-opencl flag while edge enables it. Tesseract sadly has no option to disable opencl support, so you have to decide at compile time if you want to use it or not. Once activated this becomes an issue for Docspell, because tesseract creates that profile *.dat file in the current working directory. Again there is no option for the tesseract CLI to change that profile location, so ocrmypdf would need to change its CWD to allow tesseract to save its profile file inside the current working directory (that's what I patched with my ugly workaround).
At work my team is also using Renovate, but we also pin the docker image by hash, just in case someone pushes a new image using the same version tag. Renovate will detect this and create a PR with the updated docker image fingerprint. The idea behind pinning your image with a digest is to have immutable builds using the same exact base image used in previous builds, even if the registry image was updated behind the scenes.
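To show the difference between the pinning levels discussed here, a small sketch that classifies an image reference from a FROM line. The function name and the three labels are my own and have nothing to do with how Renovate decides internally.

```python
import re

SEMVER_TAG = re.compile(r"^\d+\.\d+\.\d+$")
DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def pin_level(image_ref):
    """Classify a Docker image reference as 'digest', 'patch' or 'floating'."""
    name, _, digest = image_ref.partition("@")
    if digest:
        return "digest" if DIGEST.match(digest) else "floating"
    # Everything after the last colon is treated as the tag
    _, _, tag = name.rpartition(":")
    return "patch" if SEMVER_TAG.match(tag) else "floating"


for ref in ("alpine:edge", "alpine:3", "alpine:3.19.1"):
    print(ref, "->", pin_level(ref))
```

Both alpine:edge and alpine:3 count as floating here, because the image behind them changes over time; only a full major.minor.patch tag or a digest is treated as pinned.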
> I think the culprit is here: #2066 so it's not your fault @eikek :)
Ah amazing thank you for digging this out :) Still should know how these tools work ;-)
I totally agree to use a stable tag for the image! I didn't know that renovate would still update minor and patch, I think this is perfect then. I would be also fine with pinning it using a hash.
Thanks a lot to both of you for analysing this so well! 💯
I'm thinking about creating a new docker image manually (0.41.0-1 or similar) with alpine:3 (or hash, whatever you prefer) and the other fix regarding tesseract language (#2479).
I now changed the base image as suggested to 3.19.1 here - do you think this is enough to fix this immediate problem?
Also, I assume when alpine edge becomes stable, we need to deal with this dat file somehow right?
I pushed new images, unfortunately my brain didn't work so well and I pushed under the same tags... I wanted to create different version, of course. Curious to see if that helps now.
yep, it works again.
Should we close this or keep it open to discuss solutions for the issue once it comes back in the future?
> Should we close this or keep it open to discuss solutions for the issue once it comes back in the future?
My opinion would be to close it if it is working again. It could be confusing when looking at an open issue, when the problem has been fixed. Next time this comes up, we should be able to dig the issue out - open or closed. But it's not a strong opinion at all, happy to leave it open if you prefer.
Alpine Linux is removing the --enable-opencl flag from the build and latest should start working again soon.
I am running the 0.42.0 image, but still have the issue as described above. Am I missing something? I thought it was fixed.
Thu, September 12th, 2024, 11:20: Storing input to file /tmp/docspell-convert/docspell-ocrmypdf5771335655514927351/infile for running ocrmypdf
Thu, September 12th, 2024, 11:20: Trying to read the PDF using 0 passwords
Thu, September 12th, 2024, 11:20: Running external command: ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf5771335655514927351/infile /tmp/docspell-convert/docspell-ocrmypdf5771335655514927351/out.pdf
Thu, September 12th, 2024, 11:20: Waiting for command to terminate…
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Tesseract failed to report available languages.
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Output from Tesseract:
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: -----------
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]:
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Device: "(null)" (Native) evaluation...
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Error in pixCloseBrick: pixs not 1 bpp
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Error in pixSubtract: pixs1 not defined
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Device: "(null)" (Native) evaluated
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] composeRGBPixel: 0.038811 (w=1.2)
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] HistogramRect: 0.158294 (w=2.4)
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] ThresholdRectToPix: 0.150574 (w=4.5)
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] getLineMasksMorph: 0.000115 (w=5.0)
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Score: 1.104640
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Scores written to file (tesseract_opencl_profile_devices.dat).
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Device[1] 0:(null) score is 1.104640
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: [DS] Selected Device[1]: "(null)" (Native)
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: List of available languages in "/usr/share/tessdata/" (24):
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: ces
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: dan
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: deu
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: eng
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: est
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: fin
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: fra
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: heb
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: ita
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: jpn
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: jpn_vert
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: khm
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: lav
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: lit
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: nld
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: nor
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: pol
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: por
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: ron
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: rus
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: slk
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: spa
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: swe
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]: ukr
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]:
Thu, September 12th, 2024, 11:20: [ocrmypdf (err)]:
Thu, September 12th, 2024, 11:20: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
actually I noticed that my recently added documents have not been OCR'ed, so I assume this issue had a come-back in the new container image.
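Since the failure is easy to miss, a job log could be checked for it with something like the sketch below. The regular expression is based only on the log line shown above; the function name is made up.

```python
import re

FAILURE = re.compile(r"PDF conversion failed: Command result=(\d+)")


def conversion_failures(job_log):
    """Return the exit codes of all failed PDF conversions in a job log."""
    return [int(m.group(1)) for m in FAILURE.finditer(job_log)]


# Two lines taken from the job log above
sample = (
    "Thu, September 12th, 2024, 11:20: Waiting for command to terminate…\n"
    "Thu, September 12th, 2024, 11:20: PDF conversion failed: "
    "Command result=3. No output file found.. Go without PDF file\n"
)
print(conversion_failures(sample))  # [3]
```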
Oh no, this is really sad! In 0.42.0 the base image is 3.20.2 and there was a recent update to 3.20.3 (https://github.com/eikek/docspell/commit/0657175da01ab1af5a6ac4588322938c4337e391) - does it work on the snapshot version?
I tried with your nightly version of joex, but still the same result. I think that one already uses the correct one as it was built after the commit.
@eikek why don't you use the ocrmypdf image as a base for your image? That one is based on alpine as well, but it is probably better optimized for our use case.
Or use their Dockerfile as a base for yours:
https://github.com/ocrmypdf/OCRmyPDF/blob/main/.docker/Dockerfile.alpine
seems like a reasonable idea to base at least the joex image on this. If I find some spare time on the weekend I might setup a new container over at docspell/docker.
@tiborrr oh yes, good idea! It uses alpine 3.19.1 - and I don't really care what it uses (alpine or not). This makes sense for the joex component, the restserver doesn't need ocrmypdf.
The image makes choices for the user on which languages are installed. Do we want to add more languages or is the default ok?
https://github.com/eikek/docspell/issues/2779
We can continue the discussion here.
If no one has an issue with the resulting image size, we might as well install all languages automatically. I'd have to see how large the image would be. Alternatively, some option could be added to install additional languages during startup, but those would be installed on every startup.
I don't have any issues with the image being a little bit bigger. If you update your docker image then docker compose will first pull the new image and will then replace the existing one with minimal to no downtime.
I'm also fine with a bigger image size. the main intention for the docker images was to have a convenient start with docspell.
Interesting finding: the ocrmypdf image is also broken once updated to alpine 3.20, which is not surprising given the issue is tesseract and not ocrmypdf. I'll create an issue over there to discuss this, because I assume this will eventually affect them too.
a first version of the new image is available at https://github.com/docspell/docker/pkgs/container/joex
it's based on ubuntu:24.10 and docspell 0.42.0 and installs all available tesseract packages ubuntu provides.
@pschichtel your link does not work (yet). At what branch did you make this change?
for future reference this is what OCRmyPDF says in its alpine image.
Note: Alpine 3.20 builds tesseract with --enable-opencl, which is not
supported by anyone. OCRmyPDF is not compatible with Alpine 3.20.0
through 3.20.3. The Alpine issue should be fixed in 3.21.0. It is
not clear if 3.20.4+ will have the fix.
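The version ranges from that note can be written down as a small check. This is only a reading of the note plus what was observed in this thread (3.19.1 worked), not an official compatibility table; the function name and labels are made up, and it expects a numeric version such as 3.20.2, not a tag like edge.

```python
def tesseract_opencl_status(alpine_version):
    """Classify an Alpine release per the OCRmyPDF note quoted above."""
    parts = tuple(int(p) for p in alpine_version.split("."))
    major_minor = parts[:2]
    patch = parts[2] if len(parts) > 2 else 0
    if major_minor == (3, 20):
        # 3.20.0 through 3.20.3 are stated as incompatible; later 3.20.x is unclear
        return "affected" if patch <= 3 else "unknown"
    if major_minor >= (3, 21):
        return "expected fixed"
    # Earlier releases are assumed fine because 3.19.1 worked in this thread
    return "not affected"


for version in ("3.19.1", "3.20.2", "3.20.4", "3.21.0"):
    print(version, "->", tesseract_opencl_status(version))
```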
@tiborrr unfortunately the package has been set to private automatically and I can't change that.
I also received a response on my issue at ocrmypdf:
https://github.com/ocrmypdf/OCRmyPDF/issues/1395#issuecomment-2351836729
So our options for a quickfix here would be: switch to 3.19.* or switch to edge. My container uses ubuntu.
For quickly fixing the docker images, I could do again a rebuild using alpine 3.19.1 (as this has been working) wdyt? maybe there is a new docker build for the next release then
That would be the quickest and easiest fix I guess. You have my blessing to do so.
Then we can figure out a new strategy in the mean time
@tiborrr the ubuntu-based image I referenced is accessible now
@pschichtel I will test Monday, I only have access to test servers during office hours 😅
@pschichtel I tested your Ubuntu-based image today and can confirm that Tesseract works as expected (converting scans to text now works like a blitz).
I've now also successfully re-processed a bunch of documents with my ubuntu-based container, which worked without issues. I think the container is a drop-in replacement.
Oh I'm sorry, totally forgot about this issue 😞 Could still do the image update or just switch to the ubuntu based image?
I'm on version 0.41.0 and I just noticed that I can't select text in my imported PDF (a scanned document).
Looking at the job log I found this:
I don't think I ever saw this error when importing my ~1000 documents.