ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)
https://papermerge.com
Apache License 2.0

OCR saying "unsupported format" for PDF and JPG file #587

Closed Chavell3 closed 5 months ago

Chavell3 commented 5 months ago

Hi Team,

After a bunch of tries I have now successfully set up Papermerge. By the looks of it, all connections between the instances are working, but when I upload a file and try to run OCR manually, it fails. In the logs of the worker node I see the message "unsupported format", and it happens for both PDF and JPG files.

But according to the documentation, both PDF and JPG files should work.

docker compose.txt PM_worker_log.txt

Any idea what I could change to make it work?

Info:

Thanks!

Chavell3 commented 5 months ago

Also interesting: if I try to run the OCR manually, nothing about a new task is logged on the worker node. I would have expected to see a new task there.

But on the web service I see the following log, and something seems to be wrong there: PM_web_log.txt

ciur commented 5 months ago

@Chavell3

Does it happen for all the PDF and JPG files you've tried, or only for some of them? Would you mind attaching problematic files (one PDF and one JPG) to this ticket so that I can troubleshoot it?

thndrbck commented 5 months ago

Check to see if the file uploaded completely. Also check to see that the file is actually pdf or jpg. I had one file rejected because it had the wrong extension (three letters after the dot in the file name), and a few that didn't completely upload when I tried uploading 30 at a time.
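The extension check suggested above can be scripted. The following is a minimal sketch, not part of Papermerge: it compares each file's extension with the MIME type that file(1) detects from the content. The function names expected_mime and check_upload are made up for illustration, and only the two formats discussed in this thread are mapped.

```shell
#!/bin/sh
# Map a file name's extension to the MIME type we would expect for it.
# Only PDF and JPG are covered here; everything else is "unknown".
expected_mime() {
    # Lowercase the extension so "SCAN.PDF" and "scan.pdf" behave the same.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        pdf)      echo "application/pdf" ;;
        jpg|jpeg) echo "image/jpeg" ;;
        *)        echo "unknown" ;;
    esac
}

# Compare the expected type with what file(1) detects from the content.
check_upload() {
    want=$(expected_mime "$1")
    got=$(file --mime-type -b "$1")
    if [ "$want" = "$got" ]; then
        echo "OK        $1 ($got)"
    else
        echo "MISMATCH  $1 (extension suggests $want, content is $got)"
    fi
}

# Usage: sh check_uploads.sh file1.pdf file2.jpg ...
for f in "$@"; do
    check_upload "$f"
done
```

A MISMATCH line points to a wrong extension. A truncated upload may still report the right type, so comparing file sizes with the originals is a separate check.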

Chavell3 commented 5 months ago

I don't think it's a matter of a specific file. I have now uploaded about 6 additional files (PDFs and JPEGs) and none of them gets scanned. Any idea what else I could do to troubleshoot this?

Small side note: although I entered the volumes for media and database in the compose file, those folders still stay empty. I added a picture of that. If you would still like to have some files, just give me a shout, but I don't think it's file related.

msedge_01312024_212907

putty_01312024_212823

Thanks for the help.

ciur commented 5 months ago

Run the following command in the worker container:

/usr/bin/file --mime-type -b <path-to-pdf-or-jpg-file-you-have-uploaded>

e.g.

 /usr/bin/file --mime-type -b /core_app/media/docvers/67/88/67883da5-2626-4d8a-9cbd-e861abce863c/1706648353180682379972750067052.jpg.pdf

and tell me the result here

Chavell3 commented 5 months ago

Okay... the folder "media" does not even exist under /core_app

image

But my fault... wait let me test something...

Chavell3 commented 5 months ago

I have now added the Docker volumes manually to mount them to the folders I want.

putty_02012024_112327

But it still does not seem to mount those volumes correctly. I did find the issue: the worker and web nodes are somehow not mounting the media volume correctly, although it is listed when running "df".

Worker-Node: putty_02012024_112459

Web-Node: putty_02012024_112540

Somehow the web node has access to that volume but the worker node does not. Interestingly, /dev/md0 is my RAID device where I want the files to be saved, but I want to use a subfolder, DMS/papermerge/..

The storage configuration in the docker compose file looks like this: msedge_02012024_113342

But there, too, /dev/md0 on its own is not defined anywhere; it's always some subfolder (either "docker" or "DMS").

Chavell3 commented 5 months ago

It seems like "/dev/md0" is just the name that is displayed, and the volume is correctly mounted to my subfolder within the directory. When I browse the container's FS and compare it with the local FS where it should be located, the files match.

putty_02012024_115043

That may mean the worker node is not able to mount the volume "MEDIA" because of some permission problem, and the same for the web node, because it does not create any files in that folder.

Chavell3 commented 5 months ago

I think the issue is that the folder "media" under /core_apps does not exist, so the volume cannot be mounted under that directory. When I start the web node and log in to the container, the folder "media" does not exist there either.

putty_02012024_201628

But the difference seems to be that the web node creates that folder when the first file is uploaded, while the worker node tries to read from a directory that does not exist, because it was never created or correctly mounted.

I created the folder "media" for the web and worker nodes, then stopped and started them again, but unfortunately it still did not mount the volume by the looks of it, because I still can't see the data that has now been created under /core_apps/media
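Whether a path inside a container exists and is really a mount point can be checked directly instead of inferred from "df". This is a rough sketch that assumes a Linux container with /proc/mounts and grep available; check_mount is a hypothetical helper name, not a Papermerge or Docker command.

```shell
#!/bin/sh
# Report whether a directory exists and whether it is a mount point.
# Pass the path without a trailing slash, as it appears in /proc/mounts.
check_mount() {
    dir=$1
    if [ ! -d "$dir" ]; then
        # Nothing is mounted here and nobody created the directory either.
        echo "missing"
    elif grep -qF " $dir " /proc/mounts 2>/dev/null; then
        # The path shows up as a mount target in the kernel's mount table.
        echo "mounted"
    else
        # A plain directory: files written here stay inside the container.
        echo "not-mounted"
    fi
}

# Usage inside the web or worker container:
#   check_mount /core_app/media
```

If the web and worker containers give different answers for the same path, they are not sharing the volume.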

Chavell3 commented 5 months ago

But even if I create that folder, build a new repository from the running container (with the "media" folder) and rebuild my whole Papermerge environment, it does not seem to work, because the files created in those volumes are still not visible on the host or do not exist there.

Chavell3 commented 5 months ago

Okay, shame on me... all my fault. First I tried to mount the host folders directly and messed up the config there. After I fixed that, I wrote "/core_apps" instead of "/core_app". Now that I have corrected that, everything works as expected.
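A one-letter typo in the container-side path is easy to overlook in a long compose file. As a rough aid, the sketch below lists the "source -> target" pair of every short-syntax volume entry so the targets can be compared at a glance. It is plain text matching, not a YAML parser: it assumes unquoted "- source:/target" lines without spaces in the source, and list_volume_targets is a name invented for this example.

```shell
#!/bin/sh
# Print "source -> target" for each short-syntax volume entry in a compose file.
# Port mappings such as "- 8000:80" are skipped because the target must start with "/".
list_volume_targets() {
    grep -E '^[[:space:]]*-[[:space:]]*[^:[:space:]]+:/' "$1" |
        sed -E 's/^[[:space:]]*-[[:space:]]*//; s/:/ -> /'
}

# Usage on the host, next to the compose file:
#   list_volume_targets docker-compose.yml
```

In the output, a target such as /core_apps/media stands out next to the /core_app/media path used in the command earlier in this thread.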