eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails, and other sources, with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Extracted text is unreadable (random glyphs) for PDFs with Japanese text #2445

Closed: lehnerpat closed this issue 4 months ago

lehnerpat commented 8 months ago

Intro

Hi everyone,

thank you very much for creating and working on Docspell!

I've been wanting to get started with digitally organizing my documents for a while now. I found Docspell as one solution that might work well for me, so I've started trying it out.

One thing upfront: my use case is probably a bit unusual, since I have documents in three languages (German, English, and Japanese) that I want to put into my archive/DMS. (Note: while I do have a few documents with mixed languages in the same document, we can ignore that for now and focus only on single-language documents.)

Problem summary

For some PDFs that contain Japanese text, the "extracted text" in Docspell is just some random glyphs. This is specifically about PDFs that already contain text (I'm pretty sure it's not an OCR issue). I've also noticed that this problem doesn't occur for all Japanese-text PDFs, but I don't know the cause.

Reproducing the problem on v0.40.0

  1. Set up Docspell with docker compose, following the docker compose section of the installation manual. Since I wanted to use a release version, I deviated from the manual by downloading the docker compose file from tag v0.40.0 instead. Specifically, I did these steps:

    > cd /tmp
    > pwd
    /tmp
    > mkdir -p docspell/docker/docker-compose
    > cd docspell/docker/docker-compose
    > pwd
    /tmp/docspell/docker/docker-compose
    > wget https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    --2023-12-30 16:45:25--  https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4740 (4.6K) [text/plain]
    Saving to: ‘docker-compose.yml’
    
    docker-compose.yml  100%[===================>]   4.63K  --.-KB/s    in 0s
    
    2023-12-30 16:45:25 (20.2 MB/s) - ‘docker-compose.yml’ saved [4740/4740]
    
    > ls -A
    docker-compose.yml
    > docker-compose up -d
    [+] Building 0.0s (0/0)                                    docker:desktop-linux
    [+] Running 8/8
    ✔ Network docker-compose_default                  Created                 0.0s
    ✔ Volume "docker-compose_docspell-postgres_data"  Created                 0.0s
    ✔ Volume "docker-compose_docspell-solr_data"      Created                 0.0s
    ✔ Container docspell-solr                         Started                 0.0s
    ✔ Container postgres_db                           Started                 0.1s
    ✔ Container docspell-joex                         Started                 0.1s
    ✔ Container docspell-restserver                   Started                 0.1s
    ✔ Container docspell-consumedir                   Started                 0.0s
    • FYI, here are the containers that were created, and the images that they use:

      > docker ps
      CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS                   PORTS                    NAMES
      1045dbbbd81a   docspell/dsc:latest          "dsc -d http://docsp…"   5 minutes ago   Up 5 minutes                                      docspell-consumedir
      9afa894f6acb   docspell/joex:latest         "/opt/joex-entrypoin…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7878->7878/tcp   docspell-joex
      90faedb8cdca   docspell/restserver:latest   "/opt/docspell-rests…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7880->7880/tcp   docspell-restserver
      d0ea7652c5a4   solr:9                       "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes (healthy)   8983/tcp                 docspell-solr
      adf3c972c07f   postgres:15.2                "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes             5432/tcp                 postgres_db
      
      > docker image ls
      REPOSITORY           TAG     IMAGE ID       CREATED        SIZE
      solr                 9       3c38c30d646b   13 days ago    593MB
      postgres             15.2    bf700010ce28   8 months ago   379MB
      docspell/dsc         latest  54d581f6c5a1   9 months ago   20.1MB
      docspell/joex        latest  d129a81f07fd   9 months ago   1.99GB
      docspell/restserver  latest  1e700758d41a   9 months ago   336MB
  2. Open the web UI at http://localhost:7880 and create a new collective + user using the "Sign up!" button.

    • Collective ID: issuerepro
    • User Login: issuerepro
    • Password: issuerepro
  3. Download two example documents that contain Japanese text: the visa form 000472926.pdf and the insurance document standard_jyusetsu_20191201.pdf.

  4. Upload the documents to Docspell via the web UI:

    • Open the dashboard (http://localhost:7880/app/dashboard), and log in with user issuerepro that we created above.
    • Choose the files via drag-and-drop or using the "Select..." button in the drop area on the dashboard.
    • Click "Submit".
    • Wait for processing to finish (should be relatively quick, since no OCR needs to be done).
  5. Open the visa document 000472926.pdf, and go to "View extracted data":

    • (screenshot of the extracted-data view omitted)
    • The data looks pretty good: some extraneous whitespace, but overall mostly the right Japanese characters. Small sample and screenshot:

      身元保証書
      令和 € 月 日
      
      大 使 □
      
      在 日本国 殿
      
      総領事 □
      
      ビ ザ 申 請 人
      ※氏名必z旅券N~²ルフ±ベット表記w記載しvください。申請人|複数~場合{ï表者~身分事項²ñO{記入
      
      ~Nÿ申請人名簿²添付しvください。
      
      国 籍
      
      職 業

      (screenshot omitted)

  6. Open the insurance document standard_jyusetsu_20191201.pdf, and go to "View extracted data":

    • The data looks pretty bad: there are some Japanese characters in there, but there are a lot of random glyphs between them. Small sample and screenshot:

      ス¿ンÀーù÷害ß険 ݉Ï項~t®明ÿ݉Ï項®明þĀ
      イン¿ーネッø募Ö}
      
      契}概‰~t®明û注意喚起å報~t®明
      ■s~þ÷1¹¿ンÀーù÷ûßþ~4t~÷ùン<ë転ÎUけß険 ÿバイ¿ûĀ=1<ë転ÎUけß険
      
      ÿバイ¿û ベスøĀ=1<Á¼~ß険 îšÏ故=1<Á¼~ß険 å~~Ï故={·y»Ý‰zÏ項²®nw
      vい~y2tY}_{ßzz¯{zº1Y}w¿÷{uÛ~うえ1uÛÕ容{誤º|zいsx²úw1z
      w¿€uい2
      
      ■s~þ÷1tY}{·y»yyv~Õ容²š載wvい»‚~wあº~{³2詳}{tいv<tY}~w
      zºÿnšßþ}款û{}ÖĀ={š載wvい~y2_ûーĀúー¸{‚掲載wvい~y~w1߉{ßxv
      t参照€uいÿhttps://www.au-sonpo.co.jp/Ā2zz1t郵‘²希望u¼»|\aumß»¹¿þー»ン
      ¿ーxtËn€uい2

Some more version / environment information

Conclusion

I hope this report contains enough information to make the issue clear and to let you (try to) reproduce it.

Please let me know if there's any other information I can contribute to diagnose this.

Based on some web research, I'm afraid this issue might actually be related to how the PDF (and the fonts in it) are encoded; possibly some fonts are not properly included in the insurance document. I still hope there's something we can find out about this.

At the moment, I haven't found any other eDMS software that seems to fit my needs better or that handles Japanese PDFs better. So while I'm still a bit hesitant to invest completely into Docspell, I'm willing to try to diagnose and hopefully fix or mitigate this issue :)

lehnerpat commented 8 months ago

Quick update:

I saw from the worker logs and some web research that this is probably related to PDFBox. I noticed that v0.40.0 uses PDFBox v2.0 and that PDFBox v3.0 was released quite recently, and that Docspell's master branch is already updated to PDFBox v3.0 👀

So I re-tried using the nightly versions (i.e., using :nightly instead of :latest for all three docspell images in the docker compose file), specifically:

docspell/joex        nightly  b5506a7ff399   26 hours ago   2.09GB
docspell/restserver  nightly  1dbcd0bd96c6   26 hours ago   333MB
docspell/dsc         nightly  880e97d301f3   3 months ago   22.8MB
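For reference, the switch is just an edit of the three `image:` lines. A minimal sketch, using a hypothetical stripped-down compose excerpt (the real docker-compose.yml contains much more configuration):

```shell
# hypothetical stripped-down excerpt of the compose file, for illustration only
cat > /tmp/compose-demo.yml <<'EOF'
services:
  docspell-restserver:
    image: docspell/restserver:latest
  docspell-joex:
    image: docspell/joex:latest
  docspell-consumedir:
    image: docspell/dsc:latest
EOF

# rewrite the three docspell image tags from :latest to :nightly
sed -i 's|\(image: docspell/[a-z]*\):latest|\1:nightly|' /tmp/compose-demo.yml
grep 'image:' /tmp/compose-demo.yml
# prints the three image lines, now tagged :nightly
```

On a real setup, run the same sed against the downloaded docker-compose.yml, then `docker-compose pull && docker-compose up -d`.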

And the extraction now works much better! 🙌 🎉

Once I re-import the insurance document that was previously problematic, it looks like this:

(screenshot of the re-imported document's extracted data omitted)

It would be great if you could prepare another Docspell release that includes PDFBox 3 soon 🙇

eikek commented 8 months ago

Hi @lehnerpat thank you for the very detailed report. My first guess when reading was "fonts" as well. It is sometimes difficult to diagnose this. Did I understand correctly (just to reassure myself) that pdfbox 3 fixed your issues?

I will try to make a release before long.

lehnerpat commented 8 months ago

Hi @eikek, thank you for your quick response!

Did I understand correctly (just to reassure myself) that pdfbox 3 fixed your issues?

Yes, for the 2 files I tested, pdfbox 3 worked much better and fixed the text extraction issues 👍

I will try to make a release before long.

Great, thank you!

tenpai-git commented 7 months ago

Hi @eikek - I am having the same problem here as @lehnerpat - however, I am not using docker. I am using a manual install with PostgreSQL.

How can I get docspell to parse with different fonts? I tried installing a fair number of fonts on my system, but it didn't seem to make much of a difference.

How can I configure the manual install to use pdfbox3? I didn't see pdfbox in the apt repository.

This would be hugely helpful and I'd try to promote docspell more in the Japanese opensource community.

eikek commented 7 months ago

Hi @tenpai-git pdfbox is not an external tool, but a library used by docspell. It is updated to its current version on the master branch. If you install the snapshot versions from the release page, you'll use pdfbox3. For the system fonts, there is no easy way to know (that I know…). You need to install the fonts that are used in your pdfs (if they are not embedded in the pdf itself).

tenpai-git commented 7 months ago

Thanks @eikek - I see it now; sorry about that - and happy to report that it worked for me too!

Initially I had these:

    /usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-2.0.27.jar
    /usr/share/docspell-joex/lib/org.apache.pdfbox.fontbox-2.0.27.jar
    /var/lib/docspell/.pdfbox.cache

With the 0.41 nightly, I deleted the cache, and the lib directory then showed:

    /usr/share/docspell-joex/lib/org.apache.pdfbox.fontbox-3.0.1.jar
    /usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-io-3.0.1.jar
    /usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-3.0.1.jar

Initially it didn't work right after installation and restarting the services, but after a reboot of the LXC itself and installing a bunch of Japanese fonts (mincho, noto-sans-cjk, etc.) it's now perfect and reading Japanese pretty well! I can also scan far more by setting JAVA_OPTS="-Xmx3000M" or so in /etc/default/docspell-joex. I think you need a bit more memory than the default or recommended 1400M for OCR'ing more visually complex languages.
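As a drop-in fragment (the path and variable are from my manual install; 3000M is just the value that worked for me):

```
# /etc/default/docspell-joex
# more heap than the default, for OCR'ing visually complex languages
JAVA_OPTS="-Xmx3000M"
```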

It really does take quite a bit of time to look up a kanji you don't know, so this will dramatically improve quality of life for anyone learning Japanese and the QR Code upload is so helpful for specifying different languages! You're doing great work @eikek - thank you! I think this alone makes it worth pushing it in the new build, since it largely increases the potential audience/use of Docspell!

tenpai-git commented 7 months ago

Oof - there is one thing I am noticing. When I do a full-text search for an OCR'd kanji, it doesn't seem to appear. I am certain that the text exists in the metadata, though. Postgresql is my backend.

Perhaps it is a problem with database encoding. How can I manually run the PostgreSQL full-text search that Docspell performs on the extracted text, to determine whether the problem is in postgresql or somewhere else?

@lehnerpat Can you search for that extracted metadata on your insurance form?

Metadata being read in correctly (screenshot omitted).

But when I search through it, no result is returned, although I am certain the document was included in the search (screenshot omitted).

eikek commented 7 months ago

I think this alone makes it worth pushing it in the new build, since it largely increases the potential audience/use of Docspell!

Thank you for your kind words @tenpai-git ! I'll try to make a release soon. I had hoped to get some issues solved first, but maybe I'll do a release first.

For the search issue: are you using postgresql as the search engine? Unfortunately, I have no experience with these languages. I think it is quite likely that PostgreSQL has no default support for your language. In that case you need to look through their docs and create a text search configuration. Then you can set this in the docspell config (there are some docs on this in the Docspell documentation).
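Roughly, and only as a sketch (assuming you first created a PostgreSQL text search configuration named `japanese`; the key names here should be checked against the configuration reference), the docspell side could look like:

```
full-text-search {
  backend = "postgresql"
  postgresql {
    # map a docspell language to the name of a PostgreSQL
    # text search configuration created beforehand
    pg-config = {
      japanese = "japanese"
    }
  }
}
```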

tenpai-git commented 7 months ago

Okay, I had to dig into Postgres, Tesseract, and ocrmypdf a bit to figure some things out, but I was able to fully recreate the results @lehnerpat posted, though I'm still having mixed results with the full-text search.

First off, I think the docspell database in PostgreSQL should use UTF-8 encoding for full-text search and storage. I don't know whether the Docker container creates this encoding automatically, since I haven't tried it, but it was something I overlooked in the manual install. A note in the manual installation section recommending UTF-8 encoding might be worthwhile.

If you've ever dealt with this error coming from a non-western default, it's quite annoying, so I'll list all the steps I used to fix it (and fix your template permanently) on the manual install. Take multiple backups of all your databases off-server just in case of error; I was able to fix this without losing anything with the following steps:

  1. Shut down the docspell-joex and docspell-restserver services:

      sudo systemctl stop docspell-joex
      sudo systemctl stop docspell-restserver

  2. Run a database backup as the postgres or DB admin user. This will take time; wait for it to finish. (Probably also a good idea to scp the backup off the server.)

      pg_dump docspelldb > docspelldb_backup.sql

  3. Open psql, hold your breath, and drop that database:

      DROP DATABASE docspelldb;

(If you want to fix this problem permanently for all your databases, continue with steps 4-8; if you only want to do this once, skip to step 9b. A warning: steps 4-8 change your overall PostgreSQL template settings.)

  4. Release the lock on template1, because we want to create a database with UTF-8 encoding based on a new template and not the default:

      UPDATE pg_database SET datistemplate = FALSE WHERE datname = 'template1';

  5. Drop the old template:

      DROP DATABASE template1;

  6. Create a new template with UTF8 encoding:

      CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'UNICODE';

  7. Set it as a template again:

      UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'template1';

  8. Connect to the new template and run VACUUM FREEZE on it, since it's just a template and not a full database:

      \c template1
      VACUUM FREEZE;

  9a. If you followed steps 4-8, make the new docspell database (docspell is my user):

      CREATE DATABASE docspelldb WITH OWNER = 'docspell' ENCODING = 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE = 'template0';

  9b. If you skipped here, make the new database like this (docspell is my user, docspelldb is my database name):

      CREATE DATABASE docspelldb WITH OWNER = 'docspell' ENCODING = 'UTF8' TEMPLATE = 'template0';

(Do not do 9b if you followed steps 4-8 and did 9a. Unrelated to this issue, but if you're doing this just to fix full-text search for US English rather than other languages, you could optionally add LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' after the ENCODING clause.)

  10. Exit psql with \q and restore your backup to the db. It will take a while:

      psql docspelldb < docspelldb_backup.sql

  11. Start the docspell server components again:

      sudo systemctl start docspell-joex
      sudo systemctl start docspell-restserver

As long as you used the same database name and user, docspell-joex and docspell-restserver should connect fine to the new db after restarting. You can make a new user and update your docspell configuration if necessary.
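To verify the result afterwards, a standard catalog query inside psql shows each database's encoding (docspelldb should now report UTF8):

```sql
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = 'docspelldb';
```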

However, the db encoding and the pdfbox upgrade alone are not enough. Using the insurance document that @lehnerpat provided, I started getting gibberish glyph output again after some system updates and after installing new fonts while trying to improve accuracy. This confounded me for a while, but thanks to the processing logs and the comments in the default config, I found that ocrmypdf was somehow failing on the conversion to PDF/A.

After messing with the file locally enough times, I figured out that `ocrmypdf -l jpn ./input_pdf_ins.pdf ./output --output-type pdf --skip-text` gave me good results; the solution was forcing `--output-type` to `pdf`.

So in the `/etc/docspell-joex/docspell-joex.conf` config I added `"--output-type", "pdf",` to the options (it should come after `--skip-text`; a restart of docspell-joex is required):

    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "--output-type", "pdf",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]
      }
    }

And perfect! I recreated @lehnerpat's output.

Some example output in the metadata:

契約概要のご説明・注意喚起情報のご説明 

■この書面は、スタンダード傷害保険の4つのプラン「自転車向け保険 Bycle(バイクル)」、「自転車向け保険 Bycle 

Best(バイクル ベスト)」、「ケガの保険 交通事故」、「ケガの保険 日常の事故」に関する重要な事項を説明し
ています。ご契約前に必ずお読みになり、契約申込画面に入力のうえ、入力内容に誤りがないことを確認し、お
申込みください。 

Everything seems to be stored correctly in the database and the tools are working as intended, but then I took some pictures of documents and realized vertical text support wasn't working and was also producing gibberish glyphs. So I dug into Tesseract, installed the tesseract-ocr-jpn-vert package, and got good test results with the following:

tesseract ~/Downloads/OCR_JP_TEST_VERTICAL.png output -l jpn_vert -c preserve_interword_spaces=1

You can temporarily rig the normal "Japanese" setting to this configuration in the /etc/docspell-joex/docspell-joex.conf config like this, if you have the proper tesseract-ocr-jpn-vert package and its dependencies (don't try to upload non-vertical Tesseract languages with these settings in place; a restart of docspell-joex is required):

    # To convert image files to PDF files, tesseract is used. This
    # also extracts the text in one go.
    tesseract = {
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "out",
          "-l",
          "{{lang}}_vert",
          "-c",
          "preserve_interword_spaces=1",
          "pdf",
          "txt"
        ]
      }
    }

The package provides the vertical model as jpn_vert, and -c preserve_interword_spaces=1 removes pesky spaces the language doesn't need in a book layout. The output is absolutely perfect and produces workable horizontal text from the vertical input with about 95% accuracy (only making minor errors on highly complex or uncommon kanji, lots of small furigana, or around line breaks). If you dig deeper into Tesseract you might find ways to improve that on the source-picture side (removing alpha, clean borders, increasing resolution, eliminating furigana below a certain pixel size, etc.), but it worked well enough for my purposes. ImageMagick can fix a lot (including removing alpha) if you're having issues with the source image for Tesseract. The metadata was very clean and search worked great.

So all in all that came out pretty okay, but I noticed something still inconsistent with the search...

The tesseract output is completely searchable and it has no database issues as far as I can tell.

But even when the metadata output from ocrmypdf is perfect, the full-text search still wasn't finding certain terms or words.

So even though I get the metadata and can read it, this text still does not search well. I tried deleting and re-uploading the file, but I was still confused why this was happening.

After spending a lot of time querying different results, I found that the Japanese tokenization relies either on punctuation or on new lines, and this seemed to occur only with the ocrmypdf file, not the tesseract file.

So searching for a "keyword" in Japanese is still not possible if it's loaded in from ocrmypdf I think.
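A plain-shell stand-in for the tokenizer makes the problem visible. This is only an illustration (PostgreSQL's parser is more elaborate and also splits on punctuation, while awk splits on whitespace only), but the point is the same: a Japanese clause contains no delimiters, so it survives as one token that a shorter keyword can never match.

```shell
# count whitespace-separated tokens per line: the Japanese clause
# stays a single token, the English one splits into three
printf '%s\n' '契約概要のご説明' 'insurance policy summary' \
  | awk '{print NF " token(s): " $0}'
# 1 token(s): 契約概要のご説明
# 3 token(s): insurance policy summary
```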

Still, this is a lot of progress! It's possible to upload, scan horizontal, scan vertical, get good output, and search in Japanese (if you use the whole line for files on ocrmypdf, or outright for anything else).

I'm not sure whether the underlying tokenization issue is in ocrmypdf, pdfbox, or postgresql, but I found a potential way to at least manually search postgresql in Japanese. If you access your database directly, it seems you can install this PostgreSQL script for advanced Japanese search. I will give it a try to see how it works. I guess @eikek I am wondering whether I can directly access textsearch_ja.sql like in this answer and run something like SELECT ja_wakachi('分かち書きを行います。'); directly from a docspell query? I can try this solution and run it directly on my database (where is the ocrmypdf metadata from an attachment stored?).

I also think it would help if there was a default language option for Japanese (Vertical) using the above Tesseract options by default in a future version.

Thank you again @eikek for this software - it's a pleasure to work with. If I hadn't seen the processing log so easily, I might never have figured this much out. It wasn't hard to understand the two config files and their inputs after digging in for a few hours, and this beautiful simplicity makes Docspell far more usable than other solutions I've tried for multilinguals. I will keep working on and committing time to Docspell as my DMS of choice as I figure out more.

eikek commented 7 months ago

Hi @tenpai-git thank you for this detailed description! I'm sure this will help other people going in the same direction! It is amazing how deep you dove into this stuff!

I am wondering if I can directly access textsearch_ja.sql like in this answer and run something like SELECT ja_wakachi('分かち書きを行います。'); directly from docspell query?

I don't think this works directly from docspell. The query you can type in is parsed separately. There is no way to give it raw sql.

I also think it would help if there was a default language option for Japanese (Vertical) using the above Tesseract options by default in a future version.

Yes, I think this is a great idea. There must be better configuration per language, I suppose. It just never occurred so far :) It should be possible to specify commands per language, perhaps. This can be a separate issue, so it is easier to track.

The output-type=pdf option seems to be a good idea to use in general, if I understand correctly?

where is the ocrmypdf metadata stored from an attachment?

For a pdf, Docspell first tries to read an existing text layer. If that yields too few characters, it does ocr with tesseract. If ocrmypdf was successful, tesseract is (usually) not tried, because the text is already present in the pdf. You can force ocr to always run by setting pdf-min-length to some negative value. This text then ends up in the attachmentmeta table, column content.
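As a sketch, that setting lives in the joex extraction section of the config (check the default config file for the exact path; -1 is just an example of a negative value):

```
docspell.joex.extraction {
  pdf {
    # if fewer characters than this are extracted, ocr is done;
    # a negative value therefore always forces ocr
    min-length = -1
  }
}
```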


I'm sure you already thought about it: another option is to try solr. It is also a beast to configure (at least for non-western stuff, I'd think). I would expect solr to provide decent support for other languages through its plugins/modules, but otoh I never tried it myself with these languages, so it's just a guess.

PS: If I missed/overlooked some question from above, please ask again 🙏🏼

eikek commented 7 months ago

@lehnerpat do you think this can be closed now that a new release has been made?

@tenpai-git I think it would be nice to create smaller issues from your findings to better support your language. Please feel free to create them - like what configuration would be needed to render the correct commands etc.

lehnerpat commented 7 months ago

Thank you very much for following up on this!

@eikek Yes, with a new release that includes the updated pdfbox, I think this issue can be closed, since the results are likely much better already 👌

I haven't read through all the details that @tenpai-git has found out, but I think there are some very valuable findings in there. Besides making docspell work better, these insights might also help the document archiving community more generally. For example, some things could be ported to other DMSs, and other parts could be upstreamed to pdfbox/ocrmypdf (as code changes or documentation expansions).

In particular, I'm interested in ocrmypdf failing to produce a pdf/a file. Since pdf/a has some advantages for longterm storage, it would be nice to find a way to make that work even with these Japanese documents.

(I have since encountered other PDFs that are broken in similar ways and in other ways, so I'm also personally interested in this, even though I'm not using docspell!)

eikek commented 7 months ago

Hi @lehnerpat thanks for your feedback! I agree, there are insights here that could improve other tools and/or their documentation. Since my time is very tight, I'm not going after them :). But I'm interested in improving Docspell in the long term, so it can better deal with these cases. That's why I'd like to ask you to create more concrete issues for things we could improve here (like config options for the commands etc). It would also be possible to have a documentation page about Japanese or similar languages, or it could be a blog post that shows a setup for this (the site doesn't have that many pages anyway :)). I don't have a strong opinion tbh.

I think I can't help much with the ocrmypdf issues/questions, they should be probably raised/asked on their space.

tenpai-git commented 7 months ago

Hi @eikek

There is no way to give it raw sql.

Understood that it won't be possible to make those calls directly. I understand the need for input validation; allowing raw sql would probably create a vulnerability or 100. Maybe there's a way to call some of those scripts via some kind of limited functionality later. I will experiment on my own and bring it up if I find anything useful.

The output-type=pdf option seems to be a good idea to use in general, if I understand correctly?

As far as I can tell from this limited issue and sample set, I believe so. Without it, I wasn't able to process the insurance document at all, and I isolated that option as the cause. To my mind, more files being convertible is better, as long as it doesn't affect anything else. As @lehnerpat said, there may be some storage consequences.

This text then ends up in the attachmentmeta table, column content.

Thank you, and sorry to bother you with that. I will see if I can get the script working and maybe propose some kind of special search option for it later.

solr

Did consider it, but was committed to PostgreSQL on my home setup :)

I would be happy to write some documentation about this, so I guess all the possibilities include:

I will make pull requests or issues for these (3) things in some time.

Likely outside of the scope I'll work on, but for others to consider in the future:

I want to thank @lehnerpat for opening this issue and @eikek for addressing it so thoroughly. I think there's enough information here for someone to get it working temporarily if they desperately need it. I will plan on working on the other issues/pull requests in the mean time to the degree I am able.

I look forward to contributing @eikek thank you so much for your time and effort on Docspell. I am excited to use this in my infrastructure.