Closed lehnerpat closed 4 months ago
Quick update:
I saw in the worker logs and online results that this is probably related to PDFBox. I noticed that v0.40.0 uses PDFBox v2.0 and that PDFBox v3.0 was released quite recently, but Docspell's master branch is already updated to PDFBox v3.0 👀
So I re-tried using the nightly versions (i.e., using `:nightly` instead of `:latest` for all three docspell images in the docker compose file), specifically:
```
docspell/joex        nightly   b5506a7ff399   26 hours ago   2.09GB
docspell/restserver  nightly   1dbcd0bd96c6   26 hours ago   333MB
docspell/dsc         nightly   880e97d301f3   3 months ago   22.8MB
```
And the extraction now works much better! 🙌 🎉
Once I re-import the insurance document that was previously problematic, it looks like this:
It would be great if you could prepare another Docspell release that includes PDFBox 3 soon 🙇
Hi @lehnerpat thank you for the very detailed report. My first guess when reading was "fonts" as well. It is sometimes difficult to diagnose this. Did I understand correctly (just to reassure myself) that pdfbox 3 fixed your issues?
I will try to make a release soon.
Hi @eikek, thank you for your quick response!
Did I understand correctly (just to reassure myself) that pdfbox 3 fixed your issues?
Yes, for the 2 files I tested, pdfbox 3 worked much better and fixed the text extraction issues 👍
I will try to make a release soon.
Great, thank you!
Hi @eikek - I am having the same problem here as @lehnerpat - however, I am not using docker. I am using a manual install with PostgreSQL.
How can I get docspell to parse with different fonts? I tried installing a fair amount on my system but it didn't seem to make much of a difference.
How can I configure the manual install to use pdfbox3? I didn't see pdfbox in the `apt` repository.
This would be hugely helpful and I'd try to promote docspell more in the Japanese opensource community.
Hi @tenpai-git pdfbox is not an external tool, but a library used by docspell. It is updated to its current version in the master branch. If you install the snapshot versions from the release page, you'll use pdfbox3. For the system fonts, there is no easy way to know (that I know of…). You need to install the fonts that are used in your pdfs (if they are not included in the pdf itself).
Thanks @eikek - I see it now; sorry about that - and happy to report that it worked for me too!
Initially I had these:

```
/usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-2.0.27.jar
/usr/share/docspell-joex/lib/org.apache.pdfbox.fontbox-2.0.27.jar
/var/lib/docspell/.pdfbox.cache
```
With the 0.41 nightly, I deleted the cache, and it then showed:

```
/usr/share/docspell-joex/lib/org.apache.pdfbox.fontbox-3.0.1.jar
/usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-io-3.0.1.jar
/usr/share/docspell-joex/lib/org.apache.pdfbox.pdfbox-3.0.1.jar
```
Initially, it didn't work right after installation and restarting the services, but after a reboot of the LXC itself and installing a bunch of Japanese fonts (mincho, noto-sans-cjk, etc.) it's now perfect and reading Japanese pretty well! I can also scan way more by setting `JAVA_OPTS="-Xmx3000M"` or so in `/etc/default/docspell-joex`. I think you need a little more memory than the default or recommended 1400M for OCR'ing more visually complex languages.
It really does take quite a bit of time to look up a kanji you don't know, so this will dramatically improve quality of life for anyone learning Japanese and the QR Code upload is so helpful for specifying different languages! You're doing great work @eikek - thank you! I think this alone makes it worth pushing it in the new build, since it largely increases the potential audience/use of Docspell!
Oof - there is one thing I am noticing. When I do a full-text search for an OCR'd kanji, it doesn't seem to appear. I am certain that the text exists in the metadata, though. Postgresql is my backend.
Perhaps it is a problem with database encoding - how can I perform the PostgreSQL full-text search manually that Docspell performs on the DB for the extracted text to determine if it's postgresql or another issue?
@lehnerpat Can you search for that extracted metadata on your insurance form?
Metadata being read in correctly:
But if I search through it, it returns no result (I am certain the document was included in the search).
I think this alone makes it worth pushing it in the new build, since it largely increases the potential audience/use of Docspell!
Thank you for your kind words @tenpai-git ! I'll try to make a release soon. I had hoped to get some issues solved first. But maybe I'll do a release first.
For the search issues: Are you using postgresql as a search engine? Unfortunately, I have no experience with these languages. I think it is quite likely that PostgreSQL has no default support for your language. In this case you need to look through their docs and create a configuration. Then you can set this in the docspell config (there are some docs here).
Okay I had to dig into Postgres, Tesseract, and Ocrmypdf a bit to figure some things out, but I was able to fully recreate the results @lehnerpat posted, though I'm still having some mixed results with the full-text search.
First off, the PostgreSQL encoding on the docspell database should be UTF-8 for full-text search and storage, I think. I don't know if the Docker container creates this encoding automatically since I haven't tried it, but that was something I overlooked in the manual install. A good note for the manual installation section might be to use UTF-8 encoding.
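To illustrate the encoding point, here is a toy Python sketch of my own (not Docspell code): Japanese text simply cannot be represented in a single-byte database encoding such as LATIN1, while UTF-8 handles it fine.

```python
# Toy illustration (not Docspell code): why the docspell database
# needs a UTF-8 encoding to store extracted Japanese text.
text = "契約概要のご説明"

# UTF-8 can encode any Unicode text, including Japanese, and
# round-trips it losslessly.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == text

# A single-byte encoding like LATIN1 (a common non-UTF-8 server
# default) cannot represent these characters at all.
try:
    text.encode("latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```

A PostgreSQL database created with a non-UTF-8 encoding hits the server-side equivalent of this error when storing extracted Japanese text, which is why the database needs to be recreated with UTF-8.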
If you've ever dealt with this error coming from a non-western default, it's quite annoying, so I'll list all the steps I used to fix it (and fix your template permanently) on the manual install. Take multiple backups of all your databases off-server just in case of error, but I was able to fix this without losing anything by doing the following steps:
Shut down the docspell-joex and docspell-restserver services and run a database backup.

```
sudo systemctl stop docspell-joex
sudo systemctl stop docspell-restserver
```
You can run the following as the `postgres` or DB admin user. This will take time; wait for it to finish.

```
pg_dump docspelldb > docspelldb_backup.sql
```

(Probably also a good idea to `scp` the backup off the server.)
Open `psql`, hold your breath, and drop that database.

```
DROP DATABASE docspelldb;
```
(If you want to fix this problem permanently for all your databases, continue here; if you want to just do this once, skip to step 7b. A warning: steps 2-7 will change your overall PostgreSQL DB settings.)
Release the lock on template1, because we want to create a database with UTF-8 encoding based on a new template and not the default.

```
UPDATE pg_database SET datistemplate = FALSE WHERE datname = 'template1';
```
Drop the old template.

```
DROP DATABASE template1;
```
Create a new template with UTF-8 encoding.

```
CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'UNICODE';
```
Set it as a template.

```
UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'template1';
```
Connect to the new template and freeze postgresql analytics, since it's just a template and not a full database (more details here).

```
\c template1
VACUUM FREEZE;
```
7a. If you followed all the previous steps up to step 6, make the new docspell database (docspell is my user):

```
CREATE DATABASE docspelldb WITH OWNER = 'docspell' ENCODING = 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' template = 'template0';
```
7b. If you skipped here, make a new database like this (docspell is my user, docspelldb is my database name):

```
CREATE DATABASE docspelldb WITH OWNER = 'docspell' ENCODING = 'UTF8' template = 'template0';
```
(Do not do 7b if you followed steps 3-7 and did 7a. Unrelated to this issue, but if you're doing this just to fix your full-text search for multiple databases in US English rather than other languages, you could optionally add `LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8'` after `WITH`.)
Exit `psql` with `\q` and restore your backup to the db. It will take a while.

```
psql docspelldb < docspelldb_backup.sql
```
Then start the services again:

```
sudo systemctl start docspell-joex
sudo systemctl start docspell-restserver
```
As long as you used the same name and user, docspell-joex and docspell-restserver should connect fine and restart against this new db. You can make a new user and update your docspell configuration if necessary.
However, the db support and the pdfbox upgrade alone are not enough. Using the insurance document that @lehnerpat provided, I also started getting some gibberish glyph output after I made some system updates unrelated to the db and installed new fonts trying to make it more accurate. This confounded me for a while, but thanks to the processing logs and the comments in the default config, I was able to find out that `ocrmypdf` was failing somehow on the conversion to PDF/A.
After messing with the file locally enough times, I figured out that `ocrmypdf -l jpn ./input_pdf_ins.pdf ./output --output-type pdf --skip-text` gave me good results, and the solution was forcing the `--output-type` to `pdf`.
So in the `/etc/docspell-joex/docspell-joex.conf` config I added `"--output-type", "pdf",` to the options (this should come after `"--skip-text"`; a restart of docspell-joex is required):
```
# The `--skip-text` option is necessary to not fail on "text" pdfs
# (where ocr is not necessary). In this case, the pdf will be
# converted to PDF/A.
ocrmypdf = {
  enabled = true
  command = {
    program = "ocrmypdf"
    args = [
      "-l", "{{lang}}",
      "--skip-text",
      "--deskew",
      "--output-type", "pdf",
      "-j", "1",
      "{{infile}}",
      "{{outfile}}"
    ]
  }
}
```
And perfect! I recreated @lehnerpat's output.
Some example output in the metadata:

```
契約概要のご説明・注意喚起情報のご説明
■この書面は、スタンダード傷害保険の4つのプラン「自転車向け保険 Bycle(バイクル)」、「自転車向け保険 Bycle
Best(バイクル ベスト)」、「ケガの保険 交通事故」、「ケガの保険 日常の事故」に関する重要な事項を説明し
ています。ご契約前に必ずお読みになり、契約申込画面に入力のうえ、入力内容に誤りがないことを確認し、お
申込みください。
```
Everything seems to be stored correctly in the database, and the tools are working as intended, but I took some pictures of some documents and realized vertical text support wasn't working and was also producing gibberish glyphs. So I dug into Tesseract, installed the `tesseract-ocr-jpn-vert` package, and got good test results with the following:

```
tesseract ~/Downloads/OCR_JP_TEST_VERTICAL.png output -l jpn_vert -c preserve_interword_spaces=1
```
You can temporarily rig the normal "Japanese" setting to this configuration in the `/etc/docspell-joex/docspell-joex.conf` config like this, if you have the proper `tesseract-ocr-jpn-vert` package and its dependencies (don't try to upload non-vertical Tesseract languages with these settings; a restart of docspell-joex is required):
```
# To convert image files to PDF files, tesseract is used. This
# also extracts the text in one go.
tesseract = {
  command = {
    program = "tesseract"
    args = [
      "{{infile}}",
      "out",
      "-l",
      "{{lang}}_vert",
      "-c",
      "preserve_interword_spaces=1",
      "pdf",
      "txt"
    ]
  }
}
```
The package specifies vertical text with `jpn_vert`, and `-c preserve_interword_spaces=1` removes pesky spaces not needed by the language in a book preview. The output is absolutely perfect and produces workable horizontal text from the vertical input with about 95% accuracy (only making minor errors for highly complex or uncommon kanji, lots of small furigana, or around line breaks). If you dig deeper into Tesseract, you might find ways to improve that for the source picture (removing alpha, perfect borders, increasing resolution, eliminating furigana below a certain pixel size, etc.), but it worked well enough for my purposes. ImageMagick can fix a lot/remove alpha if you're having issues with the source image for Tesseract. The metadata was very clean and search worked great.
So all in all that came out pretty okay, but I noticed something still inconsistent with the search...
The `tesseract` output is completely searchable, and it has no database issues as far as I can tell. But even when the metadata output from `ocrmypdf` is perfect, the full-text search still wasn't finding certain terms or words.
So even though I get the metadata and can read it, searching this text still works poorly. I tried deleting and re-uploading the file, but I was still confused why this was happening. After spending a lot of time querying different results, I found that the Japanese tokenization seems to rely on either punctuation or newlines, but this only occurred with the ocrmypdf file and not the tesseract file.
So searching for a "keyword" in Japanese is still not possible if it's loaded in from ocrmypdf I think.
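My rough understanding of the tokenization problem, sketched as a toy Python example (this simulates whitespace/punctuation splitting; it is not PostgreSQL's actual parser): Japanese is written without spaces, so an unsegmented clause ends up as one long token, and a single keyword never matches unless the text was pre-segmented beforehand (which is what a tool like ja_wakachi does).

```python
# Toy illustration of token-based full-text search on unsegmented
# Japanese text. NOT PostgreSQL's real parser, just the same idea:
# split on whitespace and punctuation, then match whole tokens.
import re

def tokenize(text):
    # Split on whitespace and common punctuation, drop empty pieces.
    return [t for t in re.split(r"[\s、。・,.]+", text) if t]

extracted = "契約概要のご説明・注意喚起情報のご説明"
tokens = tokenize(extracted)

# The keyword "契約" is part of the text, but no token equals it,
# so a token-based search finds nothing.
assert "契約" in extracted   # the substring is there
assert "契約" not in tokens  # but it is not a whole token

# Pre-segmenting the text (hand-segmented here for illustration,
# roughly what ja_wakachi produces) makes the keyword a token.
segmented = "契約 概要 の ご説明 ・ 注意 喚起 情報 の ご説明"
assert "契約" in tokenize(segmented)
```

This would also explain why searching with a whole punctuation-bounded line works while single keywords fail.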
Still, this is a lot of progress! It's possible to upload, scan horizontal, scan vertical, get good output, and search in Japanese (if you use the whole line for files on ocrmypdf, or outright for anything else).
I'm not sure if the underlying tokenization issue is ocrmypdf, pdfbox, or postgresql, but I found a potential way to at least manually search postgresql with Japanese. If you access your database directly, it seems you can install this PostgreSQL script for advanced Japanese search. I will give it a try to see how it works. I guess @eikek I am wondering if I can directly access `textsearch_ja.sql` like in this answer and run something like `SELECT ja_wakachi('分かち書きを行います。');` directly from a docspell query? I can try this solution and run it directly on my database (where is the ocrmypdf metadata stored from an attachment?).
I also think it would help if there was a default language option for Japanese (Vertical) using the above Tesseract options by default in a future version.
Thank you again @eikek for this software - it's a pleasure to work with. If I didn't see the processing log so easily, I might never have figured this much out. It wasn't hard to understand the two config files and their inputs after digging in for a few hours, and this beautiful simplicity makes Docspell much more usable than other solutions I've tried for multilinguals. I will keep working/committing time to Docspell as my DMS of choice if I can figure out more.
Hi @tenpai-git thank you for this detailed description! I'm sure this will help other people going into the same direction! It is amazing how deep you dived into this stuff!
I am wondering if I can directly access `textsearch_ja.sql` like in this answer and run something like `SELECT ja_wakachi('分かち書きを行います。');` directly from docspell query?
I don't think this works directly from docspell. The query you can type in is parsed separately. There is no way to give it raw sql.
I also think it would help if there was a default language option for Japanese (Vertical) using the above Tesseract options by default in a future version.
Yes, I think this is a great idea. There must be better configuration per language, I suppose. It just never occurred so far :) It should be possible to specify commands per language perhaps. This can be a separate issue, so it is easier to track.
The `output-type=pdf` option seems to be a good idea to use in general, if I understand correctly?
where is the ocrmypdf metadata stored from an attachment?
For a pdf, it first tries to read an existing text layer. If that has too few characters, it does OCR with tesseract. If ocrmypdf was successful, then tesseract is (usually) not tried, because the text is already present in the pdf. You can force it to always do OCR by setting `pdf-min-length` to some negative value. This text then ends up in the `attachmentmeta` table, column `content`.
I'm sure you already thought about it: another option is to try solr. It is also a beast to configure (I'd think at least for non-western stuff). I would expect solr to provide decent support for other languages using their provided plugins/modules, but otoh I never tried myself with these languages, so it's just a guess.
PS: If I missed/overlooked some question from above, please ask again 🙏🏼
@lehnerpat do you think this can be closed now that a new release has been made?
@tenpai-git I think it would be nice to create smaller issues from your findings to better support your language. Please feel free to create them - like what configuration would be needed to render the correct commands etc.
Thank you very much for following up on this!
@eikek Yes, with a new release that includes the updated pdfbox, I think this issue can be closed, since the results are likely much better already 👌
I haven't read through all the details that @tenpai-git has found out, but I think there might be some very valuable findings in there. Besides making docspell work better, these insights might also help the document-archiving community more generally. For example, some things could be ported to other PDMSs, and some other parts could be upstreamed to pdfbox/ocrmypdf (as code changes or documentation expansions).
In particular, I'm interested in ocrmypdf failing to produce a pdf/a file. Since pdf/a has some advantages for longterm storage, it would be nice to find a way to make that work even with these Japanese documents.
(I have since encountered other documents that are broken in similar ways and in other ways, so I'm also personally interested in this, even though I'm not using docspell!)
Hi @lehnerpat thanks for your feedback! I agree, there are insights that could improve other tools and/or their documentation. Since my time is very tight, I'm not going after them :). But I'm interested in improving Docspell in the long term, so it can better deal with these cases. That's why I'd like to ask you to create more concrete issues for things that we could improve here (like config options for the commands etc). It would also be possible to have a documentation page about Japanese or similar languages, or it could be a blog post that shows a setup for this (the site doesn't have that many pages anyway :)). I don't have a strong opinion, tbh.
I think I can't help much with the ocrmypdf issues/questions, they should be probably raised/asked on their space.
Hi @eikek
There is no way to give it raw sql.
Understood that it won't be possible to make those calls directly. I understand the need for the data input validation, it would probably create a vulnerability or 100. Maybe there's a way to call some of those scripts via some kind of limited functionality later. I will experiment on my own and bring it up if I find anything useful.
The output-type=pdf option seems to be a good idea to use in general, if I understand correctly?
As far as I can tell from this limited issue and sample set, I believe so. Without it, I wasn't able to process the insurance document at all for some reason and I isolated that as the cause. In my mind more files being convertible is better as long as it doesn't affect anything else. Like @lehnerpat said there may be some storage consequences.
This text then ends up in the `attachmentmeta` table, column `content`.
Thank you, and sorry to bother you with that. I will see if I can get the script working and maybe propose some kind of special search option for it later.
solr
Did consider it, but was committed to PostgreSQL on my home setup :)
I would be happy to write some documentation about this, so I guess all the possibilities include:
output-type=pdf
I will make pull requests or issues for these (3) things in some time.
Likely outside of the scope I'll work on, but for others to consider in the future:
- Make `ocrmypdf` tokenize Japanese for better search of single words or terms using PostgreSQL full-text search. Might be able to find this more easily, but I will need to learn the `ocrmypdf` project itself better to understand how to improve it.
- Make `tesseract` crop/remove alpha/darken text for better reading of Japanese documents. Requires someone pretty handy with ImageMagick, I think. I'll try to figure out some sane defaults.
- Make `textsearch_ja.sql` usable so hiragana/katakana/furigana and other scripts can be entered for Japanese search in addition to the plain kanji. I'll need to look into the tool itself and deeper into the inner workings of PostgreSQL to figure it out.

I want to thank @lehnerpat for opening this issue and @eikek for addressing it so thoroughly. I think there's enough information here for someone to get it working temporarily if they desperately need it. I will plan on working on the other issues/pull requests in the meantime, to the degree I am able.
I look forward to contributing @eikek thank you so much for your time and effort on Docspell. I am excited to use this in my infrastructure.
Intro
Hi everyone,
thank you very much for creating and working on Docspell!
I've been wanting to get started with digitally organizing my documents for a while now. I found Docspell as one solution that might work well for me, so I've started trying it out.
One thing upfront: my use case is probably a bit unusual, since I have documents in three languages (German, English, and Japanese) that I want to put into my archive/DMS. (Note: while I do have a few documents with mixed languages in the same document, we can ignore that for now and focus only on single-language documents.)
Problem summary
For some PDFs that contain Japanese text, the "extracted text" in Docspell is just some random glyphs. This is specifically about PDFs that already contain text (I'm pretty sure it's not an OCR issue). I've also noticed that this problem doesn't occur for all Japanese-text PDFs, but I don't know the cause.
Reproducing the problem on v0.40.0
Set up Docspell with docker compose, following the docker compose section of the installation manual. Since I wanted to use a release version, I deviated from the manual by downloading the docker compose file from tag v0.40.0 instead. Specifically, I did these steps:
FYI, here are the containers that were created, and the images that they use:
Open the web UI at `http://localhost:7880` and create a new collective + user using the "Sign up!" button:
- Collective: issuerepro
- User Login: issuerepro
- Password: issuerepro
Download two example documents that contain Japanese text:
Upload the documents to Docspell via the web UI: open the dashboard (`http://localhost:7880/app/dashboard`) and log in with the user issuerepro that we created above. Open the visa document `000472926.pdf` and go to "View extracted data". The data looks pretty good: some extraneous whitespace, but overall mostly the right Japanese characters. Small sample and screenshot:
Open the insurance document `standard_jyusetsu_20191201.pdf` and go to "View extracted data". The data looks pretty bad: there are some Japanese characters in there, but there are a lot of random glyphs between them. Small sample and screenshot:
Some more version / environment information
Conclusion
I hope this report contains enough information to make the issue clear and to let you (try to) reproduce it.
Please let me know if there's any other information I can contribute to diagnose this.
Based on some web research, I'm afraid this issue might actually be related to how the PDF (and the fonts in it) is encoded; possibly some fonts are not properly embedded in the insurance document. I still hope there's something we can find out about this.
At the moment, I haven't found any other eDMS software that seems to fit my needs better or that handles Japanese PDFs better. So while I'm still a bit hesitant to invest completely into Docspell, I'm willing to try to diagnose and hopefully fix or mitigate this issue :)