Tesseract tessdata finetuned to detect Foxhole's font

GICodeWarrior / fir

Foxhole Inventory Report

MIT License


Tesseract tessdata finetuned to detect Foxhole's font #8

Closed Kubuxu closed 1 year ago

Kubuxu commented 1 year ago

I've fine-tuned the Tesseract model to recognize the font used by Foxhole (Jost) with fewer errors. For example, it no longer confuses 0 and O.

If you are interested, give it a try. The integer model is much smaller and, from my tests, as good as the float (final) model.

Files: engJost-integer.traineddata.gz, engJost-final.traineddata.gz
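For anyone trying the attached files: they are gzip-compressed, and a desktop Tesseract install normally looks for an uncompressed `<name>.traineddata` file in its tessdata directory. A minimal sketch of unpacking one, assuming a local tessdata directory of your choosing (the `unpack_traineddata` helper and the paths are hypothetical, not part of fir):

```python
import gzip
import shutil
from pathlib import Path


def unpack_traineddata(archive, tessdata_dir):
    """Decompress a .traineddata.gz file into tessdata_dir and return the new path."""
    archive = Path(archive)
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)

    # "engJost-integer.traineddata.gz" -> "engJost-integer.traineddata"
    target = tessdata_dir / archive.stem

    # Stream the decompressed bytes to disk without loading the file into memory.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

With the file unpacked, the Tesseract command line should be able to pick it up with something like `tesseract image.png stdout --tessdata-dir <dir> -l engJost-integer`, where the language name is the file name without `.traineddata`.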

GICodeWarrior commented 1 year ago

Thanks! I will test this out.

What character set does this cover?

Can you share the scripts / process you used to generate these?

Kubuxu commented 1 year ago

I used https://github.com/Shreeshrii/tess5train-fonts/. It requires Tesseract 5 binaries and some workarounds to get running. The command I used to run it:

bash finetune_font.sh eng Latin eng engJost FineTune ' "Jost* Medium" "Jost*" "Jost* Light" ' ' "Jost* Medium" "Jost*" "Jost* Light" ' 0 9999 2 | tee data/logs/engJost.log

It covers https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/langdata/engImpact-eval.training_text, so I think it handles the majority of ASCII and some of the most common Unicode characters.

GICodeWarrior commented 1 year ago

Thanks!

Do you have some example screenshots where fir has returned incorrect results? I'd like to add them to my test cases.

For the training data, it would be cool to generate some based on text from the Foxhole translation files. It would be great to recognize the stockpile types in each language.

Also, Foxhole uses Renner primarily. I believe Jost was a previous font and is similar.

Kubuxu commented 1 year ago

I don't have any screenshots of fir returning wrong results. I haven't worked with fir much (yet). I was planning to use Tesseract for scanning stockpile logs, but it was making mistakes even with perfectly clean text. It turned out the stockpile logs are limited in length, which makes them useless for my purpose (logistics tracking, consumption tracking, and forecasting).

That is how I stumbled onto fir, through a dev in FMAT and an exchange officer between SPUD and 27th.

I can run the fine-tuning on Renner. If you have a list of strings you want it to detect better, send it my way and I can fine-tune on those as well.

Kubuxu commented 1 year ago

I have one example where the stockpile name detection (and thus Tesseract) makes a small mistake: it adds a space between a T and the following 1. Example screenshot: War-Win64-Shipping_2023-03-22_13-58-43. The engJost-integer model parses it correctly. It is a minor issue because the extra space is inserted reliably.
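Because the extra space shows up reliably, it could also be cleaned up after OCR. A minimal sketch, assuming the only artifact is a single space between "T" and "1" (the `fix_t1_space` helper is hypothetical and not part of fir):

```python
import re

# Assumption: the artifact is exactly one space between "T" and "1".
# Lookarounds keep both characters and remove only the space.
_T_SPACE_1 = re.compile(r"(?<=T) (?=1)")


def fix_t1_space(text):
    """Remove a spurious space between 'T' and a following '1'."""
    return _T_SPACE_1.sub("", text)
```

Note that this would also join a legitimate "T 1" in a name, so it only makes sense where names are known not to contain that sequence.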

Kubuxu commented 1 year ago

Also, Renner was renamed to Jost: https://www.dafont.com/renner.font

GICodeWarrior commented 1 year ago

Thanks, this is all helpful. Stockpile name recognition accuracy has been a recurring issue.

This is something I'd like to work on, but it will be a few weeks before I have time to dig in.

Kubuxu commented 1 year ago

I have run a better fine-tune for 27th stockpile naming (plus ASCII translations of storage depot and seaport). The result is here: engJost-final3.traineddata.gz. IDK how well it will work with the Sundial's naming scheme. Ours is 27[A-Z]{3,4}-[IOBAEF]\d\d
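The naming pattern above can be used both to sanity-check OCR output and to produce sample strings for a fine-tuning list. A rough sketch, where `is_valid_name` and `generate_samples` are hypothetical helper names and the generated names are random illustrations, not real stockpiles:

```python
import random
import re
import string

# Pattern quoted above for 27th stockpile names.
STOCKPILE_NAME = re.compile(r"27[A-Z]{3,4}-[IOBAEF]\d\d")


def is_valid_name(text):
    """Return True if the whole (stripped) string matches the naming scheme."""
    return STOCKPILE_NAME.fullmatch(text.strip()) is not None


def generate_samples(count, seed=0):
    """Generate random names that match the scheme, e.g. for a training text."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    names = []
    for _ in range(count):
        letters = "".join(rng.choices(string.ascii_uppercase, k=rng.choice((3, 4))))
        kind = rng.choice("IOBAEF")
        number = f"{rng.randrange(100):02d}"  # always two digits
        names.append(f"27{letters}-{kind}{number}")
    return names
```

An OCR result that fails `is_valid_name` could be flagged for manual review instead of being accepted silently.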

GICodeWarrior commented 1 year ago

This has been integrated and solves all currently known stockpile name recognition errors.

Thanks again!