Kubuxu closed 1 year ago
Thanks! I will test this out.
What character set does this cover?
Can you share the scripts / process you used to generate these?
I used https://github.com/Shreeshrii/tess5train-fonts/
It requires Tesseract 5 binaries and some workarounds to get it working.
Command I used to run it:
bash finetune_font.sh eng Latin eng engJost FineTune ' "Jost* Medium" "Jost*" "Jost* Light" ' ' "Jost* Medium" "Jost*" "Jost* Light" ' 0 9999 2 | tee data/logs/engJost.log
It covers https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/langdata/engImpact-eval.training_text, so I think it handles the majority of ASCII and some of the most common Unicode characters.
Thanks!
Do you have some example screenshots where fir has returned incorrect results? I'd like to add them to my test cases.
For the training data, it would be cool to generate some based on text from the Foxhole translation files. It would be great to recognize the stockpile types in each language.
Also, Foxhole uses Renner primarily. I believe Jost was a previous font and is similar.
I don't have any screenshots of fir returning wrong results. I haven't worked with fir much (yet). I was planning to use Tesseract for scanning stockpile logs, but it was messing up, even with perfectly clean text. The stockpile logs turned out to be limited in length, which makes them useless for my purpose (logistics tracking, consumption tracking and forecasting).
This is how I stumbled onto fir, through a dev in FMAT and an exchange officer between SPUD and 27th.
I can run the fine-tuning on Renner. If you have a list of strings you want it to detect better, send it my way. I can fine-tune it on that in addition.
I have one example where the stockpile name detection (and thus Tesseract) makes a small mistake: it adds a space between a T and the following 1. Example screenshot:
The engJost-integer model parses it correctly. It is a minor issue because the extra space is inserted reliably.
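Since the extra space shows up reliably, it could also be cleaned up after OCR. Here is a minimal Python sketch of that idea. The helper name `collapse_letter_digit_space` is made up for illustration and is not something fir does as far as I know. It assumes the only unwanted case is a single space between an uppercase letter and a digit, so it would also merge a legitimate name like "DEPOT A 1" into "DEPOT A1".

```python
import re

# Hypothetical post-processing step (not part of fir or Tesseract):
# a single space that sits directly between an uppercase letter and a digit.
_LETTER_SPACE_DIGIT = re.compile(r"(?<=[A-Z]) (?=[0-9])")


def collapse_letter_digit_space(name: str) -> str:
    """Remove one space between an uppercase letter and a digit, e.g. 'T 1' -> 'T1'."""
    return _LETTER_SPACE_DIGIT.sub("", name)
```

Names without that letter-space-digit pattern pass through unchanged, so it should be safe to run on every recognized name as long as the assumption above holds.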
Also, Renner was renamed to Jost: https://www.dafont.com/renner.font
Thanks, this is all helpful. Stockpile name recognition accuracy has been a recurring issue.
This is something I'd like to work on, but it will be a few weeks before I have time to dig in.
I have run a better fine-tune for 27th stockpile naming (plus ASCII translations of storage depot and seaport). The result is here:
engJost-final3.traineddata.gz
IDK how well it will work with the Sundial's naming scheme. Ours is 27[A-Z]{3,4}-[IOBAEF]\d\d
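In case it helps with test cases, the 27th pattern can be checked directly in Python. This is just a sketch: the function name `is_27th_stockpile_name` is made up, and I'm assuming the whole recognized string should match the pattern with nothing else around it apart from surrounding whitespace.

```python
import re

# The 27th naming scheme quoted above. re.ASCII keeps \d limited to 0-9,
# so digits from other scripts are not accepted.
STOCKPILE_NAME = re.compile(r"27[A-Z]{3,4}-[IOBAEF]\d\d", re.ASCII)


def is_27th_stockpile_name(text: str) -> bool:
    """Return True if the whole string (ignoring surrounding whitespace) matches the scheme."""
    return STOCKPILE_NAME.fullmatch(text.strip()) is not None
```

A check like this could flag OCR output that does not fit the scheme, for instance when a 0 is read as O in the numeric part.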
This has been integrated and solves all currently known stockpile name recognition errors.
Thanks again!
I've fine-tuned the Tesseract model to better recognize the font used by Foxhole (Jost), with fewer errors. For example, it no longer confuses 0 and O.
If you are interested, give it a try. The integer model is much smaller and, from my tests, as good as the float (final) model. engJost-integer.traineddata.gz engJost-final.traineddata.gz
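For trying the files with a local Tesseract install, here is a rough Python sketch. It assumes the attachments are plain gzip files and that the `tesseract` CLI is on your PATH. The helper names and the `tessdata_dir` location are made up for illustration, and fir itself may load models differently.

```python
import gzip
import shutil
from pathlib import Path


def unpack_traineddata(gz_path, tessdata_dir):
    """Decompress e.g. engJost-integer.traineddata.gz into tessdata_dir."""
    gz_path = Path(gz_path)
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    # Dropping the final ".gz" suffix leaves "<name>.traineddata".
    target = tessdata_dir / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def tesseract_command(image_path, traineddata_path):
    """Build a tesseract CLI call that uses the unpacked model and prints to stdout."""
    traineddata_path = Path(traineddata_path)
    return [
        "tesseract", str(image_path), "stdout",
        "--tessdata-dir", str(traineddata_path.parent),
        # The language name is the file name without ".traineddata".
        "-l", traineddata_path.stem,
    ]
```

The returned list can be passed to `subprocess.run` to compare the integer and float models on the same screenshot.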