ai25395 / FMatPix

A free, portable formula OCR tool supporting LaTeX and MathML

Cyrillic characters are not recognized #1

Open GrayWolfson opened 5 hours ago

GrayWolfson commented 5 hours ago

Hi. Using the formula below as an example: https://filestore.community.support.microsoft.com/api/images/3004665f-3c2c-408d-92e1-67a1acda7a57?upload=true

Cyrillic characters are not recognized. The recognized output is:

c_{\mathrm{p}\mathrm{i}}=\frac{(\mathrm{h}_{\mathrm{cr}}^{B O A}-\mathrm{h}_{\mathrm{i}}^{B O A})}{(\mathrm{h}_{\mathrm{cr}}^{B O A}-\mathrm{h}_{1}^{B O A})_{\alpha=0}}

[image]

Ideally, the output should be:

c_{\mathrm{p}\mathrm{i}}=\frac{(\mathrm{h}_{\mathrm{ст}}^{ВОД}-\mathrm{h}_{\mathrm{i}}^{ВОД})}{(\mathrm{h}_{\mathrm{ст}}^{ВОД}-\mathrm{h}_{1}^{ВОД})_{\alpha=0}}

ai25395 commented 3 hours ago

The recognition models I use are open source and based on the transformer architecture, so they usually ship with a file such as tokenizer.json that specifies which characters the model learns to recognize. As far as I know, the tokenizer.json files of the mainstream models do not contain the Cyrillic alphabet, which is why those characters are not recognized.
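
One way to check this for a particular model is to scan its tokenizer.json for tokens containing Cyrillic characters. The following is only a minimal sketch: it assumes the file follows the Hugging Face `tokenizers` JSON layout (vocabulary under `model.vocab`), which may not match every model, and the function names `cyrillic_tokens` and `load_and_check` are hypothetical helpers, not part of FMatPix or any library.

```python
import json
import re

# Matches any character in the basic Cyrillic Unicode block (U+0400-U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_tokens(tokenizer_data):
    """Return the vocabulary tokens that contain at least one Cyrillic character.

    Assumption: the vocabulary lives under model.vocab, either as a dict
    (token -> id, as in BPE/WordPiece files) or as a list of [token, score]
    pairs (as in Unigram files).
    """
    vocab = tokenizer_data.get("model", {}).get("vocab", {})
    if isinstance(vocab, dict):
        tokens = list(vocab.keys())
    else:
        tokens = [entry[0] for entry in vocab]
    return [tok for tok in tokens if CYRILLIC.search(tok)]


def load_and_check(path):
    """Load a tokenizer.json file from `path` and list its Cyrillic tokens."""
    with open(path, encoding="utf-8") as f:
        return cyrillic_tokens(json.load(f))
```

Calling `load_and_check("tokenizer.json")` on a model's file returns an empty list when no token contains a Cyrillic letter. Note that byte-level BPE tokenizers store non-ASCII text as remapped bytes, so an empty result for such a tokenizer would not be conclusive.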