Closed GrayWolfson closed 1 month ago
The recognition models I use are open source and based on the transformer architecture, so they usually ship with a file such as tokenizer.json that specifies which characters the model learns to recognize. As far as I know, the tokenizer.json files of the mainstream models do not contain the Cyrillic alphabet, which is why those characters are not recognized.
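To check this for a particular model, something like the following could be used. It is only a minimal sketch: it assumes the file follows the common Hugging Face `tokenizers` layout, with the vocabulary under `model.vocab` as either a `{token: id}` mapping or a list of `[token, score]` pairs. The function names are made up for illustration, and the path in the usage note is a placeholder.

```python
import json
import re

# Matches any character in the basic Cyrillic Unicode block (U+0400 to U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def load_tokenizer_config(path: str) -> dict:
    """Parse a tokenizer.json file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def vocab_tokens(config: dict) -> list[str]:
    """Return the token strings, assuming the vocabulary lives under model.vocab."""
    vocab = config.get("model", {}).get("vocab", {})
    if isinstance(vocab, dict):
        # Assumed BPE / WordPiece style: {token: id}
        return list(vocab)
    # Assumed Unigram style: [[token, score], ...]
    return [entry[0] for entry in vocab]


def cyrillic_tokens(config: dict) -> list[str]:
    """Return every vocabulary token containing at least one Cyrillic character."""
    return [tok for tok in vocab_tokens(config) if CYRILLIC.search(tok)]
```

Calling `cyrillic_tokens(load_tokenizer_config("tokenizer.json"))` and getting an empty list would suggest the model has no way to emit Cyrillic characters. One caveat: byte-level BPE tokenizers store non-ASCII text as remapped byte symbols, so this literal check would report nothing for them even if Cyrillic is representable.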
Hi. Using the formula below as an example: https://filestore.community.support.microsoft.com/api/images/3004665f-3c2c-408d-92e1-67a1acda7a57?upload=true
Cyrillic characters are not recognized.
c_{\mathrm{p}\mathrm{i}}=\frac{(\mathrm{h}_{\mathrm{cr}}^{B O A}-\mathrm{h}_{\mathrm{i}}^{B O A})}{(\mathrm{h}_{\mathrm{cr}}^{B O A}-\mathrm{h}_{1}^{B O A})_{\alpha=0}}
Ideally there should be:
c_{\mathrm{p}\mathrm{i}}=\frac{(\mathrm{h}_{\mathrm{ст}}^{ВОД}-\mathrm{h}_{\mathrm{i}}^{ВОД})}{(\mathrm{h}_{\mathrm{ст}}^{ВОД}-\mathrm{h}_{1}^{ВОД})_{\alpha=0}}