clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Apache License 2.0

UnicodeEncodeError: 'charmap' codec can't encode characters in position 578-694: character maps to <undefined> #429

Open ryntml opened 1 month ago

ryntml commented 1 month ago

I am trying to train a model on Ottoman Turkish, which is written in a script that mixes the Arabic and Persian alphabets. I created all the datasets, but as soon as I run train.py I get the following error:

[Image: Screenshot 2024-10-04 212521]

A small example from labels.txt:

[Image: Screenshot 2024-10-04 213535]

Even though I use UTF-8 encoding, I still get the error.
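For context, a 'charmap' error usually means Python fell back to a legacy code page (for example cp1252 on Windows) because a file was opened without an explicit `encoding`. The sketch below reproduces that behavior in isolation. It is only an illustration: the sample label and the helper names `can_encode` and `write_and_read_utf8` are made up for this example, and it does not show which `open()` call in train.py actually raises the error.

```python
import os
import tempfile

# Hypothetical label line (file name, tab, Arabic-script text); not from the real labels.txt.
sample = "img_0001.png\t\u0646\u0648\u0631"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_and_read_utf8(text, path):
    """Write and re-read `text`, passing encoding='utf-8' explicitly both times."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# cp1252 is a 'charmap' codec with no Arabic letters, so encoding fails.
legacy_ok = can_encode(sample, "cp1252")
# UTF-8 can represent every Unicode character.
utf8_ok = can_encode(sample, "utf-8")

with tempfile.TemporaryDirectory() as tmp:
    round_trip = write_and_read_utf8(sample, os.path.join(tmp, "labels.txt"))

print(legacy_ok, utf8_ok, round_trip == sample)
```

If this is the cause, saving labels.txt as UTF-8 is not enough on its own: every `open()` that reads or writes these characters would also need `encoding="utf-8"`, or Python could be started in UTF-8 mode (for example with the `PYTHONUTF8=1` environment variable).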

I also have a question about the characters:

In this language, as in Arabic, a letter is written differently at the beginning, middle, and end of a word, so I listed every form of each character. For example, I added 3 forms of the letter Noon. Could this cause a problem? Does anyone know? Thank you.
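One way to check how the three forms are stored is to print their code points. The sketch below assumes the forms were entered as separate Unicode presentation-form characters (U+FEE6 to U+FEE8); that is a guess, since the actual character list is only visible in the screenshot. The names `noon_forms` and `describe` are illustrative only.

```python
import unicodedata

# Assumed input: initial, medial and final presentation forms of Noon.
noon_forms = ["\uFEE7", "\uFEE8", "\uFEE6"]


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


for code_point, name in describe(noon_forms):
    print(code_point, name)

# NFKC normalization folds presentation forms back to the base letter U+0646.
normalized = {unicodedata.normalize("NFKC", c) for c in noon_forms}
print(describe(sorted(normalized)))
```

If the labels use only the base letter U+0646 and rely on the font for contextual shaping, the three extra entries would never match anything in the labels. Whether to normalize the labels or keep the positional forms as separate classes is a modeling choice this sketch does not settle.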