fcheng00 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Request to make Tesseract Unicode compatible #1257

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

A lot of effort has already gone into introducing UTF-8 into Tesseract, but this is currently limited to the recognized text and the training data.

What is missing is support for Unicode paths in all filenames.
On Windows, all modern applications use Unicode paths; ANSI code pages date from the 1990s and are no longer in common use.
So when a Japanese user wants to open an image in a folder with a Japanese name, Tesseract will not find the file.
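
For illustration, a minimal sketch of the failure mode (the Japanese folder name below is a hypothetical example, not from Tesseract): on Windows, fopen() interprets a char* path in the current ANSI code page, so the UTF-8 bytes of the folder name usually do not resolve to the real directory, while _wfopen() with the equivalent UTF-16 path succeeds.

        // Hypothetical demonstration, not part of the proposed patch.
        #include <stdio.h>
        #include <windows.h>

        int main(void)
        {
            // "C:\画像\test.png" once as UTF-8 bytes, once as UTF-16.
            const char*    s8_Path  = "C:\\\xE7\x94\xBB\xE5\x83\x8F\\test.png";
            const wchar_t* u16_Path = L"C:\\\u753B\u50CF\\test.png";

            FILE* f_Ansi = fopen(s8_Path, "rb");     // usually fails: bytes read as ANSI
            FILE* f_Wide = _wfopen(u16_Path, L"rb"); // succeeds if the file exists

            printf("fopen: %p  _wfopen: %p\n", (void*)f_Ansi, (void*)f_Wide);
            return 0;
        }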

The least invasive change would be to replace every fopen() call in Tesseract 
with a new function fopen_utf8() that accepts UTF-8 encoded paths. This 
function would be enabled by a compiler switch on Windows and replaced with 
something similar on Mac/Linux.

        // Requires <stdio.h> and <windows.h>.
        static FILE* fopen_utf8(const char* s8_UtfPath, const wchar_t* u16_Mode)
        {
            const int BUF_SIZE = 1000;
            wchar_t u16_Path[BUF_SIZE + 1];

            // Convert the UTF-8 path to UTF-16. With a source length of -1 the
            // conversion also processes the terminating zero.
            int s32_Len = MultiByteToWideChar(CP_UTF8, 0, s8_UtfPath, -1, u16_Path, BUF_SIZE);
            if (s32_Len == 0)
                return NULL; // invalid UTF-8 or path longer than BUF_SIZE
            u16_Path[s32_Len] = 0;

            FILE* f_File = _wfopen(u16_Path, u16_Mode);

            if (f_File != NULL)
            {
                // Skip the UTF-8 BOM (EF BB BF) if present, otherwise rewind.
                if (fgetc(f_File) != 0xEF ||
                    fgetc(f_File) != 0xBB ||
                    fgetc(f_File) != 0xBF)
                {
                    fseek(f_File, 0, SEEK_SET);
                }
            }

            return f_File;
        }

Additionally, this function automatically skips the UTF-8 BOM at the start of 
the file if there is one. This gives the user the freedom to pass files with or 
without a UTF-8 BOM.
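
For the compiler switch mentioned above, one possible shape is sketched here (an assumption about how the patch could look, keeping the same signature): on Windows the UTF-8 to UTF-16 conversion above is needed, while on Linux/Mac the C library treats the path as an opaque byte string, so a UTF-8 path can be passed straight to fopen() and only the wide mode string has to be narrowed.

        #ifdef _WIN32
            // Windows: use the fopen_utf8() shown above (UTF-8 -> UTF-16 -> _wfopen).
        #else
            #include <stdio.h>
            #include <stdlib.h>

            // Linux/Mac: fopen() already accepts the UTF-8 byte string; only the
            // wide mode string needs to be converted to a narrow one.
            static FILE* fopen_utf8(const char* s8_UtfPath, const wchar_t* u16_Mode)
            {
                char s8_Mode[8];
                size_t len = wcstombs(s8_Mode, u16_Mode, sizeof(s8_Mode) - 1);
                if (len == (size_t)-1)
                    return NULL;
                s8_Mode[len] = 0;
                return fopen(s8_UtfPath, s8_Mode);
            }
        #endif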

Original issue reported on code.google.com by smaragds...@gmail.com on 16 Jul 2014 at 5:32

GoogleCodeExporter commented 9 years ago
Windows is yuk!
Looks like Linux just works with UTF-8 filenames, and only Windows has to 
convert to wchar_t.

I will include this with a planned change to switch to TFile everywhere. That 
will reduce the number of places where fopen has to be changed.

Original comment by theraysm...@gmail.com on 12 Sep 2014 at 12:59

GoogleCodeExporter commented 9 years ago
When Unicode came up, Microsoft made a big effort to convert the entire OS to 
Unicode, while Linux only implemented a cheap workaround. On Linux you pass 
Unicode paths as UTF-8. This looks fine at first glance, but if you work more 
with file paths on Linux you will notice the drawbacks:

Imagine you write a function on Linux to read the contents of an entire Unicode 
folder tree and then want to sort the files alphabetically. You can't sort 
UTF-8 paths: you have to convert each and every filename to UTF-16, sort the 
files, and then convert back to UTF-8. This is inefficient. Microsoft clearly did 
the better work here. In a C++ project for Windows you store all paths in a 
wstring and that's it!
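
For what it's worth, a minimal sketch of the roundtrip described above, using the (now deprecated, but still available) std::wstring_convert facility; the function name and structure are illustrative only, and the sketch merely shows the extra conversion steps the comment objects to.

        #include <algorithm>
        #include <codecvt>
        #include <locale>
        #include <string>
        #include <vector>

        // UTF-8 names -> UTF-16, sort, convert back to UTF-8.
        std::vector<std::string> SortUtf8Names(const std::vector<std::string>& u8_Names)
        {
            std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

            std::vector<std::u16string> u16_Names;
            for (const std::string& name : u8_Names)
                u16_Names.push_back(conv.from_bytes(name));

            std::sort(u16_Names.begin(), u16_Names.end());

            std::vector<std::string> sorted;
            for (const std::u16string& name : u16_Names)
                sorted.push_back(conv.to_bytes(name));
            return sorted;
        }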

Original comment by smaragds...@gmail.com on 13 Sep 2014 at 3:31