Improve encoding detection

EasyRPG / liblcf

Library to handle RPG Maker 2000/2003 and EasyRPG projects

https://easyrpg.org

MIT License

Improve encoding detection #435

Closed Ghabry closed 2 years ago

Ghabry commented 2 years ago

The first commit is not strictly needed for this, but it makes it easier to replace StringView with std::string_view (C++17) after 0.7. This is API-compatible; consumers need no changes.

The second commit changes the encoding API from streams to a database handle. Because we no longer operate on global objects, the database checked by Player was always empty. This fixes it (Player loads the DB and passes it in). I also noticed that, for whatever reason, language detection is better when the System tab comes first (maybe filenames are better than terms because they are usually unaffected by translations?).
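As a rough illustration of the second commit's idea, the sketch below passes the database in explicitly instead of reading a global, and collects System strings before terms. All type, field, and function names here (`Database`, `System`, `Terms`, `BuildDetectionSample`) are hypothetical stand-ins, not liblcf's actual API.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical stand-ins for illustration only; NOT liblcf's real types.
struct System {
    std::string title_name;
    std::string boat_name;
};

struct Terms {
    std::string menu_save;
    std::string menu_quit;
};

struct Database {
    System system;
    Terms terms;
};

// Builds the text sample that would be handed to an encoding detector.
// The database is an explicit argument (no global state), and the System
// strings are appended before the terms.
std::string BuildDetectionSample(const Database& db) {
    std::string sample;
    for (const std::string* s : {&db.system.title_name, &db.system.boat_name,
                                 &db.terms.menu_save, &db.terms.menu_quit}) {
        if (!s->empty()) {
            sample += *s;
            sample += ' ';  // separator between collected strings
        }
    }
    return sample;
}
```

With an explicit handle, a caller such as Player can load the database once and pass the same object to the detection code, so the detector never sees an empty default-constructed database.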

Ghabry commented 2 years ago

That's a good question. ICU seems to detect UTF-8 in that case. This may not be ideal; usually you want Western European in that case...

Ghabry commented 2 years ago

I have now checked the ICU source code: they use n-grams (splitting the string into sequences of N characters) for language detection (an approach that works surprisingly well, by the way), so removing ASCII strings actually makes the result worse because it distorts the distribution. I will remove the filter and test again.
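To make the n-gram idea concrete, here is a minimal sketch that counts byte n-grams in a string. It is not ICU's implementation; the function name `CountNgrams` and the choice of overlapping byte windows are assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Counts every run of n consecutive bytes in text (windows overlap).
// Illustrative only; ICU's real detector is more involved.
std::map<std::string, int> CountNgrams(const std::string& text, std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || text.size() < n) {
        return counts;  // nothing to count
    }
    for (std::size_t i = 0; i + n <= text.size(); ++i) {
        ++counts[text.substr(i, n)];
    }
    return counts;
}
```

A detector built on such counts compares the observed distribution against per-language statistics, which is why dropping a whole class of strings from the input can shift the result.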

(For the ZIP archive encoding detection, where I took this from, it still makes sense though, as all files are known beforehand, so ASCII implies UTF-8.)
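The ZIP case can be sketched as a simple check over the full list of names: pure ASCII bytes decode identically under UTF-8, so if every name is ASCII, treating the archive as UTF-8 is safe. The helper names `IsAscii` and `AllNamesAscii` are hypothetical, not taken from the actual ZIP code.

```cpp
#include <string>
#include <vector>

// True if every byte of s is in the 7-bit ASCII range.
bool IsAscii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}

// True if all names are pure ASCII (also true for an empty list).
// In that case the names can be decoded as UTF-8 without any detection.
bool AllNamesAscii(const std::vector<std::string>& names) {
    for (const std::string& name : names) {
        if (!IsAscii(name)) {
            return false;
        }
    }
    return true;
}
```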

Ghabry commented 2 years ago

Rechecked this: using the System tab before terms is good enough. The ASCII filter was nonsense :)

fdelapena commented 2 years ago

Related (fixes?): #169

Ghabry commented 2 years ago

It is still read twice. Avoiding that is not possible until the save data also uses DB String. Maybe 0.7.1 ;)

carstene1ns commented 2 years ago

lcftrans is broken now :P

Ghabry commented 2 years ago

Yeah, everything that does encoding detection needs to be updated for the new API. Will provide a fix soon.