OpenVicProject / OpenVic-Dataloader

Dataloader submodule for OpenVic that is responsible for parsing both Paradox Victoria 2 data files and custom OpenVic data files.
MIT License

Convert non-UTF8 encodings to UTF8 #43

Spartan322 closed 4 months ago

Spartan322 commented 8 months ago

Loading v2script files currently does not convert their encoding to a common standard, so text can display incorrectly on some systems. v2script loading needs to produce a single standard encoding so that the file loading pipeline gives consistent results. We also have to support ASCII, Windows-1252, Windows-1251, and UTF-8 input, so it makes the most sense to convert non-UTF-8 encodings to UTF-8. That means detecting each particular encoding and then providing a kind of conversion database to UTF-8. Since for now we only plan to support Windows-1252 and Windows-1251 among the encodings that are not UTF-8 compatible, it is likely easier and cheaper to manage the conversion ourselves than to look for a third-party library. However, if the expectation is to support many more encodings rather than flatly rejecting them on detection, it may become necessary to pull more from hsivonen/chardetng, or instead to integrate unicode/icu (see icu4c).
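
As a rough illustration of the "conversion database" idea for a single-byte code page, here is a minimal sketch for Windows-1252. The names `kCp1252High`, `append_utf8`, and `windows1252_to_utf8` are hypothetical and are not taken from the OpenVic-Dataloader source. Mapping the five unassigned bytes to C1 control code points follows the WHATWG Encoding Standard and is an assumption about the desired behaviour, not something decided in this issue.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Code points for Windows-1252 bytes 0x80..0x9F. The five bytes the code page
// leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the matching
// C1 control code points (assumption: same choice as the WHATWG Encoding Standard).
constexpr std::array<std::uint16_t, 32> kCp1252High = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Appends one code point as UTF-8. Only 1- to 3-byte sequences are handled,
// which is enough because every Windows-1252 character is below U+10000.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Converts Windows-1252 bytes to UTF-8 by simple per-byte replacement.
inline std::string windows1252_to_utf8(std::string_view input) {
    std::string out;
    out.reserve(input.size());
    for (char ch : input) {
        const auto byte = static_cast<unsigned char>(ch);
        std::uint32_t cp = byte;  // 0x00..0x7F and 0xA0..0xFF map to themselves
        if (byte >= 0x80 && byte <= 0x9F) {
            cp = kCp1252High[byte - 0x80];
        }
        append_utf8(out, cp);
    }
    return out;
}
```

ASCII input passes through unchanged, so the same function also covers plain ASCII files. Windows-1251 would need its own table covering 0x80..0xFF, since its upper half does not line up with Unicode code points the way 0xA0..0xFF does in Windows-1252.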


Spartan322 commented 4 months ago

With a reduced C++ version of hsivonen/chardetng implemented for #46, some encodings turn out to require more complex conversion strategies than simple character replacement. Our current conversion implementation is wholly incapable of converting Chinese or Japanese characters (or similar encodings). Thankfully these encodings seem to be unexpected and unlikely to work for Victoria 2 regardless, so the chance of encountering them is effectively nil. Other alphabetic encodings are still trivial to detect and convert based on hsivonen/chardetng, but adding more is low priority, as OpenVic is unlikely to need to support any other encodings.
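
One cheap step that can run before any legacy code page is guessed is to check whether the bytes are already well-formed UTF-8, so that only files failing the check go on to detection. The sketch below is a hypothetical helper named `is_valid_utf8`. It is not part of chardetng or of the OpenVic-Dataloader code, and it only answers "is this UTF-8?", not "which legacy encoding is this?".

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Returns true if `s` is well-formed UTF-8: no stray continuation bytes,
// no truncated sequences, no overlong forms, no surrogates, nothing above U+10FFFF.
inline bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {  // ASCII byte
            ++i;
            continue;
        }
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {         // 2-byte lead (0xC0/0xC1 are always overlong)
            len = 2;
            cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {  // 3-byte lead
            len = 3;
            cp = b0 & 0x0F;
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {  // 4-byte lead
            len = 4;
            cp = b0 & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (s.size() - i < len) {
            return false;  // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) {
                return false;  // expected a continuation byte (10xxxxxx)
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;                       // overlong 3-byte form
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;  // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                 // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Text in Windows-1252 or Windows-1251 that contains any non-ASCII letters will almost always fail this check, because a high byte followed by an ASCII byte or by another lead byte is not a valid UTF-8 sequence. Pure ASCII passes, which is fine since ASCII is already valid UTF-8.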