Anders429 / simfile

Apache License 2.0
Text Encoding #35

Open Anders429 opened 2 years ago

Anders429 commented 2 years ago

UTF-8 seems to be the de facto standard encoding for simfiles, but I haven't found any definitive specification that says so. Some of the older, more obscure formats may use other encodings. I am fairly sure that .sul-based formats use something other than UTF-8, but I have yet to confirm that.

One artifact of interest is a comment in the SM5 .dwi loader code stating that there is no definitive spec for .dwi files, and that some files found are ISO-8859-1. Luckily, ISO-8859-1 is easy to convert to UTF-8, since each byte maps directly to the Unicode code point with the same value. If DWI files are not always UTF-8, it is likely that the same is true of MSD files, although this again has not been confirmed.
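To illustrate why the ISO-8859-1 case is easy, here is a minimal sketch using only the standard library. The function name `latin1_to_utf8` is hypothetical and not part of this crate; it relies on the fact that every ISO-8859-1 byte corresponds to the Unicode code point of the same numeric value.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
///
/// Hypothetical helper for illustration only. Every byte value 0x00..=0xFF
/// is a valid ISO-8859-1 character, so this conversion cannot fail.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    // `char::from(u8)` yields the Unicode scalar value U+0000..=U+00FF,
    // which is exactly the ISO-8859-1 mapping.
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

For example, calling `latin1_to_utf8(b"#TITLE:Caf\xE9;")` produces the string `#TITLE:Café;`, with the single byte `0xE9` becoming the two-byte UTF-8 sequence for `é`.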

For now, everything is assumed to be UTF-8. Unfortunately, there is no way to determine with certainty from the file alone which encoding it uses, so the only way we can support additional encodings is to compile a list of encodings per file format. Ideally, this library will read and write files in their original encodings.
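A rough sketch of what a per-format encoding list could look like is below. All names here (`Encoding`, `Format`, `fallback_for`, `decode`, `encode`) are hypothetical and not this crate's API, and the only table entry is the unconfirmed assumption that DWI may fall back to ISO-8859-1. The sketch tries strict UTF-8 first, falls back to the format's listed encoding, and remembers which encoding was used so the file can be written back the same way.

```rust
/// Encodings this sketch can handle with the standard library alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Utf8,
    Latin1, // ISO-8859-1
}

/// Hypothetical subset of simfile formats, for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Sm,
    Dwi,
    Msd,
}

/// Assumed fallback encoding per format. Only the DWI entry is suggested by
/// the thread; MSD is unconfirmed, so it has no fallback here.
fn fallback_for(format: Format) -> Option<Encoding> {
    match format {
        Format::Dwi => Some(Encoding::Latin1),
        Format::Sm | Format::Msd => None,
    }
}

/// Decode `bytes`, returning the text and the encoding that was used.
fn decode(bytes: &[u8], format: Format) -> Option<(String, Encoding)> {
    // Strict UTF-8 validation comes first, since UTF-8 is the default assumption.
    if let Ok(text) = std::str::from_utf8(bytes) {
        return Some((text.to_owned(), Encoding::Utf8));
    }
    match fallback_for(format)? {
        Encoding::Latin1 => {
            let text = bytes.iter().map(|&b| char::from(b)).collect();
            Some((text, Encoding::Latin1))
        }
        // UTF-8 already failed above, so it cannot serve as a fallback.
        Encoding::Utf8 => None,
    }
}

/// Re-encode `text` in its original encoding, or `None` if a character
/// cannot be represented in that encoding.
fn encode(text: &str, encoding: Encoding) -> Option<Vec<u8>> {
    match encoding {
        Encoding::Utf8 => Some(text.as_bytes().to_vec()),
        Encoding::Latin1 => text
            .chars()
            .map(|c| {
                let n = c as u32;
                if n <= 0xFF { Some(n as u8) } else { None }
            })
            .collect(),
    }
}
```

Returning the detected `Encoding` alongside the text is one way to support writing a file back in its original encoding; a real implementation would likely need more encodings than the standard library can decode on its own.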

Anders429 commented 2 years ago

According to Chrome's charset guess, a lot of these .msd files are encoded as Shift_JIS. Chrome appears to figure this out using some kind of statistical charset detection.
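As a much cruder stand-in for statistical detection, a structural check can at least tell whether a byte sequence is well-formed Shift_JIS. The sketch below is not what Chrome does, and the function name `looks_like_shift_jis` is hypothetical. It only checks lead and trail byte ranges and does not verify that each two-byte pair maps to an assigned character, so a `true` result means "plausible", not "confirmed".

```rust
/// Return `true` if `bytes` is structurally well-formed Shift_JIS.
///
/// Hypothetical heuristic for illustration: it validates byte ranges only.
fn looks_like_shift_jis(bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // Single-byte characters: the ASCII range and half-width katakana.
            0x00..=0x7F | 0xA1..=0xDF => i += 1,
            // Lead byte of a two-byte character; a valid trail byte must follow.
            0x81..=0x9F | 0xE0..=0xFC => match bytes.get(i + 1).copied() {
                Some(0x40..=0x7E) | Some(0x80..=0xFC) => i += 2,
                _ => return false,
            },
            // 0x80, 0xA0, and 0xFD..=0xFF are not valid in this sketch.
            _ => return false,
        }
    }
    true
}
```

Pure ASCII input also passes this check, so it is only useful for files that have already failed UTF-8 validation, and it cannot distinguish Shift_JIS from other encodings that happen to share the same byte patterns.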