Character set regression: Windows 1250/Latin-2/ISO-8859-2 missing

bkil commented 2 years ago

We would like to continue editing files encoded in Windows-1250/Latin-2/ISO-8859-2. This has not been possible since the character set detector was replaced: such files are now detected as Windows-1252, and accented characters appear incorrectly:

https://github.com/billthefarmer/editor/commit/4898bdc08453f8b9b7f178aa491973739a618772
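
To illustrate the symptom, here is a minimal sketch (not taken from the editor's code) that decodes the same two bytes with both charsets. The class name `MisdetectionDemo` and the helper `decode` are made up for this example. The bytes 0xF5 and 0xFB are the Hungarian letters ő and ű in ISO-8859-2 and Windows-1250, but õ and û in Windows-1252.

```java
import java.nio.charset.Charset;

public class MisdetectionDemo {
    // Decode raw bytes using the charset with the given name.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xF5 and 0xFB map to U+0151 and U+0171 in ISO-8859-2,
        // but to U+00F5 and U+00FB in windows-1252.
        byte[] bytes = {(byte) 0xF5, (byte) 0xFB};
        System.out.println(decode(bytes, "ISO-8859-2"));
        System.out.println(decode(bytes, "windows-1252"));
    }
}
```

What the console actually shows depends on its own output encoding, so the comparison is best done on the returned strings.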

Here are the supported sets in ICU:

At a minimum, it should be possible to change the character set manually without restriction. At present, a character set can only be selected manually if automatic detection is also implemented for it:

In the future, it would be ideal to use both detectors, perhaps combining their predictions and/or allowing the use of any character set that either one can detect.
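
One way the two detectors might be combined is behind a small adapter interface. The sketch below is only an assumption about how that could look: `CharsetGuesser`, `CombinedDetection`, `detect` and `isUsable` are hypothetical names, and neither library's real API is shown. Each real detector would have to be wrapped so that it returns a charset name, or nothing.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Optional;

public class CombinedDetection {
    // Hypothetical adapter: each real detector would be wrapped to fit this shape.
    public interface CharsetGuesser {
        Optional<String> guess(byte[] data);
    }

    // Ask each guesser in order and return the first name the runtime can decode.
    public static Optional<Charset> detect(byte[] data, List<CharsetGuesser> guessers) {
        for (CharsetGuesser guesser : guessers) {
            Optional<String> name = guesser.guess(data);
            if (name.isPresent() && isUsable(name.get())) {
                return Optional.of(Charset.forName(name.get()));
            }
        }
        return Optional.empty();
    }

    // Charset.isSupported throws on syntactically illegal names, so guard it.
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

This takes the first usable answer rather than weighing the two predictions against each other; a confidence-based merge would need scores from both detectors.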

billthefarmer commented 2 years ago

I stopped using the ICU charset detector because it was unreliable, and changed to the Mozilla-based detector. This appeared to work correctly for the charsets I tested. It would be difficult to combine the two because they have different APIs. The Charset class in Android provides many more charsets than either detector supports. I note that Windows 1250/Latin-2/ISO-8859-2 encoding is not a standard charset as defined in StandardCharsets.
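
For reference, the difference between the `StandardCharsets` constants and what `Charset` can look up by name can be checked with a short sketch like the following. The class `CharsetLookup` and the method `isStandard` are made up for illustration; whether a particular name is supported depends on the runtime, so the program reports it instead of assuming it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CharsetLookup {
    // The six charsets that StandardCharsets exposes as constants.
    static final List<Charset> STANDARD = Arrays.asList(
        StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1,
        StandardCharsets.UTF_8, StandardCharsets.UTF_16,
        StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE);

    // True if the name resolves to one of the StandardCharsets constants.
    public static boolean isStandard(String name) {
        return Charset.isSupported(name) && STANDARD.contains(Charset.forName(name));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO-8859-2", "windows-1250"}) {
            System.out.println(name
                + " standard=" + isStandard(name)
                + " supported=" + Charset.isSupported(name));
        }
    }
}
```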

One solution to this problem would be to remove detection altogether and use the charsets provided by the Charset class. However, this is likely to trip up some users and lead to yet more issues.

bkil commented 2 years ago

I would suggest a new option in the settings:

Character set of opened files:
- Automatically detected
- Default to: ... (this submenu would contain everything that availableCharsets() returns)
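
A rough sketch of how such a setting could be resolved when a file is opened, assuming the preference is stored as a string. `CharsetPreference`, `AUTO`, `menuEntries` and `resolve` are hypothetical names, not part of the editor.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharsetPreference {
    // Hypothetical preference value meaning "use the detector".
    public static final String AUTO = "auto";

    // Entries for the "Default to" submenu: every charset the runtime offers.
    public static List<String> menuEntries() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    // Pick the charset for opening a file from the stored preference.
    public static Charset resolve(String preference, Charset detected) {
        if (preference == null || AUTO.equals(preference)) {
            return detected;
        }
        try {
            return Charset.forName(preference);
        } catch (IllegalArgumentException e) {
            // Unknown or malformed name: fall back to the detector's answer.
            return detected;
        }
    }
}
```

With the preference set to "ISO-8859-2", `resolve` would ignore the detector's result; with "auto" it would pass the detected charset through unchanged.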

Right now, every document on the device is encoded with the same character set, so it would be tedious for us to switch back to it every time a file is opened if automatic detection did not work.

So I think it would be an acceptable bug if the present manual override menu contained only the intersected set, as it is implemented right now, because we would not be using it.

bkil commented 2 years ago

After comparing charset support in the old and the new library, the regression concerns the following character sets:

https://en.wikipedia.org/wiki/ISO/IEC_8859-2 Latin-2 (Albanian, Bosnian, Croatian, Czech, German, Hungarian, Polish, Serbian Latin, Slovak, Slovene, Upper Sorbian, Lower Sorbian, Turkmen)
https://en.wikipedia.org/wiki/ISO/IEC_8859-6 Latin/Arabic
https://en.wikipedia.org/wiki/ISO/IEC_8859-9 Turkish
https://en.wikipedia.org/wiki/Windows-1256 Arabic (Persian, Urdu)

billthefarmer / editor

Character set regression: Windows 1250/Latin-2/ISO-8859-2 missing #169