CR: interpret ICMT as UTF-8 if possible, not ISO-8859-1/Windows-1252

mirabilos commented 4 years ago

The SoundFont specification requires the various INFO chunks to be in 7-bit ASCII.

In my soundfonts, I bow to that in all fields but ICMT, where I wish to use things like the copyright sign U+00A9 ©. I use UTF-8 to do so, because that’s the current standard superseding ASCII while being fully compatible with it.

In Polyphone, this is rendered as mojibake Â© because Polyphone interprets this as either ISO-8859-1 or codepage 1252 (which is a superset of the former).
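The mojibake can be reproduced outside Polyphone. The following is a minimal Python sketch for illustration only (Polyphone itself is not written in Python, and this is not its code): it encodes © as UTF-8 and then decodes the result with a single-byte encoding, which is the misinterpretation described above.

```python
# "©" is U+00A9; its UTF-8 encoding is the two octets C2 A9.
utf8_bytes = "©".encode("utf-8")

# A single-byte encoding maps every octet to one character,
# so the two octets come out as two separate characters.
as_cp1252 = utf8_bytes.decode("cp1252")
as_latin1 = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(), as_cp1252, as_latin1)
```

Both decodings give the same two characters here, because ISO-8859-1 and codepage 1252 agree on the octets C2 and A9.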

This means that Polyphone does not strictly limit the characters used in these chunks to ASCII either, but uses a legacy 8-bit encoding (which is not capable of representing e.g. the names of Japanese contributors to soundfonts).

The change I’m requesting is as follows: if the ICMT chunk of a soundfont parses as UTF-8 with no errors (no invalid octets, no incomplete sequences, no nōn-minimal encodings, all chars ≤ U-001FFFFF, etc.), then Polyphone should treat it as UTF-8. This includes the case where the ICMT chunk is empty or pure ASCII, i.e. it also applies to new soundfonts. Only if it does not parse cleanly as UTF-8 should Polyphone interpret, display and save it as codepage 1252.
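The requested rule could be sketched as follows in Python. This is an illustration under stated assumptions, not Polyphone's implementation: `decode_icmt` is a hypothetical helper name, the chunk payload is assumed to arrive as raw bytes with trailing NUL padding, and Python's strict UTF-8 decoder (which rejects invalid octets, truncated sequences, overlong encodings, surrogates and anything above U+10FFFF) stands in for the validity check.

```python
def decode_icmt(raw: bytes) -> str:
    """Decode an ICMT payload: UTF-8 if it parses cleanly, else codepage 1252.

    Hypothetical helper for illustration; not part of Polyphone.
    """
    # Assumption: the payload carries trailing NUL terminator/padding bytes.
    raw = raw.rstrip(b"\x00")
    try:
        # Strict decoding fails on any malformed sequence. Empty and
        # pure-ASCII payloads pass, so they count as UTF-8 as well.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few octets (e.g. 0x81) undefined;
        # they are replaced with U+FFFD here instead of raising.
        return raw.decode("cp1252", errors="replace")


print(decode_icmt("© 2020".encode("utf-8")))  # valid UTF-8
print(decode_icmt(b"\xa9 2020"))              # lone 0xA9, falls back to cp1252
```

A lone 0xA9 octet is a continuation byte without a lead byte, so it fails the UTF-8 check and is read as codepage 1252 instead.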

It might also make sense either to restrict all other chunks to 7-bit ASCII (given that other programs may be picky about what they accept), or to apply the same encoding rule to them and warn, either upon editing the field or upon saving, that they contain nōn-ASCII octets. This would be the second half of my CR; I don’t presume to request which one, just that one of the two be implemented.

davy7125 commented 2 years ago

Specific process for the ICMT data:

when reading it, it is considered to be UTF-8 first, Latin1 otherwise
when writing, it is considered as Latin1 if possible, UTF-8 otherwise
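The two rules above can be sketched in Python as a read/write pair. This is only an illustration of the described behaviour, not the actual Polyphone code; `read_icmt` and `write_icmt` are hypothetical names.

```python
def read_icmt(raw: bytes) -> str:
    """Reading: try UTF-8 first, fall back to Latin1 otherwise."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin1 maps every octet to a character, so this cannot fail.
        return raw.decode("latin-1")


def write_icmt(text: str) -> bytes:
    """Writing: use Latin1 if every character fits, UTF-8 otherwise."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return text.encode("utf-8")


# "©" fits in Latin1 and is written as the single octet A9.
print(write_icmt("© 2020"))
# A Japanese name does not fit in Latin1 and is written as UTF-8.
print(read_icmt(write_icmt("山田")))
```

One consequence of this sketch worth noting: text written as Latin1 whose octets happen to form valid UTF-8 is read back differently. For example, the two-character string "Ã©" is written as the octets C3 A9, which then read back as "é". Such strings are rare in practice, but the round trip is not lossless for them.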

I will also remove the special characters in all other fields. Accents will still be possible, according to the Latin1 character set.

mirabilos commented 2 years ago

Hi Davy,

> Specific process for the ICMT data:
>
> when reading it, it is considered to be UTF-8 first, Latin1 otherwise
>
> when writing, it is considered as Latin1 if possible, UTF-8 otherwise

thanks, that sounds sensible.

> I will also remove the special characters in all other fields. Accents will still be possible, according to the Latin1 character set.

Hm. The standard says ASCII there. (Well, it does for all fields, but ICMT is the least problematic one I think.) Maybe warn if nōn-ASCII characters (including latin1 accented etc) are present?
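Such a warning could be driven by a check like the following Python sketch. It is an illustration only: `fields_with_non_ascii` is a hypothetical helper, the INFO fields are assumed to be available as an already decoded name-to-text mapping, and ICMT is exempted by default because it is handled separately.

```python
def fields_with_non_ascii(info: dict[str, str], exempt=("ICMT",)) -> list[str]:
    """Return the names of INFO fields whose text is not pure 7-bit ASCII.

    Hypothetical helper for illustration; not part of Polyphone.
    """
    return [
        name
        for name, value in info.items()
        if name not in exempt and not value.isascii()
    ]


info = {
    "INAM": "Grand Piano",
    "IENG": "José",          # Latin1 accent, but not ASCII
    "ICOP": "(c) 2020",
    "ICMT": "© 2020 山田",   # exempt from the check
}
for name in fields_with_non_ascii(info):
    print(f"warning: field {name} contains non-ASCII characters")
```

The check only reports fields; whether to warn on editing or on saving is left to the caller.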

bye, //mirabilos

davy7125 commented 2 years ago

This is only my opinion, but I would prefer to keep Latin1, which is an extension of the ASCII character set, and wait to see whether anyone complains; there is no need to add a warning for now. With Viena, the different accents are displayed correctly, and it would be nice to test with Swami as well.

davy7125 / polyphone

CR: interpret ICMT as UTF-8 if possible, not ISO-8859-1/Windows-1252 #121