The new SLM encodes all output files in UTF-8 (the default in Python 3). These files are byte-for-byte indistinguishable from ASCII unless they contain multi-byte characters. The multi-byte characters referred to below come from the ISO-3166 country codes. It's pretty easy for us to downcode the handful of ISO-3166 country codes that contain 2-byte characters to their 1-byte equivalents, but unless we explicitly block submission of any 2-byte or 4-byte characters in any of the text fields, the problem for readers that only handle 1-byte characters will still exist.
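To make the point above concrete, here's a minimal sketch showing that UTF-8 output is byte-identical to ASCII until a non-ASCII character appears (the country name "Türkiye" is just an illustrative example of a 2-byte character source, not necessarily one of the offending ISO-3166 entries):

```python
# Pure-ASCII text encodes to identical bytes in UTF-8 and ASCII.
ascii_name = "Australia"
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# A single accented character makes the file non-ASCII:
# "ü" (U+00FC) becomes the 2-byte UTF-8 sequence 0xC3 0xBC.
non_ascii_name = "Türkiye"
utf8_bytes = non_ascii_name.encode("utf-8")
print(utf8_bytes)  # b'T\xc3\xbcrkiye'
assert len(utf8_bytes) == len(non_ascii_name) + 1
```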
There's a decision to be made here. The first-order decision is: should we only allow 1-byte characters in any of our text data? If the answer is yes, the implementation decisions get simpler (if a little clunky, because I'll be forcing Python to do something it has evolved away from) BUT we're also locking all the other formats (GeodesyML and JSON) into the 1-byte character set. Given the I in IGS, maybe we don't want to do that?
Here's the option set as I see it:
1. Restrict all data to the 1-byte character set.
2. Provide ASCII-only renderings of the site log files. This could be an additional directory on the FTP server; we could call it log_ascii or something. We can map the obvious multi-byte characters, including all of the offending characters in the ISO-3166 codes, back to ASCII, but anything we can't map automatically we could render as an escape code.
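A downcoding pass like option 2 describes could look something like the sketch below. This is an assumption about the approach, not the SLM's actual code: NFKD normalization splits accented characters into a base letter plus combining marks, the combining marks are dropped, and whatever still isn't ASCII is rendered as a backslash escape code.

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Downcode text to ASCII; escape anything un-downcodable."""
    # NFKD splits e.g. "ü" (U+00FC) into "u" + combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the plain base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Any remaining non-ASCII character becomes a \uXXXX / \xXX escape.
    return stripped.encode("ascii", "backslashreplace").decode("ascii")

print(to_ascii("Türkiye"))   # Turkiye
print(to_ascii("Curaçao"))   # Curacao
```

Characters with no decomposition (e.g. "ø") fall through to the escape-code path, which matches the "render as an escape code" fallback described above.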
I prefer option 2 for the following reasons:
Keeps the code base simpler.
Allows unrestricted character set usage while maintaining backwards compatibility for the handful of users who are affected by this, and encourages those users to update their parsers.
The un-downcodable characters that get mapped to escape codes are not going to matter to anything an automated reader is doing.
Is more in line with how text-based processing is evolving.
Site log upload should work with either UTF-8 or ASCII with escape codes.
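The upload side could accept both encodings with a small sketch like this (again an assumption about the approach, not existing SLM code): decode as UTF-8 first, since ASCII input decodes unchanged, then un-escape any backslash sequences an ASCII-only submitter may have used.

```python
def read_upload(raw: bytes) -> str:
    """Accept either UTF-8 bytes or ASCII with \\uXXXX escape codes."""
    text = raw.decode("utf-8")  # pure-ASCII input decodes unchanged
    if text.isascii() and ("\\u" in text or "\\x" in text):
        # Un-escape \uXXXX / \xXX sequences from ASCII-only submitters.
        text = text.encode("ascii").decode("unicode_escape")
    return text

print(read_upload(b"T\xc3\xbcrkiye"))  # Türkiye (native UTF-8 bytes)
print(read_upload(b"T\\u00fcrkiye"))   # Türkiye (ASCII + escape code)
```

One caveat worth noting: a field containing a literal backslash-u sequence would be un-escaped too, so the real implementation would need a convention for when escapes apply.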