International-GNSS-Service / SLM

SiteLog Manager
https://igs-slm.rtfd.org
MIT License

Backwards compat for legacy parsers that break on UTF-8? #100

Open bckohan opened 2 months ago

bckohan commented 2 months ago

The new SLM encodes all output files in UTF-8 - this is the default in Python 3+. These files are indistinguishable from ASCII unless they contain multi-byte characters. The 2-byte characters referred to below come from the ISO 3166 country names. It's pretty easy for us to downcode the handful of ISO 3166 country names that contain 2-byte characters to their 1-byte equivalents - but unless we explicitly block submission of any multi-byte characters in every text field, the problem for 1-byte-only readers will still exist.
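
To illustrate the point about ASCII and UTF-8 being indistinguishable, here is a minimal Python sketch. The sample strings and variable names are illustrative assumptions and are not taken from SLM itself.

```python
# Illustrative strings only; these are not read from SLM.
ascii_only = "Germany"
accented = "C\u00f4te d'Ivoire"  # contains U+00F4 (o with circumflex)

# Pure-ASCII text encodes to identical bytes under both encodings.
same_bytes = ascii_only.encode("utf-8") == ascii_only.encode("ascii")

# U+00F4 takes two bytes in UTF-8, so there is one more byte than characters.
utf8_bytes = accented.encode("utf-8")
extra_bytes = len(utf8_bytes) - len(accented)

# A strict ASCII reader fails on the first byte outside the 7-bit range.
try:
    utf8_bytes.decode("ascii")
    ascii_reader_ok = True
except UnicodeDecodeError:
    ascii_reader_ok = False
```

A legacy parser that assumes one byte per character would behave like the `decode("ascii")` call here: fine on most site logs, and broken only on the few that contain a non-ASCII character.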

There's a decision to be made here. The first-order decision is: should we only allow 1-byte characters in any of our text data? If the answer is yes, the implementation decisions get simpler (if a little clunky, because I'll be forcing Python to do something it has evolved not to want to do), BUT we're also locking all the other formats (GeodesyML and JSON) into the 1-byte character set. Given the I in IGS, maybe we don't want to do that?

Here's the option set as I see it:

  1. Restrict all data to the 1-byte character set.
  2. Provide ASCII-only renderings of the site log files. This could be an additional directory on the FTP server - we could call it log_ascii or something. We can map the obvious non-ASCII characters, including all of the offending characters in the ISO 3166 country names, back to ASCII, but anything we can't map automatically we could render as an escape code.
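
A rough sketch of how the ASCII rendering in option 2 might work, using only the Python standard library. The function name `to_ascii` is hypothetical, and so is the choice of Unicode decomposition plus `backslashreplace` escapes; neither is existing SLM behaviour, and a hand-maintained mapping table could be used instead.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: return an ASCII-only rendering of ``text``."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII becomes a Python-style escape code.
    return stripped.encode("ascii", errors="backslashreplace").decode("ascii")


examples = {
    name: to_ascii(name)
    for name in ("C\u00f4te d'Ivoire", "Cura\u00e7ao", "\u5317\u4eac")
}
```

Characters with a Latin base letter fall back to that letter, while text with no ASCII decomposition (the last example) comes out as `\uXXXX` escapes. One caveat with this approach is that stripping combining marks is lossy for scripts where those marks carry meaning, so it suits only the separate ASCII rendering and not the stored data.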

I prefer option 2 for the following reasons: