ShapeChange / ShapeChange

Processing application schemas for geographic information
https://shapechange.github.io/ShapeChange/
GNU General Public License v3.0

Encoding of temporary file for feature catalogue #409

Open heidivanparys opened 3 months ago

heidivanparys commented 3 months ago

The temporary file from which a feature catalogue is generated

  1. is apparently created with the encoding of the model, see https://github.com/ShapeChange/ShapeChange/blob/c9cbb3888b7378599375410198f1b7d9992dbfec/src/main/java/de/interactive_instruments/ShapeChange/Target/FeatureCatalogue/FeatureCatalogue.java#L479
  2. and for EA models, the model encoding is actually hardcoded to Windows-1252, see https://github.com/ShapeChange/ShapeChange/blob/c9cbb3888b7378599375410198f1b7d9992dbfec/src/main/java/de/interactive_instruments/ShapeChange/Model/EA/EADocument.java#L87

Would it be an option to always generate these temporary files with UTF-8 encoding?
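A minimal sketch of what writing such a temporary file with a fixed UTF-8 encoding could look like. The class `Utf8TempFileSketch`, the method `writeTempXml`, and the `fc_tmp_` file prefix are hypothetical names for illustration only, not the actual ShapeChange code; the assumption is simply that the temporary file is XML and that the XML declaration should agree with the writer's charset.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8TempFileSketch {

    // Hypothetical helper (not part of ShapeChange): writes XML content to a
    // temporary file using UTF-8, regardless of the encoding of the model.
    static Path writeTempXml(String body) throws IOException {
        Path tmp = Files.createTempFile("fc_tmp_", ".xml");
        try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            // The declared encoding has to match the charset of the writer.
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            w.newLine();
            w.write(body);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // \u0141\u00f3d\u017a contains characters that Windows-1252 cannot represent.
        Path tmp = writeTempXml("<name>\u0141\u00f3d\u017a</name>");
        System.out.println(Files.readString(tmp, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

With a fixed charset on the writer, the content of the temporary file no longer depends on what the model reports as its encoding.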

The hardcoded encoding no longer matches reality since the upgrade to EA 16 and its new, SQLite-based .qea format; see also https://sqlite.org/pragma.html#pragma_encoding (for the model I tested, the encoding was set to UTF-8).
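For reference, the value reported by `PRAGMA encoding` is also stored in the SQLite database header: a 4-byte big-endian integer at byte offset 56, where 1 means UTF-8, 2 means UTF-16le and 3 means UTF-16be. The following is a rough sketch for inspecting a .qea file without a JDBC driver, assuming the file is a plain SQLite 3 database; `QeaEncodingProbe` and its methods are hypothetical names, not part of ShapeChange or the EA API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class QeaEncodingProbe {

    // Interprets the first bytes of an SQLite 3 file (at least 60 are needed)
    // and returns the name of the declared text encoding.
    static String textEncodingName(byte[] header) {
        if (header.length < 60) {
            throw new IllegalArgumentException("header too short");
        }
        // The file starts with the 16-byte magic string "SQLite format 3\0".
        String magic = new String(header, 0, 16, StandardCharsets.US_ASCII);
        if (!"SQLite format 3\0".equals(magic)) {
            throw new IllegalArgumentException("not an SQLite 3 file");
        }
        // ByteBuffer is big-endian by default, matching the header layout.
        int code = ByteBuffer.wrap(header, 56, 4).getInt();
        switch (code) {
            case 1: return "UTF-8";
            case 2: return "UTF-16LE";
            case 3: return "UTF-16BE";
            default: return "unknown (" + code + ")";
        }
    }

    // Reads only the header bytes of the given file.
    static String probe(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return textEncodingName(in.readNBytes(60));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java QeaEncodingProbe.java path/to/model.qea
        System.out.println(probe(Paths.get(args[0])));
    }
}
```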

jechterhoff commented 3 months ago

I wonder if there is any way within the EA automation interface to determine the encoding ...

jechterhoff commented 3 months ago

Have not found anything in the EA API documentation. Let's see what a support request at Sparx Systems reveals (I contacted them). It is unclear whether the encoding is always the same or whether it can change (e.g. for a cloud-based model in a Postgres DB with a certain encoding). To be on the safe side, we may need to add a parameter to specify the model encoding. That could be an input parameter, although we have model loading in other places as well. Presumably, all EA models used in a single ShapeChange workflow use the same encoding.
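A rough sketch of how such a parameter could be resolved, purely as an illustration of the idea. The class `ModelEncodingParam` and the method `resolve` are hypothetical; no such parameter exists in ShapeChange according to this thread, and falling back silently (rather than logging a warning or failing) is an assumption made to keep the example short.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class ModelEncodingParam {

    // Resolves an optional, user-configured charset name. Returns the fallback
    // if the value is missing, blank, syntactically illegal, or not supported
    // by the running JVM.
    static Charset resolve(String configured, Charset fallback) {
        if (configured == null || configured.isBlank()) {
            return fallback;
        }
        try {
            return Charset.forName(configured.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Assumption: a real implementation would probably report this.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Hypothetical parameter value taken from the command line.
        String configured = args.length > 0 ? args[0] : null;
        System.out.println(resolve(configured, StandardCharsets.UTF_8));
    }
}
```

Resolving the value in one place would let every code path that loads a model share the same setting, which fits the assumption that all models in a workflow use one encoding.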

jechterhoff commented 2 months ago

We have a reply from Sparx Systems support:

The code page a given database stores strings in isn't relevant. All strings returned by the API are encoded in UTF-16.

The only time the encoding of the database should be relevant to you is the .eap file format (which you can't use on 64 bit at all) because that format does not support unicode and instead uses the Code Page for non-Unicode applications in Windows itself.
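If the API hands over every string as UTF-16, which is also how Java represents a `String` internally, the database code page only matters at the point where text is written out again. The sketch below checks whether a given string survives a given output charset; `OutputEncodingCheck` and `representable` are hypothetical names used for illustration, not part of ShapeChange or the EA API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class OutputEncodingCheck {

    // Returns true if every character of the text can be encoded in the given
    // charset without loss.
    static boolean representable(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // \u0141 (L with stroke) and \u017a (z with acute) have no code point
        // in Windows-1252.
        String name = "\u0141\u00f3d\u017a";
        System.out.println(representable(name, Charset.forName("windows-1252")));
        System.out.println(representable(name, StandardCharsets.UTF_8));
    }
}
```

A string that is not representable in the output charset is typically written with replacement characters, which is the kind of loss a fixed UTF-8 output avoids.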