gbif / gbif-common

Utility classes
Apache License 2.0

Potential for encoding corruption, depending on the environment #21

Closed MattBlissett closed 3 years ago

MattBlissett commented 3 years ago

Reported in https://github.com/gbif/portal-feedback/issues/3191, but also affecting other datasets.

Mac OS sets the locale environment variable LC_CTYPE=UTF-8, which is not a recognized locale name on Linux. Linux expects a value such as en_US.UTF-8, or leaves LC_CTYPE unset and falls back to LANG.

When Java starts up on Linux with the Mac OS value LC_CTYPE=UTF-8, Charset.defaultCharset() returns US-ASCII. This causes problems wherever the default character set is used: System.out, I/O streams without a specified character set, convenience classes like FileReader and FileWriter, etc.
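To check what a given JVM actually picked up, something like the following can be run on the affected host. This is only a diagnostic sketch; the class name DefaultCharsetReport is made up for illustration and is not part of gbif-common. Note that on JDK 18 and later the default charset is UTF-8 unless file.encoding is overridden, so the result depends on the JDK version as well as the locale variables.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical diagnostic helper, not part of gbif-common.
class DefaultCharsetReport {

    /** Collects the locale variables and the charset the JVM derived from them. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("LC_CTYPE=").append(System.getenv("LC_CTYPE")).append('\n');
        sb.append("LANG=").append(System.getenv("LANG")).append('\n');
        sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
        sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
        // A writer created without an explicit charset uses the same default,
        // which is what FileWriter does internally.
        OutputStreamWriter w = new OutputStreamWriter(new ByteArrayOutputStream());
        sb.append("writerEncoding=").append(w.getEncoding());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Comparing the output of a local login with that of an SSH session from a Mac should show whether the forwarded LC_CTYPE is changing the default.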

In the case above, a FileWriter is used to output sorted DWCA data. With the mixed environment variables, the file is written as ASCII, so every non-ASCII character is replaced with "?" and the data is corrupted.
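The effect can be reproduced without touching the environment by forcing the writer's charset. This is a minimal sketch, not the gbif-common code: AsciiCorruptionDemo and roundTrip are illustrative names, and an OutputStreamWriter with an explicit US-ASCII charset stands in for a FileWriter running under an ASCII default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of gbif-common.
class AsciiCorruptionDemo {

    /** Writes text with the given charset, then reads the bytes back as UTF-8. */
    static String roundTrip(String text, Charset writeCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, writeCharset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself independent of encoding.
        String name = "S\u00E3o Paulo";
        // Unmappable characters are silently replaced, no exception is thrown.
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII));
        // With an explicit UTF-8 writer the text survives unchanged.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8));
    }
}
```

The important point is that the ASCII case fails silently: the write succeeds and the damage only shows up when someone reads the data.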

In other words, gbif-common assumes a correctly configured UTF-8 environment.

MattBlissett commented 3 years ago

The commit improves the code (removing an encoding assumption), and logs a warning if FileUtils is used where the default character set is ASCII.
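For anyone with similar code elsewhere, the general shape of such a fix might look like the sketch below. This is not the actual commit: CharsetGuard, warnIfAscii and utf8Writer are hypothetical names, and java.util.logging is used only to keep the example dependency-free.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

// Hypothetical sketch, not the code from the gbif-common commit.
class CharsetGuard {
    private static final Logger LOG = Logger.getLogger(CharsetGuard.class.getName());

    /** Logs a warning and returns true when the given default charset is US-ASCII. */
    static boolean warnIfAscii(Charset defaultCharset) {
        if (StandardCharsets.US_ASCII.equals(defaultCharset)) {
            LOG.warning("Default charset is US-ASCII; check LANG and LC_CTYPE. "
                + "Non-ASCII data may be corrupted.");
            return true;
        }
        return false;
    }

    /** Opens a writer that always encodes as UTF-8, whatever the environment says. */
    static BufferedWriter utf8Writer(Path path) throws IOException {
        return Files.newBufferedWriter(path, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        warnIfAscii(Charset.defaultCharset());
        Path out = Files.createTempFile("sorted", ".txt");
        try (BufferedWriter w = utf8Writer(out)) {
            w.write("S\u00E3o Paulo\n");
        }
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly removes the dependency on the environment for that writer; the warning is there to flag other code paths that may still rely on the default.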

I've also configured the servers so that they no longer accept locale environment variables passed in over SSH.