Closed MattBlissett closed 3 years ago
The commit improves the code (removing an encoding assumption) and logs a warning if FileUtils is used where the default character set is ASCII.
I've also stopped the servers from accepting locale environment variables sent by clients connecting over SSH.
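As a rough illustration of the warning mentioned above, the check could look something like the sketch below. This is not the actual gbif-common code: the class name `DefaultCharsetCheck`, both method names, and the use of `java.util.logging` are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

// Hypothetical helper, not part of gbif-common.
public class DefaultCharsetCheck {
  private static final Logger LOG = Logger.getLogger(DefaultCharsetCheck.class.getName());

  /** Returns true if the given charset is US-ASCII. */
  public static boolean isAsciiOnly(Charset charset) {
    return StandardCharsets.US_ASCII.equals(charset);
  }

  /** Logs a warning if the JVM default charset is US-ASCII; returns whether it warned. */
  public static boolean warnIfAsciiDefault() {
    Charset defaultCharset = Charset.defaultCharset();
    if (isAsciiOnly(defaultCharset)) {
      LOG.warning("Default charset is " + defaultCharset
          + "; non-ASCII data may be corrupted. Check LANG and LC_CTYPE.");
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    warnIfAsciiDefault();
  }
}
```

Calling `warnIfAsciiDefault()` once at startup, or before the first file is written, would be enough to surface a misconfigured environment in the logs.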
Reported in https://github.com/gbif/portal-feedback/issues/3191, but also affecting other datasets.
Mac OS sets the locale environment variable `LC_CTYPE=UTF-8`, which is not recognized on Linux. Linux would use `en_US.UTF-8` or similar, or leave it unset and use `LANG`.
When Java starts up on Linux with the Mac OS `LC_CTYPE=UTF-8`, `Charset.defaultCharset()` is `US-ASCII`. This causes problems wherever the default character set is used: `System.out`, I/O streams without a specified character set, convenience classes like `FileReader` and `FileWriter`, etc.
In the case above, a `FileWriter` is used to output sorted DWCA data. With the mixed environment variables, the file is written in ASCII and the data is corrupted.
In other words, gbif-common assumes a correctly configured UTF-8 environment.
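The failure mode, and one way to avoid depending on the environment, can be sketched as follows. This is an illustrative example rather than the code from the commit: the class `Utf8WriteDemo` and its methods are hypothetical names. It contrasts encoding through US-ASCII (which is what a default-charset `FileWriter` does in the misconfigured environment) with writing through a writer whose charset is given explicitly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of gbif-common.
public class Utf8WriteDemo {

  /** Encodes and decodes text as US-ASCII; unmappable characters become '?'. */
  public static String asciiRoundTrip(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    return new String(bytes, StandardCharsets.US_ASCII);
  }

  /** Writes text to a temporary file as UTF-8, reads it back, and deletes the file. */
  public static String roundTripUtf8(String text) {
    try {
      Path file = Files.createTempFile("utf8-demo", ".txt");
      try {
        // The charset is explicit, so LANG / LC_CTYPE have no effect here.
        try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
          writer.write(text);
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String name = "P\u00e9rez"; // "Pérez", escaped so the source file stays ASCII
    System.out.println(name.equals(roundTripUtf8(name)));  // survives intact
    System.out.println(name.equals(asciiRoundTrip(name))); // corrupted by ASCII
  }
}
```

Specifying the charset at every I/O boundary means the output no longer depends on how the JVM was launched.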