AtlasOfLivingAustralia / ala-downloads

Data downloads
https://downloads.ala.org.au

Darwin Core export: character encoding issues #19

Open Mesibov opened 6 years ago

Mesibov commented 6 years ago

This issue goes beyond Darwin Core export, because ALA does not check for and correct the errors that arise when text is passed through various encodings before it reaches ALA. The result is replacement characters, question marks and gibberish strings where there were once non-English author names or latitude/longitude strings. One of the ca. 500,000-record sets I downloaded on 3 August 2017 (UTF-8 encoded) contained the following non-standard control characters:

vertical tab
control (octal 302 200)
private use 2 (octal 302 222)
set transmit state (octal 302 223)
start of string (octal 302 230)
control (octal 302 231)
single character introducer (octal 302 232)
application program command (octal 302 237)

Two of these characters are booby traps because they interrupt string processing and the processing program waits for a string-termination character that never appears.
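
For anyone who wants to check a download for characters like these, here is a minimal Python sketch. It is not how ALA processes records; the function names `find_control_chars` and `octal_bytes` and the sample string are made up for illustration, and it assumes that tab, line feed and carriage return are legitimate in the export and should not be reported.

```python
import unicodedata

# Assumption: these three control characters are legitimate in a delimited text export.
ALLOWED = {"\t", "\n", "\r"}


def find_control_chars(text):
    """Return {character: count} for every C0/C1 control character in text,
    other than the allowed ones. Unicode category "Cc" covers
    U+0000-U+001F, U+007F and U+0080-U+009F."""
    counts = {}
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED:
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def octal_bytes(ch):
    """Format the UTF-8 bytes of a character as octal, e.g. '302 230'."""
    return " ".join(f"{b:03o}" for b in ch.encode("utf-8"))


# Hypothetical sample line containing a start of string, an application
# program command and a vertical tab.
sample = "Smith\u0098 1901\u009f\tTasmania\x0b\n"

for ch, n in sorted(find_control_chars(sample).items()):
    print(f"U+{ord(ch):04X} (octal {octal_bytes(ch)}): {n}")
```

For a real download, the same function could be run line by line over a file opened with `encoding="utf-8"`, so that each hit can be traced back to a record.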