VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Trap and remove non-printing characters in harvest #123

Closed: tucotuco closed 9 years ago

tucotuco commented 10 years ago

To avoid producing files that BigQuery treats as corrupt on load, trap and remove non-printing characters from the harvested fields. In particular, the NUL character (ASCII 0x00) makes the BigQuery load fail.
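
As a rough illustration of the kind of trap described here, the following is a minimal Python sketch, not gulo's actual harvest code (gulo itself is written with Cascalog). The function name `strip_nonprinting` is hypothetical, and the choice to keep tab, line feed, and carriage return is an assumption; a real harvest step might need to drop or escape those too, depending on the output delimiter.

```python
import re

# Assumption: "non-printing" means the C0 control characters plus DEL.
# Tab (0x09), LF (0x0A), and CR (0x0D) are deliberately kept in this sketch.
_NONPRINTING = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_nonprinting(field: str) -> str:
    """Return the field with NUL and other control characters removed."""
    return _NONPRINTING.sub("", field)


if __name__ == "__main__":
    dirty = "Gulo\x00 gulo"
    print(repr(strip_nonprinting(dirty)))
```

Applying a filter like this to each field before writing the output file would remove the 0x00 characters that break the load, at the cost of silently altering the source data.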

tucotuco commented 9 years ago

The ASCII 0x00 bytes also show up when a source file is encoded as UTF-16 but UTF-8 is expected. For now, I recommend fixing these at the source.
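
For anyone checking a source file, here is a small Python sketch of why UTF-16 input produces NUL bytes and how a byte-order mark could be used to spot it. The function name `decode_harvested` is hypothetical, and relying on the BOM is an assumption: UTF-16 files without a BOM would not be caught by this check, and UTF-32 is not handled.

```python
import codecs


def decode_harvested(raw: bytes) -> str:
    """Decode raw bytes as UTF-16 if a UTF-16 BOM is present, else as UTF-8."""
    # Assumption: the file starts with a byte-order mark when it is UTF-16.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    return raw.decode("utf-8")


if __name__ == "__main__":
    # In UTF-16, every ASCII-range character carries a 0x00 byte.
    print("Gulo".encode("utf-16-le"))
    print(decode_harvested("Gulo".encode("utf-16")))
```

Reading such bytes as UTF-8 leaves a 0x00 next to every ASCII character, which matches the load failures seen in BigQuery.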