BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

DwCAReader: line truncated at UTF-8 EOL char #20

Closed niconoe closed 10 years ago

niconoe commented 10 years ago

When iterating over lines, DwCAReader moves on to the next line prematurely when it encounters the Unicode NEXT LINE (NEL) character, U+0085 (charbase.com/0085-unicode-next-line-nel). The issue is similar to: http://stackoverflow.com/questions/16227114/utf-8-files-read-in-python-will-line-break-at-character-x85.

Given the description of this character, the behavior does make sense. However, since the EOL character is specified in meta.xml, we decided that it makes sense (and makes DwCAReader more resilient) to ignore U+0085 in this case and break lines only on the declared terminator.
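To illustrate the difference, here is a minimal sketch (not the actual DwCAReader fix). It assumes the data file's terminator is `\n`, as might be declared in meta.xml; the helper name `split_lines` and the sample data are hypothetical. Python's `str.splitlines()` treats U+0085 as a line boundary, whereas splitting on the declared terminator alone keeps the record intact.

```python
def split_lines(text, terminator="\n"):
    """Split text only on the declared terminator (hypothetical helper).

    Other Unicode line-break characters, such as U+0085 (NEL),
    are left untouched inside the resulting lines.
    """
    lines = text.split(terminator)
    # A trailing terminator produces an empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# Two tab-separated records; the first one contains a NEL character.
sample = "1\tfoo\u0085bar\n2\tbaz\n"

# str.splitlines() also breaks on U+0085, so the first record is cut in two.
naive = sample.splitlines()

# Splitting on the declared terminator keeps both records whole.
strict = split_lines(sample, "\n")

print(len(naive), len(strict))
```

With this sample, the naive approach yields three lines, while splitting on the declared terminator yields the expected two.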

The issue was discovered while playing with a sample export from the new GBIF data portal. The "issue" has also been fixed on their side, so the portal will probably not generate such exports in the future.