BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

Windows line endings read as data #48

Closed nickynicolson closed 8 years ago

nickynicolson commented 8 years ago

The data are read OK, but because the line ending '\r' is included as part of the data value, no extension rows are found for the rows of the core data file.

Data: {'http://rs.tdwg.org/dwc/terms/taxonID': 'urn:ipni.org:name:77126806-1\r'}
----------------------------------------------------------------------------^
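
To illustrate why the extension lookup fails, here is a minimal sketch in plain Python. It is not python-dwca-reader's actual code: the `extension_rows` dictionary, its `note` field and the `extension_rows_for` helper are hypothetical, and the sketch assumes extension rows are matched to core rows by exact string comparison of the ID.

```python
# Hypothetical data: a core row whose ID kept the stray '\r', and
# extension rows keyed by the clean ID.
core_id = "urn:ipni.org:name:77126806-1\r"
extension_rows = {
    "urn:ipni.org:name:77126806-1": [{"note": "example extension row"}],
}


def extension_rows_for(row_id, rows_by_id):
    """Return the extension rows whose core ID exactly equals row_id."""
    return rows_by_id.get(row_id, [])


# The trailing '\r' makes the two IDs different strings, so nothing matches.
print(extension_rows_for(core_id, extension_rows))

# With the '\r' removed, the same lookup finds the extension row.
print(extension_rows_for(core_id.rstrip("\r"), extension_rows))
```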
niconoe commented 8 years ago

Thanks for your report, @nickynicolson!

Do you have a failing archive at hand so I can test/fix from it?

niconoe commented 8 years ago

Ok, I received the test archive and it is clearly invalid:

The data file contains a single data column, followed by \r\n. Given it's a single-column file, that can be interpreted either as fields terminated by \r and lines terminated by \n, or as lines terminated by \r\n.

However, the metafile mentions fieldsTerminatedBy="\t" linesTerminatedBy="\n", so the \r character is not removed.

If we fix the metafile to match one of the scenarios above (fieldsTerminatedBy="\r" linesTerminatedBy="\n" or fieldsTerminatedBy="\t" linesTerminatedBy="\r\n"), the data is read correctly.
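
The effect of the three terminator combinations can be sketched in plain Python. This is only an illustration, not how python-dwca-reader parses files: `split_rows` is a hypothetical helper, and the sample row reuses the taxonID from the report.

```python
def split_rows(raw, fields_terminated_by, lines_terminated_by):
    """Split raw text into rows of fields using the given terminators."""
    lines = raw.split(lines_terminated_by)
    # A trailing line terminator leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.split(fields_terminated_by) for line in lines]


# One single-column row ending in Windows-style \r\n.
raw = "urn:ipni.org:name:77126806-1\r\n"

# Terminators as declared in the failing metafile: the '\r' stays in the value.
as_declared = split_rows(raw, "\t", "\n")

# Fix 1: treat '\r' as the field terminator (this leaves an empty trailing field).
field_is_cr = split_rows(raw, "\r", "\n")

# Fix 2: treat '\r\n' as the line terminator.
line_is_crlf = split_rows(raw, "\t", "\r\n")

for rows in (as_declared, field_is_cr, line_is_crlf):
    print(repr(rows[0][0]))
```

In both fixed variants the first field is the clean identifier; only the declared combination keeps the `\r` inside the value.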

I'd be tempted to say "wontfix", since I can't see a way to support such invalid archives without breaking other stuff (dropping \r at the end of fields when the metafile doesn't ask for it doesn't seem wise).

@nickynicolson, do you agree?