Windows: encoding issues when opening Metadata

niconoe commented 7 years ago

(issue reported by @DimEvil, problematic archive for tests: dwca-modirisk-monitoring-2-v3.5.zip)

When reading EML.xml, the file encoding is currently not specified, so Python uses a system-dependent codec. This is generally UTF-8 on *nix, but it is CP1252 on Windows (Python 3, installed from Anaconda), which triggers:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-28-7d0ed62699a1> in <module>()
      1 from dwca.read import DwCAReader
      2 
----> 3 with DwCAReader('dwca-modirisk-monitoring-2-v3.5.zip') as dwca:
      4    print("Core data file is: {}".format(dwca.descriptor.core.file_location)) # => 'occurrence.txt'
      5    core_df = dwca.pd_read('occurrence.txt', parse_dates=True)

C:\Users\dimitri_brosens\AppData\Local\Continuum\Anaconda3\lib\site-packages\dwca\read.py in __init__(self, path, extensions_to_ignore)
    100         #: A :class:`xml.etree.ElementTree.Element` instance containing the (scientific) metadata
    101         #: of the archive, or `None` if the archive has no metadata.
--> 102         self.metadata = self._parse_metadata_file()
    103 
    104         #: If the archive contains source-level metadata (typically, GBIF downloads), this is a dict such as::

C:\Users\dimitri_brosens\AppData\Local\Continuum\Anaconda3\lib\site-packages\dwca\read.py in _parse_metadata_file(self)
    389 
    390             try:
--> 391                 return self._parse_xml_included_file(filename)
    392             except IOError as exc:
    393                 if exc.errno == ENOENT:  # File not found

C:\Users\dimitri_brosens\AppData\Local\Continuum\Anaconda3\lib\site-packages\dwca\read.py in _parse_xml_included_file(self, relative_path)
    404     def _parse_xml_included_file(self, relative_path):
    405         """Load, parse and returns (as ElementTree.Element) XML file located at relative_path."""
--> 406         return ET.fromstring(self.open_included_file(relative_path).read())
    407 
    408     def _unzip_or_untar(self):

C:\Users\dimitri_brosens\AppData\Local\Continuum\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6919: character maps to <undefined>
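
For reference, a minimal sketch that reproduces the same class of error on any platform, without the archive. It assumes the offending byte comes from a UTF-8 encoded character whose encoding ends in 0x9d (U+201D, the right double quotation mark, is used here purely as an illustration; the actual character in EML.xml was not checked):

```python
# Illustrative input: U+201D encodes to the bytes E2 80 9D in UTF-8.
# Which character is really present in EML.xml is an assumption.
raw = "\u201d".encode("utf-8")

# Decoding with the right codec round-trips without error.
decoded_utf8 = raw.decode("utf-8")

# Decoding with CP1252 (the default picked on the Windows setup above) fails,
# because byte 0x9d is not mapped to any character in that code page.
error_message = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The byte position in the message differs from the one in the traceback, since the input here is only three bytes long.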

Next steps to fix:

[ ] Check if the DwCA standard specifies the encoding of such files
[ ] If so, use it explicitly when calling open
[ ] Otherwise, use "https://pypi.python.org/pypi/file-magic/0.3.0" or similar, to detect the encoding before opening the file.
[ ] Write a regression test and make sure it works, also on Windows

niconoe commented 7 years ago

After discussion, it appears that, since the metadata is an XML file, its encoding should be declared in an XML declaration at the start of the file, and should default to UTF-8 if nothing is specified.

I added two test cases (one with explicit windows-1252 encoding, and one for implicit UTF-8) and slightly changed the code so the XML parser manages this by itself.
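
A rough sketch of the idea, not the actual change made in the library: if the XML parser is given raw bytes rather than already-decoded text, it can honor the XML declaration itself and fall back to UTF-8 when there is none. The sample documents below are made up for illustration and mirror the two test cases described above:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents (not taken from the problematic archive).
# Case 1: explicit windows-1252 declaration, bytes encoded accordingly.
explicit_cp1252 = (
    '<?xml version="1.0" encoding="windows-1252"?><title>café</title>'
).encode("cp1252")

# Case 2: no declaration, so the parser assumes UTF-8.
implicit_utf8 = "<title>café</title>".encode("utf-8")

# Passing bytes lets the parser pick the codec; no system default is involved.
explicit_text = ET.fromstring(explicit_cp1252).text
implicit_text = ET.fromstring(implicit_utf8).text

print(explicit_text, implicit_text)
```

In practice this would mean reading the metadata file in binary mode (for example `open(path, "rb")`) before handing its content to `ET.fromstring`, so the platform's default text codec never gets a chance to decode it.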

niconoe commented 7 years ago

Fix confirmed to work on Windows by @DimEvil. Closing.

BelgianBiodiversityPlatform / python-dwca-reader

Windows: encoding issues when opening Metadata #73