BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

Cannot deal with archives with subdirectories #49

Closed nickynicolson closed 8 years ago

nickynicolson commented 8 years ago

Example archive: http://rs.gbif.org/datasets/german_sl.zip

This is the default archive used in the GBIF DwC-A validator.

It contains the following files:

nickyn@ubuntu:~/dwca$ unzip -l german_sl.zip
Archive:  german_sl.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2009-12-11 20:41   german_sl/
     6148  2009-11-04 16:11   german_sl/.DS_Store
   725969  2009-07-15 17:33   german_sl/distribution.txt
        0  2010-01-15 11:18   __MACOSX/
        0  2010-01-15 11:18   __MACOSX/german_sl/
      184  2009-07-15 17:33   __MACOSX/german_sl/._distribution.txt
     1374  2009-12-09 12:35   german_sl/eml.xml
     3195  2009-10-28 15:24   german_sl/meta.xml
      186  2009-10-28 15:24   __MACOSX/german_sl/._meta.xml
   272992  2009-07-15 16:16   german_sl/species_info.txt
  4149979  2009-10-28 15:39   german_sl/taxa.txt
   177967  2009-07-15 13:49   german_sl/vernacular.txt
---------                     -------
  5337994                     12 files

dwca-reader unzips the archive but then fails to find meta.xml, because the file is inside a subdirectory (german_sl/) rather than at the archive root. The following error is produced:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-18-8b040c988d99> in <module>()
     13 
     14 
---> 15 with DwCAReader('german_sl.zip') as dwca:
     16     # We can now interact with the 'dwca' object
     17     print("Core type is: %s" % dwca.descriptor.core.type)

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in __init__(self, path, extensions_to_ignore)
     83         #: An :class:`descriptors.ArchiveDescriptor` instance giving access to the archive
     84         #: descriptor (``meta.xml``)
---> 85         self.descriptor = ArchiveDescriptor(self._read_additional_file('meta.xml'),
     86                                             files_to_ignore=extensions_to_ignore)
     87 

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in _read_additional_file(self, relative_path)
    163         """Read an additional file in the archive and return its content."""
    164         p = self.absolute_temporary_path(relative_path)
--> 165         return open(p).read()
    166 
    167     def _parse_metadata_file(self):

FileNotFoundError: [Errno 2] No such file or directory: '/home/nickyn/dwca/t/meta.xml'

Presumably this is a valid archive. If so, should the reader locate meta.xml and continue relative to that location?
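
To make the question concrete, here is a minimal sketch of how the descriptor could be located inside the zip before extraction. It uses only the standard library; `find_descriptor_dir` is a hypothetical helper, not part of dwca-reader, and skipping `__MACOSX/` entries is an assumption based on the listing above.

```python
import posixpath
import zipfile


def find_descriptor_dir(zip_path, descriptor_name="meta.xml"):
    """Return the in-archive directory that holds the descriptor.

    Returns "" if the descriptor is at the archive root and None if it
    is not found. Hypothetical helper, not part of dwca-reader.
    """
    with zipfile.ZipFile(zip_path) as zf:
        candidates = [
            posixpath.dirname(name)
            for name in zf.namelist()
            if posixpath.basename(name) == descriptor_name
            # Assumption: macOS resource-fork folders are never the data root.
            and not name.startswith("__MACOSX/")
        ]
    # Prefer the shallowest match; the archive root sorts first.
    candidates.sort(key=lambda d: d.count("/") if d else -1)
    return candidates[0] if candidates else None
```

For an archive laid out like german_sl.zip, this would return `german_sl`, and the data files could then be resolved relative to that directory.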

niconoe commented 8 years ago

Thanks for your report. I'm not sure if it is a valid archive or not (the standard is not always as clear as I'd like it to be), but it seems there are such archives in the wild and it shouldn't be too difficult to support, so I'll give it a try!

niconoe commented 8 years ago

I'm a little ambivalent about this issue.

To me, it looks like an error on GBIF's part to provide such a sample file for their DwC-A validator. I'd be tempted not to fix it here (or to fix only the simple single-directory case) and to report it as an issue to GBIF. What do you think, @nickynicolson?

nickynicolson commented 8 years ago

Thanks @niconoe - I agree. Re your first point: I've seen a lot of these single sub-directory archives in use - perhaps from IPT instances, but also from the Scratchpads project and emonocot. If we can jump into the subdir when only one subdir exists, that seems like a good solution.

I also agree that the sample DWCA referenced from the GBIF validator should be cleaner.

niconoe commented 8 years ago

That's good to know. I'll implement this "single subdir fix" so at least we support those common archives!
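
For illustration only, the "single subdir" rule discussed above might look roughly like the sketch below, applied to the directory the archive was extracted into. This is not the code that ended up in dwca-reader; `resolve_archive_root` and `IGNORED_DIRS` are hypothetical names. Ignoring `__MACOSX` and hidden entries is an assumption, needed because the german_sl.zip listing actually contains two top-level directories.

```python
import os

# Assumption: folders that macOS adds to zips and that never hold archive data.
IGNORED_DIRS = {"__MACOSX"}


def resolve_archive_root(extracted_dir, descriptor_name="meta.xml"):
    """Return the directory to treat as the archive root.

    If the descriptor is not at the top level and exactly one real
    subdirectory exists and contains it, descend into that subdirectory.
    Otherwise return extracted_dir unchanged.
    """
    if os.path.isfile(os.path.join(extracted_dir, descriptor_name)):
        return extracted_dir

    subdirs = [
        entry
        for entry in os.listdir(extracted_dir)
        if os.path.isdir(os.path.join(extracted_dir, entry))
        and entry not in IGNORED_DIRS
        and not entry.startswith(".")
    ]
    if len(subdirs) == 1:
        candidate = os.path.join(extracted_dir, subdirs[0])
        if os.path.isfile(os.path.join(candidate, descriptor_name)):
            return candidate

    # Ambiguous or unsupported layout: leave it to the caller to fail.
    return extracted_dir
```

Deliberately, the sketch does not recurse further or guess between several subdirectories, which matches the limited scope agreed on in this thread.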