BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

Get orphaned extension rows #68

pieterprovoost closed this 7 years ago

pieterprovoost commented 7 years ago

I need to validate the archives I'm reading, which includes checking that every extension row has a matching core row. This PR adds a method, DwCAReader.orphaned_extension_rows(), which returns, for each extension file, the core IDs that match no core row together with the indices of the orphaned extension rows that reference them:

{
    'extendedmeasurementorfact.txt': {
        u'Cruise68:Station593:EventSorbeSledge9887:Subsample16686_5': [11136],
        u'Cruise64:Station550:EventSorbeSledge9740:Subsample9878_4': [1413, 1414, 1415]
    },
    'occurrence.txt': {
        u'Cruise66:Station591:EventSorbeSledge9885:Subsample9959_2': [4928],
        u'Cruise66:Station589:EventSorbeSledge9883:Subsample9953_1': [4739],
        u'Cruise68:Station579:EventSorbeSledge9810:Subsample17412_3': [5977]
    }
}

There may be a more elegant implementation.
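
For readers who want to see the idea behind the check, here is a minimal, library-independent sketch. It is not the code from this PR: the function name find_orphaned_rows, its parameters, and the sample IDs are all hypothetical, and it assumes each extension file has already been reduced to a list holding the core ID of every row, in row order.

```python
from collections import defaultdict


def find_orphaned_rows(core_ids, extensions):
    """Group extension rows whose core ID matches no core row.

    core_ids   -- iterable of the IDs found in the core file
    extensions -- dict mapping an extension file name to the list of core IDs
                  referenced by its rows (position in the list = row index)

    Returns {file_name: {missing_core_id: [row indices]}}, the same shape as
    the example output above. Hypothetical helper, not part of the library.
    """
    known = set(core_ids)  # set membership keeps each lookup O(1)
    orphans = {}
    for file_name, row_core_ids in extensions.items():
        per_file = defaultdict(list)
        for index, core_id in enumerate(row_core_ids):
            if core_id not in known:
                per_file[core_id].append(index)
        orphans[file_name] = dict(per_file)
    return orphans


# Made-up sample data: "event3" never appears in the core file.
core_ids = ["event1", "event2"]
extensions = {
    "occurrence.txt": ["event1", "event3", "event2", "event3"],
    "extendedmeasurementorfact.txt": ["event1", "event2"],
}
result = find_orphaned_rows(core_ids, extensions)
print(result)
```

The sketch keeps every core ID in memory and scans every extension row once, so on large archives a check of this kind can be slow and memory-hungry.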

niconoe commented 7 years ago

Thanks a lot @pieterprovoost, glad that my work is useful!

I'm considering merging this PR, but I'll need at least some documentation (API docs, a "performance" warning stating that this method can be time- and memory-consuming, ...), some unit tests, and an entry in the changelog.
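
As a rough illustration of what that request could look like in practice, the sketch below pairs a docstring carrying a performance warning with two pytest-style tests. It is an assumption-laden stand-in, not the documentation or tests that ended up in the repository: orphans_from_index is a hypothetical helper that works on a prebuilt {core_id: [row indices]} index for one extension file, and the IDs are made up.

```python
def orphans_from_index(core_ids, coreid_index):
    """Return the entries of ``coreid_index`` whose key is not a core ID.

    ``coreid_index`` maps each core ID referenced by one extension file to
    the list of row indices that reference it (hypothetical input format).

    .. warning::
        All core IDs are collected into a set before filtering, so time and
        memory use grow with the size of the archive.
    """
    known = set(core_ids)
    return {
        core_id: list(rows)  # copy so callers cannot mutate the index
        for core_id, rows in coreid_index.items()
        if core_id not in known
    }


# pytest-style tests: pytest collects functions named test_*.
def test_reports_rows_without_matching_core():
    index = {"event1": [0], "event9": [1, 2]}
    assert orphans_from_index(["event1"], index) == {"event9": [1, 2]}


def test_returns_empty_dict_when_every_row_matches():
    index = {"event1": [0, 1]}
    assert orphans_from_index(["event1", "event2"], index) == {}
```

Saved in a file whose name starts with test_, these functions would be picked up by running pytest in the same directory.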

Would you be interested in working on that? Otherwise, I'll give it a try but I can't promise when ;)

Thanks again

pieterprovoost commented 7 years ago

I'll see what I can do, may take a while as well :)

niconoe commented 7 years ago

Manually merged, thanks a lot!