ImperialCollegeLondon / safedata_validator

Python tools to validate and publish datasets using the safedata metadata format.
https://safedata-validator.readthedocs.io/
MIT License

Exception with GBIF 2016-07-25 database #74

Closed davidorme closed 1 year ago

davidorme commented 1 year ago

The 2016 GBIF taxon backbone data differs from more recent simple backbone dumps in that it uses empty strings rather than the postgres default \N for null values. This is a problem for the taxon hierarchy keys (genus_key etc.) because the import function only converts \N to SQLite null values. When a given taxon is processed, the keys for the more nested taxonomic levels that do not apply to it therefore come in as empty strings rather than None. The higher taxon validation then includes entries like ["species", ""] for those inapplicable levels instead of filtering them out, and the id lookup raises an Exception.
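A minimal sketch of the failure mode described above. The function names, the rank list and the key values are hypothetical and are not taken from safedata_validator. Only the `\N` versus empty-string behaviour comes from the description.

```python
# Hypothetical illustration: names and values are assumptions, not the
# actual safedata_validator code.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def convert_nulls(row):
    """Mimic an import step that maps only the postgres null marker to None."""
    return {
        field: (None if value == r"\N" else value)
        for field, value in row.items()
    }


def parent_hierarchy(row):
    """Collect [rank, key] pairs, skipping only keys that are None."""
    return [
        [rank, row[f"{rank}_key"]]
        for rank in RANKS
        if row[f"{rank}_key"] is not None
    ]


# A genus-level taxon with made-up keys: the species key does not apply.
genus_keys = {f"{rank}_key": str(i + 1) for i, rank in enumerate(RANKS[:-1])}
recent_row = {**genus_keys, "species_key": r"\N"}  # recent dump style
old_row = {**genus_keys, "species_key": ""}  # 2016 dump style

recent_hierarchy = parent_hierarchy(convert_nulls(recent_row))
old_hierarchy = parent_hierarchy(convert_nulls(old_row))

print(recent_hierarchy[-1])  # the genus entry is the last level kept
print(old_hierarchy[-1])  # the empty species key slips through
```

With the 2016-style row, the final entry is a species level with an empty key, which is the kind of entry a later id lookup cannot resolve.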

We could fix this by adding a special case to the GBIF database building code. It isn't ok to substitute None for every empty string when building GBIF backbone databases, because the substitution should only apply to the taxon key fields and not to genuine string fields. So we would have to detect that the 2016 dataset is being built and apply the updates to a named set of fields.
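For comparison, the special-case approach could look roughly like the sketch below. This is not what the PR does. The `KEY_FIELDS` set, the `normalise_row` function and the example row are assumptions made for illustration.

```python
# Hypothetical sketch of the special-case alternative: empty strings become
# None only for a named set of key fields. All names here are assumptions.
KEY_FIELDS = {
    "parent_key",
    "kingdom_key",
    "phylum_key",
    "class_key",
    "order_key",
    "family_key",
    "genus_key",
    "species_key",
}


def normalise_row(row, empty_is_null_fields=KEY_FIELDS):
    """Map the postgres null marker to None everywhere, and map empty
    strings to None only in the named key fields."""
    cleaned = {}
    for field, value in row.items():
        if value == r"\N":
            cleaned[field] = None
        elif value == "" and field in empty_is_null_fields:
            cleaned[field] = None
        else:
            cleaned[field] = value
    return cleaned


# Illustrative 2016-style row: an empty key field and an empty string field.
row_2016 = {"genus_key": "6", "species_key": "", "authorship": ""}
cleaned_row = normalise_row(row_2016)
print(cleaned_row)
```

The empty `authorship` value is left alone because it is a genuine string field, while the empty `species_key` becomes None. The cost is that the build code has to know which dataset version it is handling and which fields to treat this way.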

I've solved this more simply here by treating an empty string, as well as None, as marking an inapplicable taxon key. The PR also adds a devtool directory with an example script for debugging an entry point function, including a mechanism for passing command line arguments in to argparse.
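The idea behind the fix can be sketched as a filter that accepts either marker. The function name and the example keys are hypothetical, and the actual change in the PR may be structured differently.

```python
# Hypothetical sketch: treat both None and "" as "no taxon at this rank".
def applicable_levels(taxon_keys):
    """Return [rank, key] pairs, dropping ranks whose key is None or ''."""
    return [
        [rank, key]
        for rank, key in taxon_keys.items()
        if key not in (None, "")
    ]


# Made-up keys for a genus-level taxon in the two dump styles.
keys_2016 = {"family": "5", "genus": "6", "species": ""}
keys_recent = {"family": "5", "genus": "6", "species": None}

levels_2016 = applicable_levels(keys_2016)
levels_recent = applicable_levels(keys_recent)
print(levels_2016)
print(levels_recent)
```

Both dump styles now give the same list of levels, so the check works without the database build needing to know which backbone version it was given.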

codecov-commenter commented 1 year ago

Codecov Report

Merging #74 (7a78564) into release/3.0.0 (5702281) will not change coverage. The diff coverage is 100.00%.

@@              Coverage Diff               @@
##           release/3.0.0      #74   +/-   ##
==============================================
  Coverage          69.00%   69.00%           
==============================================
  Files                 12       12           
  Lines               3639     3639           
==============================================
  Hits                2511     2511           
  Misses              1128     1128           
| Impacted Files | Coverage Δ |
|---|---|
| safedata_validator/taxa.py | 87.65% <ø> (ø) |
| safedata_validator/field.py | 93.48% <100.00%> (ø) |
