gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Data validator #542

Closed muttcg closed 2 years ago

muttcg commented 3 years ago

TODO: Description

jhnwllr commented 3 years ago

Wish list

Here is my current thinking and wishes for the new data validator:

Backend prioritized wishlist:

  1. Output downloadable report of issues found.
  2. Archives of any size would be nice (not just 100 MB).
  3. Somehow runnable locally as well as on GBIF would be a bonus. (I assume this is not possible because of the reliance on external services such as the geocoder, etc.)

UI prioritized wishlist:

  1. Previous validations available on user pages.
  2. Should give users issues by order of importance.
  3. What, Where, When flags given priority.
  4. Generate a map of record positions.
  5. Suggest possible fixes for the issues found.

Additional issue flags :

  1. Centroid flagging (medium)
  2. Gridded dataset flag (medium)
  3. Any zero coordinate flagging (easy)
  4. Default coordinate uncertainty in meters (easy)
  5. is_cluster flag (?)
  6. is_outlier flag (hard)
  7. is_sensitive_species (hard)
  8. no_higher_taxonomy flag (easy)
  9. null_uncertainty_in_meters (easy)
  10. taxon_outlier_in_dataset (hard)

I assume these additional issues could be handled separately, since the validator will mirror pipelines processing. So we just need to add a feature to pipelines and it will be in the data validator...
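To make the three "easy" flags above more concrete, here is a minimal sketch of what the checks could look like. It is not the pipelines implementation: the function name `extra_flags`, the flag strings, and the exact rule behind each flag are assumptions made for illustration. Records are assumed to be plain dicts keyed by Darwin Core term names, with coordinates already parsed to numbers.

```python
# Hypothetical sketch of the "easy" wish-list flags; not the pipelines code.
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family")


def extra_flags(record):
    """Return the illustrative flags that apply to one occurrence record."""
    flags = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")

    # Assumed rule: flag when either coordinate is exactly zero.
    if lat == 0 or lon == 0:
        flags.append("any_zero_coordinate")

    # Assumed rule: coordinates are present but no uncertainty was supplied.
    has_coordinates = lat is not None and lon is not None
    if has_coordinates and record.get("coordinateUncertaintyInMeters") is None:
        flags.append("null_uncertainty_in_meters")

    # Assumed rule: none of the higher ranks carries a value.
    if not any(record.get(rank) for rank in HIGHER_RANKS):
        flags.append("no_higher_taxonomy")

    return flags
```

For example, a record with `decimalLatitude` 55.7 and `decimalLongitude` 12.5 but no uncertainty and no higher ranks would receive `null_uncertainty_in_meters` and `no_higher_taxonomy` under these assumed rules.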

UI prioritization of flags

Give priority in general to :

  1. What (taxonomic information)
  2. Where (coordinate issues)
  3. When (date issues)

(Keep structural errors as top priority)

  1. Resource Structure
  2. Record Structure

(And then within interpretation issues prioritize as follows)

  1. GBIF Occurrence Interpretation issues:

HIGH: Coordinate out of range, Country coordinate mismatch, Zero coordinate, Taxon match higherrank, Presumed swapped coordinate, Presumed negated latitude, Presumed negated longitude, Country mismatch, Coordinate invalid, Recorded date invalid, Basis of record invalid, Taxon match none

MEDIUM: Taxon match fuzzy, Geodetic datum invalid, Geodetic datum assumed WGS 84, Country derived from coordinates, Country invalid, Coordinate reprojection suspicious, Coordinate reprojection failed, Coordinate uncertainty meters invalid, Coordinate precision invalid, Occurrence status unparsable, Occurrence status inferred from individual count, Occurrence status inferred from Basis Of Record, Individual count conflicts with occurrence status, Individual count invalid

LOW: Coordinate reprojected, Coordinate rounded, Continent invalid, Depth not metric, Elevation non numeric, Elevation min max swapped, Elevation not metric, Depth unlikely, Depth non numeric, Elevation unlikely, Continent country mismatch, Continent derived from coordinates, Recorded date mismatch, Identified date unlikely, Recorded Date Unlikely, Multimedia date invalid, Identified date invalid, Modified date invalid, Modified date unlikely, Georeferenced date invalid, Georeferenced date unlikely, Type status invalid, Ambiguous institution, Ambiguous collection, Institution match none, Collection match none, Institution match fuzzy, Collection match fuzzy, Institution collection mismatch, Different owner institution, References URI invalid, Multimedia URI invalid, Interpretation error, Depth min max swapped
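The ordering proposed above (structural errors first, then interpretation issues by HIGH, MEDIUM, LOW) could be sketched roughly as follows. This is only an illustration: the group keys, the `sort_issues` function, the tie-break by record count, and the choice to treat unlisted issues as LOW are all assumptions, and only a few issue names from the lists above are included in the lookup table.

```python
# Hypothetical sketch of the proposed UI ordering; names are illustrative.
GROUP_ORDER = {
    "RESOURCE_STRUCTURE": 0,
    "RECORD_STRUCTURE": 1,
    "OCCURRENCE_INTERPRETATION": 2,
}
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

# A small excerpt of the priority lists above; the full table would be longer.
SEVERITY = {
    "Zero coordinate": "HIGH",
    "Taxon match none": "HIGH",
    "Recorded date invalid": "HIGH",
    "Taxon match fuzzy": "MEDIUM",
    "Country invalid": "MEDIUM",
    "Coordinate rounded": "LOW",
    "Depth unlikely": "LOW",
}


def sort_issues(issues):
    """Sort (group, issue_name, record_count) tuples for display.

    Order: group, then severity, then affected record count (descending),
    then name. Issues missing from SEVERITY are assumed to be LOW.
    """
    def key(item):
        group, name, count = item
        severity = SEVERITY.get(name, "LOW")
        return (GROUP_ORDER[group], SEVERITY_ORDER[severity], -count, name)

    return sorted(issues, key=key)
```

With this key, a record-structure problem affecting three records would still be listed ahead of a LOW interpretation issue affecting hundreds, which matches the intent of keeping structural errors as the top priority.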

jlegind commented 3 years ago

Hi, I would like a feature where clustering behavior is reported. This is to help publishers avoid duplication or submitting records already in GBIF from other publishing organizations. It should at least give them an opportunity to see if clustering behavior is uncovering duplicates.

timrobertson100 commented 3 years ago

> I would like a feature where clustering behavior is reported. This is to help publishers avoid duplication or submitting records already in GBIF from other publishing organizations.

This is understandable, but I'm afraid it is not something we will be able to achieve easily. Currently, and for the foreseeable future, clustering runs as a batch process across all data.