Here is my current thinking and wishes for the new data validator:
Backend prioritized wishlist:
UI prioritized wishlist:
Additional issue flags:
I assume these additional issues could be handled separately, since the validator will mirror the pipelines' processing: we just need to add a feature to pipelines and it will appear in the data validator.
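To make the idea concrete, here is a minimal sketch (hypothetical names, not the actual pipelines API) of how a shared interpretation step would surface the same issue flags in both the ingestion pipelines and the validator:

```java
import java.util.Set;

// Hypothetical sketch: one interpretation step reused by both the
// ingestion pipelines and the data validator, so a flag added here
// automatically appears in both. All names are illustrative.
interface InterpretationStep<R> {
  Set<String> interpret(R record); // issue flags raised for this record
}

final class ZeroCoordinateCheck implements InterpretationStep<double[]> {
  @Override
  public Set<String> interpret(double[] latLng) {
    // Flag records sitting exactly at (0, 0), a common placeholder value.
    return (latLng[0] == 0.0 && latLng[1] == 0.0)
        ? Set.of("ZERO_COORDINATE")
        : Set.of();
  }
}
```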
Give priority in general to:
What (taxonomic information)
Where (coordinate issues)
When (date issues)
(Keep structural errors as top priority)
(And then, within interpretation issues, prioritize as follows; see the severity-mapping sketch after these lists.)
HIGH
Coordinate out of range
Country coordinate mismatch
Zero coordinate
Taxon match higherrank
Presumed swapped coordinate
Presumed negated latitude
Presumed negated longitude
Country mismatch
Coordinate invalid
Recorded date invalid
Basis of record invalid
Taxon match none
MEDIUM
Taxon match fuzzy
Geodetic datum invalid
Geodetic datum assumed WGS 84
Country derived from coordinates
Country invalid
Coordinate reprojection suspicious
Coordinate reprojection failed
Coordinate uncertainty meters invalid
Coordinate precision invalid
Occurrence status unparsable
Occurrence status inferred from individual count
Occurrence status inferred from basis of record
Individual count conflicts with occurrence status
Individual count invalid
LOW
Coordinate reprojected
Coordinate rounded
Continent invalid
Depth not metric
Elevation non-numeric
Elevation min max swapped
Elevation not metric
Depth unlikely
Depth non-numeric
Elevation unlikely
Continent country mismatch
Continent derived from coordinates
Recorded date mismatch
Identified date unlikely
Recorded date unlikely
Multimedia date invalid
Identified date invalid
Modified date invalid
Modified date unlikely
Georeferenced date invalid
Georeferenced date unlikely
Type status invalid
Ambiguous institution
Ambiguous collection
Institution match none
Collection match none
Institution match fuzzy
Collection match fuzzy
Institution collection mismatch
Different owner institution
References URI invalid
Multimedia URI invalid
Interpretation error
Depth min max swapped
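To show how this prioritization might be wired into the validator, here is a sketch that buckets issue flags by severity. The flag strings follow GBIF's OccurrenceIssue vocabulary, but the bucketing itself is just the lists above turned into a lookup table, not an existing API:

```java
import java.util.List;
import java.util.Map;

enum Severity { HIGH, MEDIUM, LOW }

final class IssueSeverity {
  // The HIGH/MEDIUM/LOW groups proposed above, abbreviated here.
  private static final Map<Severity, List<String>> BUCKETS = Map.of(
      Severity.HIGH, List.of(
          "COORDINATE_OUT_OF_RANGE", "COUNTRY_COORDINATE_MISMATCH",
          "ZERO_COORDINATE", "TAXON_MATCH_HIGHERRANK",
          "PRESUMED_SWAPPED_COORDINATE", "PRESUMED_NEGATED_LATITUDE",
          "PRESUMED_NEGATED_LONGITUDE", "COUNTRY_MISMATCH",
          "COORDINATE_INVALID", "RECORDED_DATE_INVALID",
          "BASIS_OF_RECORD_INVALID", "TAXON_MATCH_NONE"),
      Severity.MEDIUM, List.of(
          "TAXON_MATCH_FUZZY", "GEODETIC_DATUM_INVALID",
          "GEODETIC_DATUM_ASSUMED_WGS84"
          /* ... remaining MEDIUM flags from the list above ... */),
      Severity.LOW, List.of(
          "COORDINATE_REPROJECTED", "COORDINATE_ROUNDED",
          "CONTINENT_INVALID"
          /* ... remaining LOW flags from the list above ... */));

  // Unknown or unlisted flags default to LOW so new pipeline flags
  // never disappear from the report entirely.
  static Severity of(String flag) {
    return BUCKETS.entrySet().stream()
        .filter(e -> e.getValue().contains(flag))
        .map(Map.Entry::getKey)
        .findFirst()
        .orElse(Severity.LOW);
  }
}
```

For example, `IssueSeverity.of("ZERO_COORDINATE")` would return `HIGH`.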
Hi, I would like a feature where clustering behavior is reported. This is to help publishers avoid duplication, or submitting records that are already in GBIF from other publishing organizations. It should at least give them an opportunity to see whether clustering is uncovering duplicates.
This is understandable, but I'm afraid it is not something we will be able to achieve easily. Currently, and for the foreseeable future, clustering runs as a batch process across all data.