Datafable / gbif-dataset-metrics

Get insights in GBIF-mediated datasets with charts and metrics.
https://chrome.google.com/webstore/detail/gbif-dataset-metrics/kcianglkepodpjdiebgidhdghoaeefba
MIT License

Create date quality categories #24

Open peterdesmet opened 9 years ago

peterdesmet commented 9 years ago

Description

For a given dataset, I want to know how many records have dates. I also want to know how many of those are useful, have issues, and maybe what their precision is. I envision this as a bar chart, where the records are grouped in categories based on the quality of the dates.

Categories (in order of increasing data quality)

eventDate
issue
eventDate from verbatim.txt
verbatimEventDate
year
month
day

Process

IF eventDate != "" AND issue DOES NOT CONTAIN (
        RECORDED_DATE_MISMATCH
    )
    THEN category = "Valuable date (all in ISO8601)" /* Well, MM-DD-YYYY are still in there */
ELSEIF issue CONTAINS (
        RECORDED_DATE_MISMATCH /* The only issue that keeps eventDate populated */
    )
    OR verbatim.txt.eventDate != "" /* Since GBIF empties eventDate (see #27) in occurrence.txt,
        we'd have to look in verbatim.txt :( */
    OR verbatimEventDate != ""
    OR year != ""
    OR (year != "" AND month != "")
    OR (year != "" AND month != "" AND day != "")
    /* A date was provided */
    THEN category = "Date provided, but not interpreted by GBIF"
ELSE
    category = "Date not provided"
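
The process above could be sketched in Python roughly as follows. This is a minimal sketch, not the project's implementation: the function name `date_quality_category` and the `verbatim_event_date` parameter are hypothetical, the record is assumed to be a dict keyed by column name, and `issue` is assumed to hold issue flags separated by `;`. A missing `OR` is assumed between the `RECORDED_DATE_MISMATCH` check and the `verbatim.txt` check.

```python
def date_quality_category(record, verbatim_event_date=""):
    """Classify one occurrence record by date quality (hypothetical helper).

    record: dict of occurrence.txt columns for one row (assumed layout).
    verbatim_event_date: eventDate of the same record in verbatim.txt, if known.
    """
    # Assumption: the issue column lists flags separated by ";".
    issues = {flag.strip() for flag in record.get("issue", "").split(";")}
    has_mismatch = "RECORDED_DATE_MISMATCH" in issues

    if record.get("eventDate", "") != "" and not has_mismatch:
        return "Valuable date (all in ISO8601)"

    # The year+month and year+month+day clauses of the pseudocode are
    # already implied by year != "", so only year is tested here.
    date_was_provided = (
        has_mismatch
        or verbatim_event_date != ""
        or record.get("verbatimEventDate", "") != ""
        or record.get("year", "") != ""
    )
    if date_was_provided:
        return "Date provided, but not interpreted by GBIF"
    return "Date not provided"
```

Passing the verbatim value as a separate argument keeps the function independent of how verbatim.txt is read.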

peterdesmet commented 9 years ago

The process is pretty useless if we just use GBIF issues:

IF issue CONTAINS (
        RECORDED_DATE_INVALID
        RECORDED_DATE_MISMATCH
        RECORDED_DATE_UNLIKELY
    )
   THEN category="Date with issues"
ELSEIF eventDate != ""
   THEN category="Valuable date (all in ISO8601)"
ELSE
   category="Date not provided" /* This is just incorrect! See issue #27 */
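
For comparison, the issue-only process could look like this in Python. Again a sketch under assumptions: `issue_based_category` and `DATE_ISSUES` are hypothetical names, and `issue` is assumed to be a `;`-separated list of flags.

```python
# The three date-related issue flags named in the process above.
DATE_ISSUES = {
    "RECORDED_DATE_INVALID",
    "RECORDED_DATE_MISMATCH",
    "RECORDED_DATE_UNLIKELY",
}


def issue_based_category(record):
    """Classify a record using only GBIF issue flags and eventDate."""
    # Assumption: the issue column lists flags separated by ";".
    issues = {flag.strip() for flag in record.get("issue", "").split(";")}
    if issues & DATE_ISSUES:
        return "Date with issues"
    if record.get("eventDate", "") != "":
        return "Valuable date (all in ISO8601)"
    # Misleading when GBIF has emptied eventDate without flagging it (see #27).
    return "Date not provided"
```

A record whose original date was dropped without an issue flag falls through to the last branch, which is the weakness noted in the pseudocode comment.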

peterdesmet commented 9 years ago

@bartaelterman, @niconoe, I need your feedback on this issue:

  1. We need to look in verbatim.txt to get a useful eventDate, as GBIF overwrites it without warning in occurrence.txt (see #27; we still need to confirm with GBIF that no field in occurrence.txt keeps the original eventDate). If so, how challenging is it to loop over that file too?
  2. Do we use the Canadensys Narwhal processor to provide high quality categories, instead of the current basic ones?
  3. Or do we not tackle this issue in this POC?
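
On question 1, one possible way to pick up the original eventDate is to read verbatim.txt once and index it by record identifier. This is only a sketch: the helper name `verbatim_event_dates` is hypothetical, and it assumes verbatim.txt is tab-delimited with a header row and shares a `gbifID` column with occurrence.txt.

```python
import csv
import io


def verbatim_event_dates(verbatim_file):
    """Map gbifID -> original eventDate from an open verbatim.txt (assumed layout)."""
    # Assumption: tab-delimited text with a header row and no quoting.
    reader = csv.DictReader(verbatim_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["gbifID"]: row.get("eventDate") or "" for row in reader}


# Usage with an in-memory stand-in for verbatim.txt (made-up sample rows).
sample = io.StringIO("gbifID\teventDate\n1\t12/05/2014\n2\t\n")
lookup = verbatim_event_dates(sample)
```

The lookup holds one entry per record in memory, which may not suit very large downloads; reading both files in step would avoid that if they turn out to be ordered identically, which would need checking.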