gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Tajik dataset issues with "Publish first" #163

Closed: rukayaj closed this issue 6 months ago

rukayaj commented 8 months ago
  1. Some records have a minimum elevation greater than the maximum elevation. For example, in this record: https://www.gbif.org/occurrence/3924429834, the maximum elevation provided is 0 and the minimum is 160. See the list of records concerned here: https://www.gbif.org/occurrence/search?dataset_key=d4b0f477-0ddf-4c47-a1fe-a7ffed28788e&issue=ELEVATION_MIN_MAX_SWAPPED

  2. Nine records of the same dataset (DOI 10.15468/ntdjg9) are flagged because the dates provided are unlikely. For example, the year provided for this record, https://www.gbif.org/occurrence/3924428991, is "19"; it is likely missing a century or a decade. See all the records flagged here: https://www.gbif.org/occurrence/search?dataset_key=d4b0f477-0ddf-4c47-a1fe-a7ffed28788e&issue=RECORDED_DATE_UNLIKELY

  3. Some records from the Khatlon Scientific Center dataset (DOI 10.15468/q2y8b2) have invalid coordinates. For example, the longitude for this record, https://www.gbif.org/occurrence/4166474307, is "н4.77", which our system cannot interpret. See the list of the 12 records concerned here: https://www.gbif.org/occurrence/search?dataset_key=4929d4f6-ea8a-40cc-ab3d-0e3a9da01a45&issue=COORDINATE_INVALID
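
The flagged records listed above can also be pulled from the GBIF API instead of the website. This is only a minimal sketch, assuming the public occurrence search endpoint under api.gbif.org/v1 and its `datasetKey` and `issue` query parameters; the helper names `flagged_records_url` and `count_flagged` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public GBIF occurrence search API.
API = "https://api.gbif.org/v1/occurrence/search"


def flagged_records_url(dataset_key, issue, limit=20):
    """Build a search URL for records in one dataset carrying one issue flag."""
    query = urlencode({"datasetKey": dataset_key, "issue": issue, "limit": limit})
    return f"{API}?{query}"


def count_flagged(dataset_key, issue):
    """Ask the API how many records carry the flag (needs network access)."""
    with urlopen(flagged_records_url(dataset_key, issue, limit=0)) as response:
        return json.load(response)["count"]
```

For example, `flagged_records_url("4929d4f6-ea8a-40cc-ab3d-0e3a9da01a45", "COORDINATE_INVALID")` gives the API counterpart of the last search link above.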

rukayaj commented 8 months ago

Linked to https://github.com/gbif-norway/helpdesk/issues/84

rukayaj commented 8 months ago

From discussion on 10/11 Oct:

We should rewrite this so it uses function calling in two conversations:

  1. System message: You are an expert herbarium label transcription system ... write out the verbatim Darwin Core terms (and the DwC terms which don't have verbatim alternatives) that it finds. [This could be a function call?]

  2. System message: Here are some Darwin Core terms. Can you run some sanity checks and extract some new terms from the verbatim terms, if they have values:

    • minimumElevationInMeters should be an integer and smaller than maximumElevationInMeters (also an integer)
    • Dates should be plausible and stored as integers (or whatever format we settle on)
    • ? [This should force a function call to register_dwc, taking a dict argument with verbatim DwC terms as keys.]

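
A rough sketch of what the two concrete checks in the list could look like in plain Python, outside of any model call. Everything here is an assumption for illustration: the function names `to_int` and `sanity_check`, the use of the DwC `year` term for the date check, the `earliest_year` cut-off of 1600, and the choice to accept equal minimum and maximum elevations.

```python
from datetime import date


def to_int(value):
    """Return value as an int, or None if it is not a whole number."""
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    return int(number) if number.is_integer() else None


def sanity_check(terms, earliest_year=1600):
    """Return a list of human-readable problems found in a dict of DwC terms."""
    problems = []
    parsed = {}
    for name in ("minimumElevationInMeters", "maximumElevationInMeters", "year"):
        raw = terms.get(name)
        if raw is None or str(raw).strip() == "":
            continue  # only check terms that actually have a value
        parsed[name] = to_int(raw)
        if parsed[name] is None:
            problems.append(f"{name} is not an integer: {raw!r}")

    low = parsed.get("minimumElevationInMeters")
    high = parsed.get("maximumElevationInMeters")
    # Assumption: equal values are fine, only min > max is reported.
    if low is not None and high is not None and low > high:
        problems.append(f"minimum elevation {low} exceeds maximum elevation {high}")

    year = parsed.get("year")
    # Assumption: anything before earliest_year or in the future is implausible.
    if year is not None and not earliest_year <= year <= date.today().year:
        problems.append(f"year {year} is implausible")
    return problems
```

Called with `{"minimumElevationInMeters": "160", "maximumElevationInMeters": "0", "year": "19"}`, which mirrors the flagged records in this issue, it reports two problems: the swapped elevations and the implausible year.
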
rukayaj commented 6 months ago

The actual dataset issues are fixed now and I'm working on the sanity checks, so I'm going to close this.