SpeciesFileGroup / INHS-Insect-Collection-Data-Curation

An accessible issue tracker for reporting issues or requests with respect to INHS data quality.

verbatimEventDate issues #78

Open Mesibov opened 1 month ago

Mesibov commented 1 month ago

Some apparently valid verbatimEventDate (vED) entries do not have a corresponding eventDate (eD). These are listed with their ids in the attached valid-vED-no-eD.txt.

I also found 239 disagreements between vED and eD in 993 records where both fields have entries. These are listed by disagreement (sorted by vED, with number of records for each) in vED-eD-disagreements-by-type.txt, and by record in vED-eD-disagreements-by-record.txt (sorted by record id).

I've excluded from vED-eD disagreements all the various formatting variations and doubtful constructs, such as

Formatting variations have greatly increased the number of eDs for the same unique vED, as in this example:

No. of records | verbatimEventDate | eventDate
9 | 20-July-1999 | 1999-07-20/1999-07-20
3 | 20-July-1999 | 1999-07-20/1999-07-21 #error?
2 | 20-July-1999 | 1999-07-20
2 | 20-July-1999 | 1999-06-20/1999-06-20 #apparent error
5 | 20-July-1999 | 1999-07-19/1999-07-21 #error?
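A check like the one behind this table can be sketched in a few lines of Python. This is only an illustration, assuming a vED of the form day-MonthName-year with English month names and an eD that is either a single ISO date or a start/end interval; the function names (`parse_ved`, `parse_ed`, `classify`) and the three labels are hypothetical and are not taken from the attached files.

```python
from datetime import date

# English month names; an assumption about how vED spells months.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def parse_ved(ved):
    """Parse a vED such as '20-July-1999' into a date."""
    day, month_name, year = ved.split("-")
    return date(int(year), MONTHS.index(month_name) + 1, int(day))

def parse_ed(ed):
    """Return (start, end) for an ISO date or an ISO start/end interval."""
    parts = ed.split("/")
    return date.fromisoformat(parts[0]), date.fromisoformat(parts[-1])

def classify(ved, ed):
    """Compare one vED with one eD and return a hypothetical label."""
    day = parse_ved(ved)
    start, end = parse_ed(ed)
    if start == day == end:
        return "agrees"
    if start <= day <= end:
        return "interval wider than vED"
    return "disagrees"

# Rows copied from the example table: (no. of records, vED, eD).
rows = [
    (9, "20-July-1999", "1999-07-20/1999-07-20"),
    (3, "20-July-1999", "1999-07-20/1999-07-21"),
    (2, "20-July-1999", "1999-07-20"),
    (2, "20-July-1999", "1999-06-20/1999-06-20"),
    (5, "20-July-1999", "1999-07-19/1999-07-21"),
]

for count, ved, ed in rows:
    print(count, ved, ed, classify(ved, ed), sep=" | ")
```

On these five rows the sketch separates the two exact matches, the two intervals that are wider than the single-day vED, and the one eD that falls in a different month.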

Two sorts of ambiguities affect vED-eD relationships. One is that both DMY and MDY constructions appear in vED. Where these look questionable I've included them in the disagreements list.
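To illustrate the DMY/MDY problem, here is a minimal Python sketch that sorts all-numeric dates into those that can only be read one way and those that can be read both ways. It assumes a vED of the form n-n-yyyy with `-`, `/` or `.` as separators; the function name `dmy_mdy_status`, its labels, and the sample strings are hypothetical and do not come from the INHS data.

```python
import re

# Two 1-2 digit numbers and a 4-digit year, separated by '-', '/' or '.'.
NUMERIC_DATE = re.compile(r"\s*(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\s*")

def dmy_mdy_status(ved):
    """Say whether a numeric date can be read as DMY, MDY, or both."""
    match = NUMERIC_DATE.fullmatch(ved)
    if not match:
        return "not numeric"
    first, second = int(match.group(1)), int(match.group(2))
    first_can_be_month = 1 <= first <= 12
    second_can_be_month = 1 <= second <= 12
    if not (1 <= first <= 31 and 1 <= second <= 31):
        return "invalid"
    if first_can_be_month and second_can_be_month:
        # 05/05/1999 reads the same either way, 03/04/1999 does not.
        return "unambiguous" if first == second else "ambiguous"
    if second_can_be_month:
        return "DMY"
    if first_can_be_month:
        return "MDY"
    return "invalid"

for sample in ["13/02/1999", "02/13/1999", "03/04/1999", "05/05/1999"]:
    print(sample, dmy_mdy_status(sample))
```

A sketch like this only flags the strings that need a second look; deciding which reading is correct still depends on other fields or on neighbouring records.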

The second ambiguity is that where only 2 digits are used to designate years in vED, the eD may have the wrong century. The eD compiler may have used formatting or information from other fields to decide on a century, so I've trusted the resulting eDs. To take just one example of many, from vED alone it's hard to understand why these two vEDs are assigned to different centuries:

id | vED | eD
6677199 | 1-VII-90 | 1890-07-01
7031601 | 1-VIII-90 | 1990-08-01

I'll look at possible "century errors" in a later issue here.
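The century ambiguity can be made explicit by listing every date a two-digit-year vED could stand for. The Python sketch below is an illustration only: it assumes the day-RomanMonth-yy pattern seen in the two records above and considers just the 1800s and 1900s; `candidate_dates` and its `centuries` parameter are hypothetical names.

```python
import re
from datetime import date

ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# Day, Roman-numeral month, two-digit year, separated by '-' or '.'.
TWO_DIGIT_YEAR = re.compile(
    r"\s*(\d{1,2})\s*[-.]\s*([IVXivx]+)\s*[-.]\s*(\d{2})\s*")

def candidate_dates(ved, centuries=(1800, 1900)):
    """Return every valid date the vED could mean, one per century."""
    match = TWO_DIGIT_YEAR.fullmatch(ved)
    if not match:
        return []
    day = int(match.group(1))
    month = ROMAN_MONTHS.get(match.group(2).upper())
    year_in_century = int(match.group(3))
    if month is None:
        return []
    candidates = []
    for century in centuries:
        try:
            candidates.append(date(century + year_in_century, month, day))
        except ValueError:
            # Skip combinations that are not real calendar dates.
            pass
    return candidates

for ved, ed in [("1-VII-90", "1890-07-01"), ("1-VIII-90", "1990-08-01")]:
    options = candidate_dates(ved)
    print(ved, [d.isoformat() for d in options],
          date.fromisoformat(ed) in options)
```

Both eDs in the example are among the candidates for their vED, which is why they are not treated as disagreements: the vED alone cannot say which century is right.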

Finally, these 3 records have the invalid verbatimEventDate "31.ii.2021":

id | verbatimEventDate | recordedBy | verbatimLatitude | verbatimLongitude
6664453 | 31-ii-2021 | JLR MAR | 42.442 | -88.23489
6664454 | 31-ii-2021 | JLR MAR | 42.442 | -88.23489
6664469 | 31-ii-2021 | JLR MAR | 42.44200 | -88.23489

but they look to be part of a series and the lat/lons match for 31 March 2021:

...
6664550 | 30-III- 2021 | JLR MAR | 41.30246 | -89.03896
6664379 | 31-iii-2021 | JLR, MAR | 42.44200N | -88.23489W
6664445 | 1-IV-2021 | JLR MAR | 41.33954 | -89.04424
...
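An impossible date such as 31 February can be caught by trying to build a real calendar date from the parts. The following Python sketch assumes the day-RomanMonth-yyyy pattern used in this series, with `-` or `.` as separators and optional stray spaces; `parse_roman_date` is a hypothetical helper, not something used to produce the attached files.

```python
import re
from datetime import date

ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# Day, Roman-numeral month, four-digit year; tolerates spaces around parts.
ROMAN_DATE = re.compile(
    r"\s*(\d{1,2})\s*[-.]\s*([IVXivx]+)\s*[-.]\s*(\d{4})\s*")

def parse_roman_date(ved):
    """Return a date for a valid vED, or None if it is not a real date."""
    match = ROMAN_DATE.fullmatch(ved)
    if not match:
        return None
    month = ROMAN_MONTHS.get(match.group(2).upper())
    if month is None:
        return None
    try:
        return date(int(match.group(3)), month, int(match.group(1)))
    except ValueError:
        # For example, day 31 in a month that has fewer days.
        return None

for ved in ["31-ii-2021", "30-III- 2021", "31-iii-2021", "1-IV-2021"]:
    print(repr(ved), parse_roman_date(ved))
```

The three neighbouring records parse to 30 March, 31 March and 1 April 2021, while "31-ii-2021" yields None, consistent with the suggestion that "ii" is a slip for "iii".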

valid-vED-no-eD.txt

vED-eD-disagreements-by-record.txt

vED-eD-disagreements-by-type.txt

tmcelrath commented 1 month ago

Hey Bob - I don't know what "id" field you are using, but I can't find anything in our system using it. That might be a GBIF ID, which we don't recognize. If you can use the "catalogNumber" field instead that would be helpful to me.

tmcelrath commented 1 month ago

The other field you could use is the occurrenceID, which we also use/generate.

tmcelrath commented 1 month ago

Sorry to make you regenerate the files but I can't use them as is.

Mesibov commented 1 month ago

@tmcelrath, no problem, files attached with added occurrenceID and (if available) catalogNumber. Revised message text:

...

These 3 records have the invalid verbatimEventDate "31.ii.2021":

id | verbatimEventDate | recordedBy | verbatimLatitude | verbatimLongitude

but they look to be part of a series and the lat/lons match for 31 March 2021:

...
6664550 | 30-III- 2021 | JLR MAR | 41.30246 | -89.03896 | 6bbc1005-bd8a-4977-b231-838725051da3 | INHS Insect Collection 932592
6664379 | 31-iii-2021 | JLR, MAR | 42.44200N | -88.23489W | 468adfa7-7d11-4698-a290-ad12de8a70aa | INHS Insect Collection 932519
6664445 | 1-IV-2021 | JLR MAR | 41.33954 | -89.04424 | 61418a3a-cd5a-4a6f-b51f-c596a210e9cf | INHS Insect Collection 932525

new-valid-vED-no-eD.txt

new-vED-eD-disagreements-by-record.txt