SpeciesFileGroup / INHS-Insect-Collection-Data-Curation

An accessible issue tracker for reporting issues or requests with respect to INHS data quality.

2024-04-27 via Mesibov on GBIF data: eventDate contains invalid and malformed entries #70

Open mjy opened 2 months ago

mjy commented 2 months ago

eventDate contains invalid and malformed entries (e.g. "03-20", "07/07", "01-05/01-05", "1876-07-04/1876-07-04", "1876-01-01/1876-12-31"), and dateIdentified has incorrectly formatted ones (e.g. "1919-7-5"), so it's a challenge to cross-check these fields. Even so, I found at least 5200 records with dateIdentified before eventDate, which is impossible.

occurrenceID | eventDate | dateIdentified
223bdd7c-8994-4a00-87a1-1858347e63c5 | 1999-06-23/1999-06-23 | 1998
94b6a73a-7de4-451f-9d41-83338b74340e | 2000-08-07/2000-08-07 | 1998
5204d8f6-efd1-400e-be2b-ad31c9f0a869 | 2000-07-12/2000-07-12 | 1998
e0b936c2-74a0-4a8c-a9f1-2f45a2c8fbfb | 2000-07-30 | 1998
cc543268-4780-41e8-99c2-fcd69127c894 | 2000-07-17/2000-07-18 | 1999
48a19e8e-d1e6-4536-bedd-200185018972 | 1999-07-26/1999-07-28 | 1998
5aba7156-ca0f-43d8-97b3-85daae13d6d6 | 1999-05-17/1999-05-17 | 1997
88034de0-ee69-4570-b875-73d4c5537aef | 1999-09-08/1999-09-10 | 1998
bfa8ef3b-63f4-48d8-b733-065a628c1ae0 | 1997-07-29/1997-07-29 | 1993
e6059991-1ca5-4585-9761-3d3289bd3333 | 1998-05-13/1998-05-14 | 1997
2b134e80-6533-4989-b4c1-94f8239cecd2 | 1999-07-22 | 1998
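
For anyone wanting to reproduce the format check, here is a minimal Python sketch, not the method used in the report. It assumes an eventDate should be a zero-padded ISO 8601 date (year, year-month or year-month-day) or two such dates joined by "/". The function names are hypothetical. Whole-year intervals such as "1876-01-01/1876-12-31" pass this structural check even though they could be shortened.

```python
import re

# Zero-padded ISO 8601 date at year, year-month or year-month-day precision.
# Note: this does not check that the day exists in the month (e.g. 02-31).
ISO_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def is_iso_date(value: str) -> bool:
    """True if value looks like YYYY, YYYY-MM or YYYY-MM-DD."""
    return ISO_DATE.fullmatch(value) is not None


def classify_event_date(value: str) -> str:
    """Return 'ok', 'redundant-interval' or 'malformed' for an eventDate."""
    parts = value.split("/")
    if len(parts) > 2 or not all(is_iso_date(p) for p in parts):
        return "malformed"
    if len(parts) == 2 and parts[0] == parts[1]:
        # Start and end are identical, so a single date would do.
        return "redundant-interval"
    return "ok"


for example in ["03-20", "07/07", "1876-07-04/1876-07-04", "2000-07-30"]:
    print(example, "->", classify_event_date(example))
```

The same `is_iso_date` helper also rejects dateIdentified values such as "1919-7-5", which are not zero-padded.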

tmcelrath commented 2 months ago

Not sure how to find this in TW. Can we post a list/JSON query? @mjy

Mesibov commented 2 months ago

@tmcelrath, I suggested in my report that I could tidy up the misformatted dates and give you a complete list of records with this problem, if you wanted. How I find these is explained on my "Darwin Core checker" website.

mjy commented 2 months ago

@Mesibov I haven't fully grokked this issue, but one quick comment (also impacts lat/long export).

We have 7 date fields: verbatim_date plus six parsed fields (start and end year, month and day). Depending on whether the user has parsed verbatim_date, you may or may not get erroneous data in the export. Some true errors in verbatim_date will only get resolved in the export if we parse them into the six-field equivalent.

Mesibov commented 2 months ago

I'm happy to check date fields for consistency, but the issue here is a logical one. If you collected an insect in 1999 you could not have identified it in 1998.

Mesibov commented 2 months ago

@mjy, @tmcelrath, attached as a TSV are the 5249 date anomalies I found in which dateIdentified is earlier than the non-interval eventDate, or earlier than the "finish" date in an interval eventDate. For each record I give the "id" from the occurrence.txt table, the original eventDate, the tidied eventDate (see below), the original dateIdentified and the tidied dateIdentified (see below). I ignored the records with no year in eventDate and I haven't checked verbatimEventDate against eventDate.
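
The comparison rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script that produced the attached file. It assumes year-first dates, takes the "finish" date of an interval, and compares only at the precision both dates share, so a bare year such as "1999" is not flagged against "1999-06-23". The function names are hypothetical, and records with no year in eventDate are assumed to have been excluded beforehand.

```python
def parse_parts(date: str) -> tuple[int, ...]:
    """'1999-06-23' -> (1999, 6, 23); '1998' -> (1998,)."""
    return tuple(int(p) for p in date.split("-"))


def identified_before_collected(event_date: str, date_identified: str) -> bool:
    """True if dateIdentified is earlier than the (finish of the) eventDate."""
    # For an interval 'start/end' use the end; a single date is its own end.
    finish = event_date.split("/")[-1]
    collected = parse_parts(finish)
    identified = parse_parts(date_identified)
    # Assumption: compare only the components both dates have, so that
    # differing precision alone never produces an anomaly.
    n = min(len(collected), len(identified))
    return identified[:n] < collected[:n]


print(identified_before_collected("1999-06-23/1999-06-23", "1998"))
print(identified_before_collected("1999-06-23", "1999"))
```

Because the components are parsed as integers, unpadded values such as "1919-7-5" compare correctly without tidying first.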

A lot of these look like dateIdentified copy-down errors in a spreadsheet.

Tidying of eventDate (by example):
1875-02-06/1875-02-06 > 1875-02-06
1847-01-01/1847-12-31 > 1847
1997-01-01/1998-12-31 > 1997/1998
1877-06-01/1877-06-30 > 1877-06
1875-07-01/1875-07-31 > 1875-07
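
One way to express these eventDate rules in code is sketched below. It is a generalisation from the five examples, not the actual tidying script: the function name is hypothetical, and the handling of intervals that span whole months in different months (e.g. June to July) is an assumption the examples do not cover.

```python
import calendar
from datetime import date


def tidy_event_date(value: str) -> str:
    """Shorten a 'start/end' eventDate when it only encodes lower precision."""
    if "/" not in value:
        return value
    start_s, end_s = value.split("/", 1)
    try:
        start, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
    except ValueError:
        # Not a pair of full dates (e.g. "01-05/01-05"): leave untouched.
        return value
    last_day = calendar.monthrange(end.year, end.month)[1]
    if (start.month, start.day) == (1, 1) and (end.month, end.day) == (12, 31):
        # Whole calendar year(s): keep only the years.
        start_s, end_s = f"{start.year:04d}", f"{end.year:04d}"
    elif start.day == 1 and end.day == last_day:
        # Whole calendar month(s): keep year and month.
        start_s = f"{start.year:04d}-{start.month:02d}"
        end_s = f"{end.year:04d}-{end.month:02d}"
    # Collapse the interval when both ends are now the same.
    return start_s if start_s == end_s else f"{start_s}/{end_s}"


for raw in ["1875-02-06/1875-02-06", "1847-01-01/1847-12-31",
            "1997-01-01/1998-12-31", "1877-06-01/1877-06-30"]:
    print(raw, ">", tidy_event_date(raw))
```

Values that are already single dates, or that cannot be parsed as two full dates, are returned unchanged so that malformed entries stay visible for manual review.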

Tidying of dateIdentified (by example):
2023-5-8 > 2023-05-08
1941-1 > 1941-01
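
The dateIdentified tidying amounts to zero-padding month and day. A minimal Python sketch follows, with a hypothetical function name. It assumes year-first values and leaves anything else unchanged. It does not check that the month or day is in range.

```python
import re

# Four-digit year, then optional 1-2 digit month and optional 1-2 digit day.
LOOSE_DATE = re.compile(r"(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?")


def tidy_date_identified(value: str) -> str:
    """Zero-pad month and day: '2023-5-8' -> '2023-05-08'."""
    m = LOOSE_DATE.fullmatch(value)
    if m is None:
        return value  # not a year-first date; leave for manual review
    year, month, day = m.groups()
    parts = [year]
    if month is not None:
        parts.append(f"{int(month):02d}")
    if day is not None:
        parts.append(f"{int(day):02d}")
    return "-".join(parts)


for raw in ["2023-5-8", "1941-1", "1998"]:
    print(raw, ">", tidy_date_identified(raw))
```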

eventDate-dateIdentified-anomalies.txt

tmcelrath commented 2 months ago

Hey @Mesibov - as with the other issue, I need a .txt file with occurrenceID instead of id.

Mesibov commented 2 months ago

@tmcelrath, no problem, the attached file has id, occurrenceID and (if available) catalogNumber. Please note that many of these cases might be due to eventDate errors arising from the verbatimEventDate-to-eventDate problems.

new-eventDate-dateIdentified-anomalies.txt

tmcelrath commented 2 months ago

Some of these are flagged with "Determination is preceding collection date" - we need to be able to search by that in TW.

mjy commented 2 months ago

... we need to be able to search by that in TW.

And it needs to be a hard validation.