SpeciesFileGroup / INHS-Insect-Collection-Data-Curation

An accessible issue tracker for reporting issues or requests with respect to INHS data quality.

(Some) century errors in eventDate #79

Open Mesibov opened 4 months ago

Mesibov commented 4 months ago

It's not easy to spot century errors in eventDate because there are several different (possible) sources of error. One usually reliable method is to look for eventDate entries which don't agree with the lifespans of people in recordedBy. This relies on correct identification of the persons and can be problematic, even with full names. (Imagine several "George Edward Smith" collectors over 150 years.)

If recordedByID is correct, the job is a bit simpler. Here are three records, found by checking recordedByID, in which eventDate and/or verbatimEventDate are clearly off by 100 years:

occurrenceId | catalogNumber | eventDate | verbatimEventDate | recordedBy
b94abb1e-cb45-4752-876d-c5a996ce9d4e | INHS Insect Collection 1500725 | 1909-10-14/1909-10-14 | 14.X.09 | Taylor, Steven J.
ff6a5e6c-37e6-4a42-81e6-af10fffe3ab7 | INHS Insect Collection 1011610 | 1897-12-20/1897-12-20 | 20-XII-97 to 10-I-1998 | Barria, G. | Irwin, Michael E.
d253c3f7-947a-4fa6-b0f7-5375d034b6c8 | INHS Insect Collection 938279 | 1906-01-11/1906-01-11 | 11 January 1906 | Irwin, Michael E. | Webb, Donald Wayne | Schlinger, Evert I.
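
For anyone wanting to repeat this kind of check, here is a rough sketch of the idea in Python. It is not the script used for this issue. The `LIFESPANS` lookup, the example identifier, the birth year and the minimum collecting age are all hypothetical placeholders; real lifespans would have to come from whatever source recordedByID points to.

```python
from datetime import date

# Hypothetical lookup: recordedByID -> (birth year, death year or None if living).
# The identifier and years below are placeholders, not real collector data.
LIFESPANS = {
    "https://example.org/collector/ge-smith": (1950, None),
}

MIN_COLLECTING_AGE = 10  # arbitrary assumption


def start_year(event_date: str) -> int:
    """Return the year at the start of an ISO 8601 date or date interval."""
    return int(event_date.split("/")[0][:4])


def is_out_of_bounds(event_date: str, recorded_by_id: str) -> bool:
    """True if the event year falls outside the collector's plausible active years."""
    lifespan = LIFESPANS.get(recorded_by_id)
    if lifespan is None:
        return False  # unknown collector: nothing to compare against
    born, died = lifespan
    last_year = died if died is not None else date.today().year
    return not (born + MIN_COLLECTING_AGE <= start_year(event_date) <= last_year)


def looks_like_century_error(event_date: str, recorded_by_id: str) -> bool:
    """True if the date is out of bounds but fits once shifted by exactly 100 years."""
    if not is_out_of_bounds(event_date, recorded_by_id):
        return False
    start = event_date.split("/")[0]
    year, rest = int(start[:4]), start[4:]
    return any(
        not is_out_of_bounds(f"{year + shift}{rest}", recorded_by_id)
        for shift in (100, -100)
    )
```

With the placeholder lifespan above, an eventDate of 1909-10-14/1909-10-14 would be flagged, because 1909 is out of bounds but 2009 is not. Only the start of an interval is examined.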

Doing a similar check on name strings in recordedBy is much less profitable because names in databases are not usually normalised and can have many forms. All of these variations on my fictitious collector are in your insects database:

George Edward Smith
George E. Smith
G.E. Smith
GE Smith
GESmith
G.E. S.
G.E.S.
GES
Smith, George Edward
Smith, George E.
Smith, G.E.
Smith, GE
Smith GE
Smith
etc
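
To give a feel for what normalising such strings involves, here is a minimal Python sketch that reduces a single name to a (surname, initials) key. The function name `name_key` and its rules are illustrative assumptions, not the regex actually used on the INHS data. It expects one person per string, so multi-collector values would need splitting on " | " first.

```python
import re

# Split fused forms such as "GESmith" into "GE Smith".
FUSED = re.compile(r"(?<=[A-Z])(?=[A-Z][a-z])")
# Runs of letters (Unicode-aware); dots, commas and spaces act as separators.
WORD = re.compile(r"[^\W\d_]+")


def is_initials(token: str) -> bool:
    """Treat short all-capital tokens ("G", "GE", "GES") as initials.

    This is a heuristic: a short surname written in capitals would be misread.
    """
    return token.isupper() and len(token) <= 3


def name_key(raw: str):
    """Return (surname, initials) in lower case, or None for initials-only strings."""
    s = FUSED.sub(" ", raw.strip())
    if "," in s:
        # "Smith, George E." form: surname comes before the comma.
        surname_part, given_part = s.split(",", 1)
        surname = " ".join(WORD.findall(surname_part))
        given = WORD.findall(given_part)
    else:
        tokens = WORD.findall(s)
        if all(is_initials(t) for t in tokens):
            return None  # "G.E.S."-type strings: the surname cannot be recovered
        if is_initials(tokens[-1]) and not is_initials(tokens[0]):
            surname, given = tokens[0], tokens[1:]    # "Smith GE"
        else:
            surname, given = tokens[-1], tokens[:-1]  # "George E. Smith"
    initials = "".join(t.lower() if is_initials(t) else t[0].lower() for t in given)
    return surname.lower(), initials
```

Under these rules every full variant above, from "George Edward Smith" to "Smith GE", maps to the same key, while "G.E. S.", "G.E.S." and "GES" return None. Hyphenated or apostrophised surnames are not handled.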

I did a little bit of normalising (with regex) before giving up in despair! The recordedBy name strings in the attached possible-century-errors.txt all have eventDate spans of at least 100 years (an arbitrary choice). I've only checked my acquaintances Jeff Skevington and Gail Kampmeier and found these "out of bounds" records:

occurrenceId | catalogNumber | eventDate | verbatimEventDate | recordedBy
ed934649-4147-4b98-9871-59f619d752e4 | INHS Insect Collection 286513 | 1887-08-12 | | J. Skevington
6efb1727-403e-48bb-b09d-7d32011fc244 | INHS Insect Collection 739575 | 1894-05-31 | | G.E. Kampmeier
1e4afd29-414a-423a-b6a1-d805b68ea478 | INHS Insect Collection 728616 | 1894-06-03 | | G.E. Kampmeier
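
The "span of at least 100 years" filter can be sketched roughly as follows. This is an illustration in Python, not the command that produced the attached list. The function name `wide_span_names` and the input shape (pairs of recordedBy and eventDate) are assumptions, and the names are grouped exactly as given, so any normalising has to happen beforehand.

```python
from collections import defaultdict


def wide_span_names(records, min_span=100):
    """Map each recordedBy string to (earliest, latest) year if the span >= min_span.

    records: iterable of (recordedBy, eventDate) pairs, with eventDate as an
    ISO 8601 date or date interval. Blank or unparseable dates are skipped.
    """
    years = defaultdict(list)
    for name, event_date in records:
        for part in event_date.split("/"):
            if part[:4].isdigit():
                years[name].append(int(part[:4]))
    return {
        name: (min(ys), max(ys))
        for name, ys in years.items()
        if max(ys) - min(ys) >= min_span
    }


# Made-up rows for the fictitious collector, purely for illustration.
example = [
    ("G.E. Smith", "1894-05-31"),
    ("G.E. Smith", "1996-07-02/1996-07-09"),
    ("A. Nother", "1988-04-01"),
    ("A. Nother", ""),
]
print(wide_span_names(example))
```

A name string that passes this filter is only a candidate: a wide span can equally mean two different people sharing one name string.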

Please understand that the possible-century-errors.txt list is seriously incomplete because of name variants. It will be much harder to locate existing century errors in your database than it will be to prevent them in future when digitising, and for labels with "G.E.S."-type abbreviations the data will fall into "probably" and "who knows?" classes. Bionomia can help a lot with the legacy issues, but that's scholarly detective work that coders like me can't help you with.

possible-century-errors.txt