gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Investigate Excel Date bug #27

Open timrobertson100 opened 7 years ago

timrobertson100 commented 7 years ago

From private email correspondence:

This email is mainly about a particular source of duplicate records, one originating from Microsoft Excel. If you already know about it and GBIF screens for it, that's fine. As shown below, some dupes may have slipped past you.

The Tasmanian Natural Values Atlas feeds ALA and currently has more than 1.1 million records there. The TNVA has many quality control issues. One of them was explained to me in a 2013 email by a TNVA person this way:

"There seems to have been a systematic error with the dates - many of them are out by exactly 4 years and one day (this is a known Excel issue http://support.microsoft.com/kb/180162 )".

I did a quick check today on ALA's export of TNVA records, and sure enough, there are lots of "Excel date duplicates" - same scientificName, same recordedBy, same decimalLatitude and decimalLongitude, but dates differing by 4 years and 1 day. The Excel bug shifts the date backwards in time, so there are (for example) 2 records of collections of the onychophoran Ooperipatellus cryptus by me dated 3 Feb 1976 and 2 Feb 1972 at -41.1121485215089, 144.957534187555. And I didn't arrive in Tasmania until 1973...
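
As a quick sanity check on the "4 years and 1 day" figure: the offset between Excel's 1900 and 1904 date systems is 1,462 days, and subtracting that from the 1976 date above gives the 1972 one. This is only a minimal sketch of the arithmetic; the constant name is made up for illustration.

```python
from datetime import date, timedelta

# Offset between Excel's 1900 and 1904 date systems (4 years and 1 day).
# The name EXCEL_1904_OFFSET is illustrative, not from any library.
EXCEL_1904_OFFSET = timedelta(days=1462)

later = date(1976, 2, 3)             # date on the 1976 record
shifted = later - EXCEL_1904_OFFSET  # what the Excel shift would produce

print(shifted)  # 1972-02-02
```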

TNVA hasn't cleaned out the Excel duplicates, so they've gone to ALA. Here are those 2 records:

1972: https://biocache.ala.org.au/occurrences/1e633a0e-82a6-46d9-a7c4-46a99e2394cd
1976: https://biocache.ala.org.au/occurrences/94946259-785c-4f1a-ac14-a496ad4d719d

and in GBIF they're (gbifID)

1972: 1648019104
1976: 1647193398

Now for 2 further data stuff-ups.

The TNVA has this record as a humanObservation. It isn't: I actually collected the specimen and put it in the Queen Victoria Museum and Art Gallery, where the specimen lot was registered as 11:5164. Because "O. cryptus" was an unpublished name until the mid-1990s, the specimen was registered as "O. insignis", a catch-all taxon not actually found in Tasmania. Here's the record in ALA:

https://biocache.ala.org.au/occurrences/847c41f3-7208-44a8-a978-a367783c6018

and in GBIF:

1132260361

If you check this record, you'll see the date is 4 February 1976. This is the correct date. Both "2 Feb" and "3 Feb" are TNVA errors: Excel bug and data entry, respectively. So ALA and GBIF actually hold 3 records for the same sample: 2 with the wrong date and 1 with the wrong name.

OK, so it's a mess and needs cleaning up. In January 2013 I catalogued the Onychophora errors for TNVA and sent them a replacement list of 272 records. On 23 January 2013 the officer in charge went through the issues and emailed "I will try to get all of these corrected ASAP".

Meanwhile, it might be worth GBIF's while doing a global check for the Microsoft Excel dates bug and any duplicates it may have generated.
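
One possible shape for such a check, as a rough sketch only: group records by scientificName, recordedBy, decimalLatitude and decimalLongitude, then flag pairs in a group whose dates differ by exactly 1,462 days. The function name, the `eventDate` key, the in-memory list of dicts, and the placeholder collector name are all assumptions for illustration; this is not how GBIF's pipelines actually work, and a real check would run against the occurrence store.

```python
from collections import defaultdict
from datetime import date, timedelta

# Offset between Excel's 1900 and 1904 date systems (4 years and 1 day).
EXCEL_1904_OFFSET = timedelta(days=1462)


def find_excel_date_duplicates(records):
    """Return (earlier, later) record pairs that share name, collector and
    coordinates and whose dates differ by exactly 1462 days.

    Hypothetical helper: each record is a dict with Darwin Core style keys.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = (
            rec["scientificName"],
            rec["recordedBy"],
            rec["decimalLatitude"],
            rec["decimalLongitude"],
        )
        groups[key][rec["eventDate"]].append(rec)

    pairs = []
    for by_date in groups.values():
        for day, later_recs in by_date.items():
            # .get() avoids creating empty entries while iterating
            for earlier in by_date.get(day - EXCEL_1904_OFFSET, []):
                for later in later_recs:
                    pairs.append((earlier, later))
    return pairs


# The two gbifIDs come from this issue; the collector name is a placeholder
# and the third record is invented as a non-matching control.
sample = [
    {"gbifID": 1648019104, "scientificName": "Ooperipatellus cryptus",
     "recordedBy": "collector A", "decimalLatitude": "-41.1121485215089",
     "decimalLongitude": "144.957534187555", "eventDate": date(1972, 2, 2)},
    {"gbifID": 1647193398, "scientificName": "Ooperipatellus cryptus",
     "recordedBy": "collector A", "decimalLatitude": "-41.1121485215089",
     "decimalLongitude": "144.957534187555", "eventDate": date(1976, 2, 3)},
    {"gbifID": 0, "scientificName": "Ooperipatellus cryptus",
     "recordedBy": "collector A", "decimalLatitude": "-41.1121485215089",
     "decimalLongitude": "144.957534187555", "eventDate": date(1980, 1, 1)},
]

pairs = find_excel_date_duplicates(sample)
print([(e["gbifID"], l["gbifID"]) for e, l in pairs])
```

Pairs found this way are only candidates: a collector could genuinely revisit a site 1,462 days later, so flagged records would still need review.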

MattBlissett commented 6 years ago

FAO @jhnwllr.

(Not saying you do this analysis, but I doubt data products check issues in this repository very often.)