gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Oddities in Fishes of Texas dataset #5026

Open sformel-usgs opened 10 months ago

sformel-usgs commented 10 months ago

Communicated to US node manager via email:

I'm finding a lot of oddities in the Fishes of Texas dataset related to an analysis I'm trying to do. First, they have a couple of records where the individual count is in the hundreds of thousands (this one and also this one). It looks like they are stocking records from Texas Parks and Wildlife Dept, but they still seem weird to me. Then I found several records (here's one) where the event date is 1716-01-01, which is interpreted from the original collection date 1716-01-01/2017-12-31, and the observation comes from the Tulane Museum of Natural History and from OBIS (but the OBIS URL is an old iobis one).

Mesibov commented 10 months ago

@sformel-usgs, something odd indeed seems to have happened. This Tulane resource apparently has the same species, count and catalog number as this Fishes of Texas record, but unlike the FoT record it gives:

locality = Gulf of Mexico off mouth of Oyster Bayou
decLat/decLon = 29.21445/-91.13111
eventDate = 1952-01-11
recordedBy = A. Carrere

[Edited for incorrect link to Tulane resource]

sformel-usgs commented 10 months ago

@Mesibov thanks for taking a look at this. I wasn't able to find the Texas or Tulane people easily on GitHub, so I'm reaching out to them via email.

sformel-usgs commented 10 months ago

I did a little more looking and found a few more. There are at least 9 occurrences for Syngnathus louisianae that give a date range of 1716 - 2017 for eventDate.

I contacted Dean Hendrickson from the Fishes of Texas POCs and received a prompt reply:

Thanks! We try to catch things like that, but know we miss some, and are happy to have them pointed out to us. I'm cc'ing Adam Cohen, who is the primary data manager for Fishes of Texas, so that he can add these to the list.

I'll update this after I hear more from Adam.

Mesibov commented 10 months ago

@sformel-usgs, thanks for the advice. Please note that I've corrected the link above for the Tulane resource with full data - sorry!

sformel-usgs commented 7 months ago

Response from Adam Cohen:

Sorry for taking so long to get on this. I back-burnered it knowing it’d take me a while to go through. Thanks for all this research into the FoTX data on GBIF. Like Dean says, we are happy to find and correct any errors. I copied the text below from your GitHub post and discuss each.

_“several records (here's one) where the event date is 1716-01-01, which is interpreted from the original collection date 1716-01-01/2017-12-31, and the observation comes from the Tulane Museum of Natural History and from OBIS (but the OBIS URL is an old iobis one).”_

In our project we manage begin and end dates as separate fields. Since Darwin Core has the single “event Date” field we adopted the slash-separated date range for all of our records as per the Darwin Core Quick Reference Guide examples. When we lacked dates entirely we were forced to assign ranges that included the full range we assumed possible. In this case the date range is very large since we had little/no information to confidently justify a smaller range. When lacking data we often used the oldest and most recent dates in the dataset to assign a range. In some cases, when we had a collector name, we used lifespans to justify smaller ranges.

Mesibov notes that a date, locality string and collector name exist for this record (TU 5537), citing the record elsewhere online. So I’ll make a note to update our data. I believe our lack of data for these fields originates from the TU data we obtained from OBIS on March 28, 2018. I just checked our downloads and see many of the Tulane records lacking dates and none with locality strings.

_“a couple of records where the individual count is in the hundreds of thousands (this one and also this one). It looks like they are stocking records from Texas Parks and Wildlife Dept, but they still seem weird to me.”_

These are indeed stocking records from TPWD. Not sure what is wrong with this record if anything.

sformel-usgs commented 7 months ago

My response:

What you've said makes sense wrt the date ranges and the stocking records. I think the stocking events may have just seemed strange to the user because they hadn't encountered that type of event before (TBH, as a non-fish person, I hadn't either). The flag was the orders-of-magnitude difference between those occurrences and the other occurrences in the dataset.

One way you might draw some attention to that stocking context is by calling it out as a stocking record through the eventType term. Otherwise, I think the word "stocking" only appears as part of the catalog number.

As for the date ranges, I know people have been interested in a temporal uncertainty term for a while. But, in the absence of that, perhaps adding an eventRemarks value that explains the uncertainty could help. Something like, "Date range based on full possible range given the lack of temporal information."

I'm not under the illusion that either of my suggestions will make huge differences for most users, but if they're not hard to implement, then they might clarify the data a bit more for folks who have less familiarity with these fish data.
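Both suggestions could be applied mechanically before publishing. The following is a minimal sketch, not how Fishes of Texas actually manages its data: the 50-year threshold, the `annotate` helper, the `"stocking"` value for eventType, and the stocking catalog number are all assumptions made for illustration. It only handles full `YYYY-MM-DD` dates, alone or as a slash-separated range.

```python
from datetime import date

# Hypothetical threshold: ranges wider than this get an explanatory remark.
MAX_SPAN_YEARS = 50
RANGE_REMARK = ("Date range based on full possible range given the lack of "
                "temporal information.")


def span_years(event_date: str) -> float:
    """Width in years of a YYYY-MM-DD date or a YYYY-MM-DD/YYYY-MM-DD range."""
    start_text, _, end_text = event_date.partition("/")
    start = date.fromisoformat(start_text)
    # A single date has no "/" part, so it is treated as a zero-width range.
    end = date.fromisoformat(end_text) if end_text else start
    return (end - start).days / 365.25


def annotate(record: dict) -> dict:
    """Return a copy of the record with eventRemarks/eventType added where relevant."""
    out = dict(record)
    if span_years(out["eventDate"]) > MAX_SPAN_YEARS:
        out["eventRemarks"] = RANGE_REMARK
    # Assumption: stocking records can be recognised from the catalog number.
    if "stocking" in out.get("catalogNumber", "").lower():
        out["eventType"] = "stocking"
    return out


# Illustrative records; the stocking catalog number is made up.
wide = annotate({"catalogNumber": "TU 5537",
                 "eventDate": "1716-01-01/2017-12-31"})
stocked = annotate({"catalogNumber": "TPWD Stocking 123",
                    "eventDate": "1990-05-01"})
```

Here `wide` gains the eventRemarks text and `stocked` gains an eventType, while neither record is otherwise changed.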

MattBlissett commented 7 months ago

_“As for the date ranges, I know people have been interested in a temporal uncertainty term for a while. But, in the absence of that, perhaps adding an eventRemarks value that explains the uncertainty could help. Something like, "Date range based on full possible range given the lack of temporal information."”_

We have fixed the longstanding issue about GBIF's handling of date ranges in the eventDate term, and now accept them. Some minor cases still need improvement (e.g. when eventDate has a year+month+day, the year field matches, but month and/or day are empty) but in general it's a big improvement.

The record from above now shows 1716-01-01/2017-12-31 in the interpreted view: https://www.gbif.org/occurrence/4104824611

sformel-usgs commented 7 months ago

Thanks for pointing that out @MattBlissett. I would still find it comforting to have an explanation somewhere for such a wide-ranging event for a single fish. Basically, something to assure me that these unusual values aren't an accident.

Mesibov commented 7 months ago

Here is the original Tulane record from https://fishair.org/ipt/resource?r=tu_fish for https://www.gbif.org/occurrence/4104824611:

| id | collectionID | institutionCode | collectionCode | basisOfRecord | occurrenceID | catalogNumber | recordedBy | individualCount | occurrenceStatus | eventDate | islandGroup | island | country | locality | decimalLatitude | decimalLongitude | coordinateUncertaintyInMeters | scientificNameID | scientificName | family |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FN2-55834896-19-TU | 19 | TU | Fish | PreservedSpecimen | FN2-55834896-19-TU | 5537 | A. Carrere | 1 | present | 1952-01-11 | | | United States of America | Gulf of Mexico off mouth of Oyster Bayou. | 29.21445 | -91.13111 | | urn:lsid:marinespecies.org:taxname:159453 | Syngnathus louisianae | Syngnathidae |

The date range issue is interesting, but if the record had been correctly imported into the FOTX dataset, the date 1952-01-11 would have appeared in both FOTX https://www.fishesoftexas.org/specimen/TU_5537 and GBIF.

Adam Cohen writes: "Mesibov notes that a date, locality string and collector name exist for this record (TU 5537), citing the record elsewhere online. So I’ll make a note to update our data. I believe our lack of data for these fields originates from the TU data we obtained from OBIS on March 28, 2018. I just checked our downloads and see many of the Tulane records lacking dates and none with locality strings."

This case is a good example of a minor data quality issue in a single record suggesting that something major has gone wrong in a data pipeline, that more than one record is likely to have been affected, and that the fix needs to start at the pipeline problem, not downstream.
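A pipeline-level check along these lines could compare the source records with the derived ones and list every record where a populated source field went missing or changed. This is a rough sketch under stated assumptions: the function name `lost_fields`, the choice of `catalogNumber` as the join key, and the derived row's empty locality and recordedBy values are illustrative, not taken from the FoTX import code or data.

```python
CHECKED_FIELDS = ("eventDate", "locality", "recordedBy")


def lost_fields(source_rows, derived_rows, key="catalogNumber",
                fields=CHECKED_FIELDS):
    """Map each shared key to the fields whose source value did not survive.

    A field is reported when the source has a value and the derived record
    is empty or holds something different.
    """
    source_by_key = {row[key]: row for row in source_rows}
    report = {}
    for row in derived_rows:
        source = source_by_key.get(row[key])
        if source is None:
            continue  # no matching source record to compare against
        lost = [f for f in fields
                if source.get(f) and row.get(f) != source.get(f)]
        if lost:
            report[row[key]] = lost
    return report


# Source values are from the Tulane record quoted above.
tulane = [{"catalogNumber": "5537",
           "eventDate": "1952-01-11",
           "locality": "Gulf of Mexico off mouth of Oyster Bayou.",
           "recordedBy": "A. Carrere"}]
# Derived row: the date range is the one GBIF shows; the empty strings are
# an assumption standing in for the missing locality and collector.
fotx = [{"catalogNumber": "5537",
         "eventDate": "1716-01-01/2017-12-31",
         "locality": "",
         "recordedBy": ""}]

report = lost_fields(tulane, fotx)
```

Run over a whole import, the size of `report` gives a quick sense of how many records were affected, which is the point of starting at the pipeline rather than fixing records one at a time.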