gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Issues raised in "An audit of some processing effects in aggregated occurrence records" #62

Open MattBlissett opened 6 years ago

MattBlissett commented 6 years ago

Bob Mesibov's paper, https://doi.org/10.3897/zookeys.751.24791

I'm leaving the taxonomy issues for later (or for @mdoering), but looking at the other issues in turn.

The GBIF NZAC dataset had 1186 pairs of duplicate records, in each case with one record with modified date “2016-11-11” and the other with “2017-05-09” (see Results for a likely explanation). I deleted the earlier record versions, reducing the NZAC dataset to 102,092 records.

There are indeed 1186 duplicate catalogue numbers (occurrence.txt is tab-delimited, which is the default delimiter for cut): cut -f 43 occurrence.txt | sort | uniq -d | nl

But there are no duplicate occurrenceIds: cut -f 68 occurrence.txt | sort | uniq -d
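The same duplicate check can be done by column name rather than column position, which avoids counting fields by hand. This is only a minimal sketch: it assumes the download's occurrence.txt is tab-delimited with a header row containing catalogNumber and occurrenceID columns, and the helper names and sample values are made up for illustration.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, Iterator


def duplicate_values(rows: Iterable[dict], field: str) -> Dict[str, int]:
    """Return {value: count} for non-empty values of `field` seen more than once."""
    counts = Counter(row[field] for row in rows if row.get(field))
    return {value: n for value, n in counts.items() if n > 1}


def read_rows(path: str) -> Iterator[dict]:
    """Yield the rows of a tab-delimited file with a header line as dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data.
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Small in-memory example with invented values.
sample = [
    {"occurrenceID": "a1", "catalogNumber": "X001"},
    {"occurrenceID": "a2", "catalogNumber": "X001"},
    {"occurrenceID": "a3", "catalogNumber": "X002"},
]
print(duplicate_values(sample, "catalogNumber"))  # {'X001': 2}
print(duplicate_values(sample, "occurrenceID"))   # {}
```

Against a real download, something like duplicate_values(read_rows("occurrence.txt"), "catalogNumber") should reproduce the list of duplicated catalogue numbers.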

These also exist in the source DWCA from the publisher. An example pair: https://www.gbif.org/occurrence/1325923162 and https://www.gbif.org/occurrence/1503321502

It's not necessarily wrong to duplicate a catalogue number. A catalogued item (e.g. herbarium sheet, stomach sample) may contain multiple specimens, perhaps unknown when the item was prepared. I'm not sure if this should be a warning-level issue when detected in the dataset.

In this case, the source website only shows one occurrence for the catalogue number, so the publisher / collection probably has a problem: https://scd.landcareresearch.co.nz/Specimen/NZAC04127737

[Downloads] The associatedMedia, geodeticDatum, verbatimCoordinates, verbatimLatitude, verbatimLongitude and scientificNameAuthorship fields are dropped without replacement during processing, for unknown reasons.

I think associatedMedia is dropped because the media has been moved to a multimedia extension (multimedia.txt). verbatim{Coordinates, Latitude, Longitude} are dropped as they are always unchanged from the field in verbatim.txt. scientificNameAuthorship is dropped because the authorship is included in the scientificName.

geodeticDatum is dropped because decimal{Latitude, Longitude} have been reprojected to EPSG:4326, but we could add a default element: <field default="EPSG:4326" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
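For context, a field element in meta.xml that has a default attribute but no index declares a constant that applies to every row, so a consumer would see geodeticDatum without a column being added to occurrence.txt. Below is a minimal sketch of how a reader might pick up such constants. The meta.xml fragment is invented for illustration (the column indexes are hypothetical), and constant_fields is not part of any GBIF library.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text (meta.xml) descriptor.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Hypothetical fragment: only the geodeticDatum line comes from the proposal above.
META_FRAGMENT = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field default="EPSG:4326" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
  </core>
</archive>"""


def constant_fields(meta_xml: str) -> dict:
    """Return {term: default} for fields that have a default but no column index."""
    root = ET.fromstring(meta_xml)
    constants = {}
    for field in root.iter(DWC_TEXT_NS + "field"):
        if "default" in field.attrib and "index" not in field.attrib:
            constants[field.attrib["term"]] = field.attrib["default"]
    return constants


print(constant_fields(META_FRAGMENT))
# {'http://rs.tdwg.org/dwc/terms/geodeticDatum': 'EPSG:4326'}
```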

The country field is dropped but its items are processed (with additions, corrections or exclusions) into countryCode in occurrence.txt.

The country has been processed, and will always match the countryCode in the interpreted view.

Minimum and maximum depth and elevation are recalculated by GBIF during processing. In occurrence.txt, minimumDepthInMeters and maximumDepthInMeters are replaced by depth and depthAccuracy, where “depth” is either the single depth value supplied, or the mean of the supplied minimum and maximum, and “depthAccuracy” is the average deviation from the mean. minimumElevationInMeters and maximumElevationInMeters are similarly replaced by elevation and elevationAccuracy.

These are GBIF terms, not part of DWC. I don't know why we do this.
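To make the quoted description concrete, here is a small sketch of the recalculation as the paper describes it: the value is the mean of the minimum and maximum, and the accuracy is half the range. This is not taken from the GBIF processing code. The function name is hypothetical, and returning no accuracy when only one value is supplied is an assumption.

```python
from typing import Optional, Tuple


def summarise_range(
    minimum: Optional[float], maximum: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Collapse a min/max pair into (value, accuracy) as described in the paper."""
    if minimum is None and maximum is None:
        return None, None
    if minimum is None or maximum is None:
        # Only one value supplied: use it as-is.
        # Assumption: no accuracy can be derived from a single value.
        single = minimum if minimum is not None else maximum
        return single, None
    mean = (minimum + maximum) / 2
    # For two values, the average deviation from the mean is half the range.
    return mean, abs(maximum - minimum) / 2


print(summarise_range(10, 30))    # (20.0, 10.0)
print(summarise_range(5, None))   # (5, None)
```

With minimumDepthInMeters = 10 and maximumDepthInMeters = 30, this gives depth = 20 and depthAccuracy = 10. The original pair can be recovered from those two numbers, but a user has to know to do so.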

GBIF adds genericName and species fields to its processed tables. The terms are defined by GBIF [...] The first field is “The genus part of the scientific name”, yet in many MV records genericName contains a non-genus name. The species field contains “The canonical name without authorship of the accepted [processed] species” and seems to be the same as the species field in the recommended GBIF download. I ignored the genericName and species fields in the audit.

[Handle later.]

I also found that there are ALA fields populated with data items with the corresponding GBIF fields completely blank. These are not losses due to processing, since the fields are also blank in the verbatim.txt file. The field contents were evidently not supplied to GBIF, either by ALA, which acts as Australia’s GBIF node, or by the data provider.

Yes, I'm not checking this as I know it's the case, but it's an issue for ALA.

Taxon names

[Investigate later.]

Losses of date information were common and evidently due to processing rules written to deal with various date formats. In the modified field in the NZAC dataset, for example, GBIF successfully parsed 4765 entries in YYYY-MM-DDTHH:MM:SS+12:00 format, but deleted 97,327 entries in YYYY-MM-DDTHH:MM:SS.sss+12:00 format (95% data loss).

Recorded as https://github.com/gbif/parsers/issues/12
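As an illustration of the two timestamp shapes involved, the sketch below accepts both the plain-seconds and the fractional-seconds form. It uses Python's standard library only and says nothing about how gbif/parsers is implemented. The example timestamps are invented, and parse_modified is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Optional

# The two layouts named in the paper, with and without fractional seconds.
# %z accepts offsets written as +12:00 in Python 3.7 and later.
FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f%z",
)


def parse_modified(value: str) -> Optional[datetime]:
    """Try each known layout in turn; return None if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


plain = parse_modified("2016-11-11T08:15:00+12:00")
fractional = parse_modified("2017-05-09T08:15:00.123+12:00")
print(plain is not None, fractional is not None)  # True True
print(fractional.microsecond)                     # 123000
print(fractional.utcoffset() == timedelta(hours=12))  # True
```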

[ALA handling date intervals]

GBIF also doesn't handle date intervals, see https://github.com/gbif/portal-feedback/issues/652

GBIF processed most names without change in the recordedBy and identifiedBy fields, but excluded 443 recordedBy entries representing 119 unique name strings in the MV dataset. A check of several of the excluded entries shows that they were accepted by GBIF in other records. For example, “Peter K. Lillywhite - Museum Victoria” was excluded in four records but accepted in 233 others. This inconsistent processing resulted in a loss of <1% of valid data items.

In download 0015687-171219132708484, "Peter K. Lillywhite - Museum Victoria" (and only that) is in 270 records in both verbatim.txt and occurrence.txt.
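A check like this can be generalised to list every record where a field has a value in verbatim.txt but is empty in occurrence.txt. The following is a rough sketch, assuming both files are tab-delimited with header rows and share a gbifID key column. The helper names and sample rows are invented.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_column(handle: TextIO, key_field: str, field: str) -> Dict[str, str]:
    """Map each record key to its value for `field` (empty string if absent)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[key_field]: (row.get(field) or "") for row in reader}


def lost_in_processing(verbatim: Dict[str, str], interpreted: Dict[str, str]) -> List[str]:
    """Keys whose verbatim value is non-empty but whose interpreted value is empty or missing."""
    return sorted(key for key, value in verbatim.items() if value and not interpreted.get(key))


# Invented sample data standing in for verbatim.txt and occurrence.txt.
verbatim_txt = "gbifID\trecordedBy\n1\tA. Collector\n2\tB. Collector\n3\t\n"
occurrence_txt = "gbifID\trecordedBy\n1\tA. Collector\n2\t\n3\t\n"

verbatim = load_column(io.StringIO(verbatim_txt), "gbifID", "recordedBy")
interpreted = load_column(io.StringIO(occurrence_txt), "gbifID", "recordedBy")
print(lost_in_processing(verbatim, interpreted))  # ['2']
```

For the download above, an empty result for recordedBy would be consistent with the counts matching in both files.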

it seems unlikely that ALA and GBIF programming staff or contractors have systematically compared original and processed data to look for problems in selected fields, as I did

it is hard to understand how losing a taxon name through fail-matching or up-matching improves an occurrence record

We changed the portal to show the verbatim name where we didn't have a match. Should we do that for downloads too? Just the CSV ones? Or should we better document the structure of a download, so users know to use verbatim.txt?

aggregators to include both original and processed taxonomic data items in each record

Which we do, but that is only apparent if you know that the row in verbatim.txt is also part of the record in the DWCA.

@timrobertson100, comments on the background to any decisions welcome! Bob is presenting his paper at 11:20 on Thursday (01:20 in Copenhagen): https://spnhctdwg18.sched.com/event/G4V4/look-what-theyve-done-to-our-data-how-aggregators-change-data-items-in-collection-records

MattBlissett commented 6 years ago

Taxon names:

It's possible this is old, or only applied to ALA, but deserves investigation and perhaps regular checking: