gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Identifications in an extension are not interpreted #435

Open dshorthouse opened 3 years ago

dshorthouse commented 3 years ago

I have tried several times this week to download using the Bionomia format and then process the records. My first check is to discover which datasets, if any, have dropped significantly in record count relative to the records processed two weeks prior, because such a drop would otherwise wipe out the work people have done in attributing specimen records to collectors/determiners.
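The check described above could be sketched roughly as follows. This is only an illustration, not Bionomia's actual code: the function name `find_dropped_datasets`, the 50% threshold, and the dataset keys in the usage example are all assumptions, and the per-dataset counts are assumed to have been tallied already from the two downloads.

```python
from typing import Dict, List, Tuple


def find_dropped_datasets(
    previous: Dict[str, int],
    current: Dict[str, int],
    threshold: float = 0.5,  # assumed cut-off: flag drops of more than 50%
) -> List[Tuple[str, int, int]]:
    """Return (dataset_key, previous_count, current_count) for every dataset
    whose record count fell by more than `threshold` (as a fraction).

    A dataset that is absent from `current` counts as having zero records,
    so datasets missing entirely from the new download are always flagged.
    """
    dropped = []
    for key, before in sorted(previous.items()):
        if before <= 0:
            continue  # nothing to lose, and avoids division by zero
        after = current.get(key, 0)
        if (before - after) / before > threshold:
            dropped.append((key, before, after))
    return dropped


# Hypothetical counts keyed by made-up dataset identifiers.
two_weeks_ago = {"dataset-a": 100, "dataset-b": 100, "dataset-c": 10}
this_week = {"dataset-a": 95, "dataset-c": 2}

for key, before, after in find_dropped_datasets(two_weeks_ago, this_week):
    print(f"{key}: {before} -> {after}")
```

Treating a missing dataset as a count of zero means a dataset that has vanished from the download index is reported alongside datasets that merely shrank.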

The following dataset appears to require re-harvesting at your end because it is entirely missing from the download index or has some other peculiar issue. I do see that many (all?) records are flagged as "incertae sedis" whereas there is a determination to species in an identification history extension. However, there is now nothing in scientificName. Perhaps this too is new.

https://gbif.org/dataset/7377c214-e7f1-4fc0-a9de-3b85728ccc11

MattBlissett commented 3 years ago

Thanks for reporting this.

The previous version of this dataset (before 6 November) didn't use the Identification extension; it had everything on the main Occurrence core. The current version has identification terms (scientificName etc.) in both the core and the Identification extension.

CC @jholetschek as this seems like a bug in BioCASe

We don't use the Identification extension during our interpretation, which is why everything appears as incertae sedis.

I've removed the DWCA endpoint, which will cause a re-harvest using the ABCD archive. We do look at the identification elements in ABCD records. That should restore the records, and we either need to look at the extension data, or BioCASe should populate the terms in the occurrence core with the primary identification.
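To illustrate the second option (populating the occurrence core from the primary identification), here is a minimal sketch. It is not how GBIF pipelines or BioCASe implement this. Records are modelled as plain dictionaries keyed by Darwin Core term names (`scientificName`, `identifiedBy`, `dateIdentified`); the `preferred` key, the function names, and the assumption that dates are ISO 8601 strings are all hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def pick_primary_identification(identifications: List[Record]) -> Optional[Record]:
    """Choose one identification row to act as the primary one.

    Rows carrying the hypothetical flag preferred == "true" win; among the
    remaining candidates the latest dateIdentified is taken. Dates are
    assumed to be ISO 8601 strings, so string comparison orders them.
    """
    if not identifications:
        return None
    preferred = [i for i in identifications if i.get("preferred", "").lower() == "true"]
    candidates = preferred or identifications
    return max(candidates, key=lambda i: i.get("dateIdentified", ""))


def fill_core_from_extension(core: Record, identifications: List[Record]) -> Record:
    """Return a copy of `core`, filled from the primary identification
    only when the core has no scientificName of its own."""
    result = dict(core)
    if result.get("scientificName", "").strip():
        return result  # the core already carries an identification
    primary = pick_primary_identification(identifications)
    if primary:
        for term in ("scientificName", "identifiedBy", "dateIdentified"):
            value = primary.get(term, "").strip()
            if value:
                result[term] = value
    return result


# Made-up example: an empty core name and two extension rows.
core = {"occurrenceID": "example-1", "scientificName": ""}
extension_rows = [
    {"scientificName": "Aus bus", "dateIdentified": "2001-05-01"},
    {"scientificName": "Aus cus", "dateIdentified": "1990-01-01", "preferred": "true"},
]
print(fill_core_from_extension(core, extension_rows)["scientificName"])
```

With a fallback like this, a record whose core scientificName is empty would no longer have to be interpreted as incertae sedis as long as the extension holds a usable determination.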

Note that large downloads are made from a static table, which is regenerated every day. The next process to regenerate that table will probably be completed at about 06:30 UTC tomorrow.

dshorthouse commented 3 years ago

Thanks for this @MattBlissett. I'm aware of the lag for large downloads, so I will trigger one after lunchtime EST tomorrow. I suspect the identification history extension is not necessarily related to the reason this dataset was missing from the large download index; I merely noticed it when trying to figure out what was going on.

jholetschek commented 3 years ago

Thanks for reporting, @MattBlissett. This was caused by a missing "preferred" flag in the ABCD mapping. I've updated both archives, but harvesting ABCD should also have solved it.

jholetschek commented 3 years ago

There was also a bug in the ABCD > DwC-A transformation, which I've fixed.