gbif / checklistbank

GBIF Checklist Bank
Apache License 2.0

Taxon Match Higher Rank interpretation places IAS incorrectly #69

Open jlegind opened 6 years ago

jlegind commented 6 years ago

The verbatim name "Monomorium monomorium group" in the scientificName field of the 'IZIKO museum collections dating from 1800 - 2013' dataset [1] was interpreted as "Monomorium monomorium Bolton, 1987", with "TAXON_MATCH_HIGHERRANK" recorded in the "issue" column.

This has caused some confusion since the Monomorium monomorium species would be considered invasive in South Africa.

Clearly, stripping the 'group' label enabled this particular species name match. Peter Hawkes, who spotted the issue, suggested a stricter validation regimen involving better flagging, feedback to the publisher, and possible rejection based on taxon name interpretation.

[1] https://www.gbif.org/dataset/7b468209-8d5c-477b-b118-cdf04dde9912

mdoering commented 6 years ago

What exactly would be the desired match in this case? If "group" is meant to refer to a vague set of species, this sounds more like an occurrence identification interpretation issue than strictly a species matching one. Identification to such artificial groups of names, or even just to a pair of species when the determination is unsure, is a frequently needed feature that GBIF and Darwin Core are missing.

PeterHawkes commented 6 years ago

In this case, the correct fix would have been to cut the IDs right back to just Monomorium (with "M. monomorium-group" perhaps being indicated under "identificationRemarks"), since the records all represent specimens identified only as far as a species group (in this case a large one of about 70 species in the Afrotropical region alone) within the genus Monomorium. Removing only the word "group" and appending the author and date for the species Monomorium monomorium meant that 602 records of unidentified Monomorium - probably representing dozens of different species from many localities - were all assigned the incorrect identification as a single invasive species not yet recorded from South Africa. The same kind of erroneous fix was applied to another 900 records in 8 other genera in this dataset...with no-one being aware and the data being publicly accessible for the past two years or more.

We cannot expect GBIF (at this stage at least) to have a completely accurate and up-to-date taxonomic structure in place for all taxonomic groups, so expecting perfect corrections would be unreasonable, which is why I think the data provider would be better positioned to decide on what the right correction should be.

My feeling is thus that the data provider should be alerted to any corrections made to their data - especially something as important as the identification - 1) so that they can ensure that the correct correction has been made on GBIF and 2) so that they can correct the data in their own database to prevent the same correction (whether correct or not) being re-done indefinitely and to improve the accuracy of their own database - which will be accessed both internally and externally via other routes (which will otherwise mean that researchers accessing the data via differing routes will obtain conflicting information).

mdoering commented 6 years ago

I agree matching to the genus would be best in this case. That would obviously require understanding the meaning of "group". A hint in TaxonRank would be good. Searching for "group" in the GBIF Backbone returns quite a few surprises; I wonder if there is any legitimate epithet called "group" in any scientific name. Our name parser should probably detect a binomial with "group" appended and classify it as a species aggregate rank. Then we can a) exclude these names from the Backbone and b) deal with them properly during name matching, avoiding matches to the exact species.
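
A minimal sketch of the detection described above, assuming a plain regular expression rather than the actual GBIF name parser. The names parse_group_name, match_target and the rank label "SPECIES_AGGREGATE" are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    genus: str
    epithet: str
    rank: str  # hypothetical label, not taken from the GBIF parser


# A genus and epithet followed by "group", separated by whitespace or a hyphen,
# e.g. "Monomorium monomorium group" or "Monomorium monomorium-group".
_GROUP_RE = re.compile(r"^([A-Z][a-z]+)\s+([a-z]+)[\s-]+group\.?$")


def parse_group_name(verbatim: str) -> Optional[ParsedName]:
    """Return a species-aggregate parse, or None if the name is not a '... group' name."""
    m = _GROUP_RE.match(verbatim.strip())
    if m is None:
        return None
    return ParsedName(m.group(1), m.group(2), "SPECIES_AGGREGATE")


def match_target(verbatim: str) -> str:
    """Name to hand to the matcher: the genus alone for group names, else the input unchanged."""
    parsed = parse_group_name(verbatim)
    return parsed.genus if parsed else verbatim


print(match_target("Monomorium monomorium group"))
```

With this rule a group name is matched no lower than its genus, while ordinary binomials with authorship pass through untouched.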

As for notifications, GBIF does not send any, but we do flag all these interpretation warnings to both users and publishers. They are therefore accessible for review, and they also cover things like interpreted coordinates, event time and fuzzy name matching. I wonder if publishers would really like to be informed about all their potential issues by email. @timrobertson100, maybe we should consider an optional notification hook that one can register with that sends out summaries about the GBIF indexing for a specific dataset via email?
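
For reviewing the flags on a dataset, a rough sketch that builds an occurrence search URL filtered by issue. It assumes the public GBIF occurrence search endpoint and its datasetKey, issue and limit query parameters; these should be checked against the current API documentation. The function name flagged_records_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public GBIF occurrence search API.
API = "https://api.gbif.org/v1/occurrence/search"


def flagged_records_url(dataset_key: str, issue: str, limit: int = 0) -> str:
    """Build a search URL for records in one dataset that carry a given issue flag."""
    query = urlencode({"datasetKey": dataset_key, "issue": issue, "limit": limit})
    return f"{API}?{query}"


# Dataset key taken from the link [1] in the first comment.
url = flagged_records_url(
    "7b468209-8d5c-477b-b118-cdf04dde9912", "TAXON_MATCH_HIGHERRANK"
)
print(url)
```

Setting limit to 0 is meant to request only the summary of the result set rather than the records themselves.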

PeterHawkes commented 6 years ago

I would be very surprised indeed if there was a legitimate epithet "group" at any level in the taxonomic hierarchy...I cannot immediately find any injunction to this effect in the ICZN, but perhaps it is considered too obvious to need spelling out. Similarly, it would never be acceptable to use the word "Genus" or "Family" as an epithet, due to the obvious confusion this would cause. So I think we can safely assume that there will never be a formally published genus or species epithet "group" - even if someone attempted to do this, the peer-review process should ensure that it does not happen. This would allow your suggestion to be applied without fear of incorrectly removing a legitimate epithet "group".

Not being familiar with how data gets to GBIF (mine does so via AntWeb, to which I regularly provide updates), I am unsure how to handle notifications. With AntWeb, where an active upload by the data provider is the only way data can be added, an upload log is generated and a link to this provided directly after the upload; it is up to the data provider whether or not they bother to check the log and/or act on its contents. I always do because it is such a valuable check on the integrity and accuracy of my own database, and because I want to know that the data I am making public is as accurate as possible. The same could be done for active uploads to GBIF, but in cases where GBIF is crawling sites regularly to extract data, I take Jan's point that sending frequent unsolicited emails might get annoying (though if the data provider acted on the reports and corrected their data, this would minimize the need for - and size of - the reports...currently my upload logs generally contain few or no actionable issues, since all problems raised previously have been dealt with)...but depending on the frequency, I do not think it would hurt for a data provider to be reminded once a month that there are outstanding issues with their data, with a link to a full report to guide them to what needs correcting. In agreeing to provide data for public dissemination, providers do need to accept responsibility for trying to ensure that the data they provide is as accurate as possible...

In a total of about 260 000 records the Iziko dataset has over 43 000 instances of "Taxon match higherrank" (16.5%). While rectifying this might seem a daunting task, in many cases it would be fairly simple to do - for the ant examples, simply copying the data from scientificName to identificationRemarks and then doing a few search & replace operations to eliminate the species epithets and "group" from the scientificName field could fix the 1500 records with just a few minutes' work.
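
The copy and search & replace steps for the ant examples could look roughly like this. It is a sketch that assumes each record is a plain dict keyed by the Darwin Core terms scientificName and identificationRemarks; the function name cut_back_to_genus and the sample record are hypothetical.

```python
import re

# Genus, species epithet, then "group" (space or hyphen separated).
_GROUP_RE = re.compile(r"^([A-Z][a-z]+)\s+[a-z]+[\s-]+group\.?$")


def cut_back_to_genus(record: dict) -> dict:
    """Move a '... group' name into identificationRemarks and keep only the genus."""
    name = record.get("scientificName", "").strip()
    m = _GROUP_RE.match(name)
    if m is None:
        return record  # not a species-group name, leave the record untouched
    fixed = dict(record)  # work on a copy so the source row is preserved
    remarks = record.get("identificationRemarks", "").strip()
    fixed["identificationRemarks"] = f"{remarks}; {name}" if remarks else name
    fixed["scientificName"] = m.group(1)
    return fixed


# Made-up sample record for illustration only.
sample = {"scientificName": "Monomorium monomorium group", "identificationRemarks": ""}
print(cut_back_to_genus(sample))
```

Existing remarks are kept and the original group name is appended after them, so no information from the provider's identification is lost.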

MattBlissett commented 6 years ago

I'll just mention that "crawl" as used by the GBIF software developers isn't a clear term.

Data publishers must actively register a dataset with GBIF. Essentially, they give us a link to a Darwin Core file (ZIP file) or similar where they have published the data. Within a few minutes, GBIF's systems will download the data and add it to our index. From then on, approximately once per week, we will re-download that file and look for any updates.

We are working with publishers and the GBIF network to improve data quality, but across tens of thousands of datasets, no one solution will work.