gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Species name update #611

Closed Andreas-Bio closed 7 years ago

Andreas-Bio commented 7 years ago

Sorry wrong repo. This belongs here. https://github.com/gbif/portal16/issues/631

MortenHofft commented 7 years ago

Closing gbif/portal16#631 in favour of this. Original issue below:


So there seems to be a logical flaw in the update routine (the fuzzy matching is too forgiving). I don't know exactly what is happening behind the scenes, so the impact could be anywhere from minimal to large.

2014: COL first lists Atriplex northusana as a species. http://www.catalogueoflife.org/annual-checklist/2015/details/species/id/15adc98e2ea27307681beae3478f2495

2015: GBIF imports the record and gives it an accepted status. A species "homepage" is generated.

2016: The name gets removed and changed (for whatever reason). ITIS and COL change the name (and, the way COL works, it also gives a new species ID to Atriplex northusanum Weim.). Source: http://onlinelibrary.wiley.com/doi/10.1002/fedr.19120112104/full

The old record cannot be found any more in these databases! (You can grab the problem by the roots here.)

However, https://www.gbif.org/species/7977331 only updates the references and still lists the 2015 COL dump as a primary source, although this dump is outdated and contains potentially invalid names that have since been corrected. BUT the fuzzy matching routine matches all new names to the existing old species "homepage"...

Ways to prevent this:

  1. If a new COL version gets released, the new lists get priority over already existing names from the same data source.
  2. Give me a fat red warning at the top of the page. It is not realistic to expect people to check every name in the reference list for mismatches.
  3. Majority vote. If the majority of references cite a different name, update the species backbone, or give me a visual warning if there is a mismatch in the "reference" list.
  4. Disable fuzzy matching completely. (It causes more harm than good.)

Thanks for reading this late night gibberish!
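The majority-vote suggestion above could look roughly like the following. This is only a sketch of the idea, not how GBIF builds its backbone: the function names `majority_name` and `needs_warning` are hypothetical, and the names passed in are toy input.

```python
from collections import Counter


def majority_name(reference_names):
    """Return the name cited by a strict majority of references, else None."""
    if not reference_names:
        return None
    name, votes = Counter(reference_names).most_common(1)[0]
    # Require more than half of all references, not just the largest share.
    return name if votes > len(reference_names) / 2 else None


def needs_warning(backbone_name, reference_names):
    """True when a majority of references disagree with the backbone name."""
    winner = majority_name(reference_names)
    return winner is not None and winner != backbone_name


# Toy input (assumption): two of three references use the newer spelling.
refs = ["Atriplex northusanum", "Atriplex northusanum", "Atriplex northusana"]
print(needs_warning("Atriplex northusana", refs))
```

A strict majority is used so that an even split between two names does not trigger a change or a warning.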

Summary: Name changes in source databases are not consistently propagated to GBIF. GBIF retains the old legacy dataset and lists the new name as a fuzzy match to the old name, although the old name no longer exists in any database except GBIF. The longer this goes on, the more errors will accumulate. This is a systematic error and cannot be fixed record by record.

TLDR: If a species name is no longer listed, but was listed before in the same dataset, some action must be taken.
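As an illustration of the check described in the TLDR, here is a minimal sketch that compares two releases of the same checklist and reports names that disappeared. The function name `dropped_names` and the two name lists are hypothetical toy data mirroring the case as reported above, not an actual COL export.

```python
def dropped_names(previous_release, current_release):
    """Return names present in the previous release but absent from the current one."""
    return sorted(set(previous_release) - set(current_release))


# Toy releases (assumption): one name was replaced between versions.
release_old = ["Atriplex northusana", "Atriplex patula"]
release_new = ["Atriplex northusanum", "Atriplex patula"]

for name in dropped_names(release_old, release_new):
    # Each dropped name is a candidate for review, a warning, or a status change.
    print(f"needs action: {name}")
```

An exact set difference is used on purpose: a fuzzy comparison would treat the old and new spellings as the same name and hide exactly the change this check is meant to surface.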

mdoering commented 7 years ago

Names do change in GBIF. The issue here seems to be that we simply have not updated our backbone since early this year and CoL must have still used the wrong name then.

Accepted in:

  1. CoL 2016 Annual Checklist: http://www.catalogueoflife.org/annual-checklist/2016/search/scientific/genus/Atriplex/species/northusana/match/1
  2. CoL 2017 Annual Checklist: http://www.catalogueoflife.org/annual-checklist/2017/search/scientific/genus/Atriplex/species/northusana/match/1

It is present as a synonym in the CoL post 2017, but not released as AC yet: http://www.catalogueoflife.org/col/search/all/key/Atriplex+northusanum/fossil/0/match/1

In my eyes the solution to the problem would be to update our backbone more frequently, e.g. monthly. Unfortunately this is currently a very costly procedure that takes 10 days of work, during which we have to stall occurrence and checklist indexing. This needs much better automation and quicker update times, mostly on the occurrence index (hbase) and its derivatives (solr, hive, maps).

mdoering commented 7 years ago

Created a new issue to actually update our backbone: #628

rdmpage commented 7 years ago

@mdoering Does rebuilding the backbone mean that the ENTIRE occurrence index is rebuilt? If so, wouldn't it be possible to re-index only those parts where the backbone actually changes? In other words, do a diff on the old and new backbone, and for those taxa whose composition has changed, reindex just the affected occurrences.
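A rough sketch of that diff-and-reindex idea, under the simplifying assumption that each backbone build can be represented as a mapping from taxon key to the set of names it contains, and that each occurrence is matched to exactly one taxon key. The keys, occurrence IDs and function names are hypothetical, and this ignores fuzzy matching entirely.

```python
def changed_taxa(old_backbone, new_backbone):
    """Return taxon keys whose set of names differs between two backbone builds."""
    all_keys = old_backbone.keys() | new_backbone.keys()
    empty = frozenset()
    return {
        key
        for key in all_keys
        if old_backbone.get(key, empty) != new_backbone.get(key, empty)
    }


def occurrences_to_reindex(occurrence_taxon, changed):
    """Return IDs of occurrences currently matched to a changed taxon."""
    return [occ for occ, taxon in occurrence_taxon.items() if taxon in changed]


# Toy backbone builds (assumption): taxon 101 gains a synonym, 102 is unchanged.
old = {101: frozenset({"Atriplex northusana"}), 102: frozenset({"Atriplex patula"})}
new = {
    101: frozenset({"Atriplex northusana", "Atriplex northusanum"}),
    102: frozenset({"Atriplex patula"}),
}
occurrences = {"occ-1": 101, "occ-2": 102, "occ-3": 101}

print(occurrences_to_reindex(occurrences, changed_taxa(old, new)))
```

Note that this only finds occurrences already matched to a changed taxon; records that were previously unmatched, or that would now match a different taxon, are not covered by this sketch.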

mdoering commented 7 years ago

In theory partial updates could work, but yes, in practice we do update the entire index and reprocess all our occurrence data, especially since the updates are not that frequent and therefore not that small. Also, with fuzzy matching it is harder to predict which occurrence records are affected and which are not.