CatalogueOfLife / data

Repository for COL content

Misparsed authorship strings containing two-letter author names #176

Open aoern opened 3 years ago

aoern commented 3 years ago

@yroskov @gdower There are 202 erroneous authorship strings in the Sep 1 edition (DwC) due to misparsing of two-letter author names. Some examples:

- Duguetia ruboides H.e.Maas in AnnonBase (should be Maas & He)
- Alara improba W.u.Yang, 1993 in FLOW (should be Yang & Wu)
- Acrolithus brevis M.a.Freytag, 1988 in MOWD (should be Freytag & Ma)

The complete list is here: CoLTwoLetterErrors.xlsx
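For anyone wanting to re-check an export for the same problem, the pattern in the examples above (an uppercase initial, a lowercase "initial", then a surname) can be flagged with a regular expression. This is only a rough sketch: the pattern and the function name `suspect_two_letter_author` are assumptions made for illustration, not how the attached list was produced, and the pattern may give false positives on legitimate initials.

```python
import re

# Assumed signature of the misparse: uppercase initial, lowercase "initial",
# then a capitalised surname, e.g. "H.e.Maas" (hypothetical heuristic).
MISPARSED = re.compile(r"\b([A-Z])\.([a-z])\.\s?([A-Z][\w'-]+)")


def suspect_two_letter_author(authorship: str):
    """Return (two_letter_name, surname) if the string looks misparsed, else None."""
    match = MISPARSED.search(authorship)
    if match is None:
        return None
    # Rejoin the split letters into the probable two-letter author name.
    return match.group(1) + match.group(2), match.group(3)


for example in ["H.e.Maas", "W.u.Yang, 1993", "M.a.Freytag, 1988", "A.B.Smith, 1990"]:
    print(example, "->", suspect_two_letter_author(example))
```

The sketch only flags candidates; the correct author order (for example Maas & He) still has to come from the source dataset.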

mdoering commented 3 years ago

It looks like not all source datasets have been reimported and resynced since we changed the code to keep the exact verbatim authorship. The given example, Duguetia ruboides from AnnonBase, still dates from 2019: https://api.catalogue.life/dataset/1040/taxon/t48022

You can see the author is still badly parsed, but the important authorship string does not change: http://api.catalogue.life/parser/name?name=Abies&authorship=Maas%20%26%20He
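To try other authorship strings against the same parser endpoint, the query URL can be built as in the sketch below. The endpoint and the `name` and `authorship` parameters are taken from the link above; the helper name `parser_url` is made up for this example, and nothing is assumed here about the shape of the response.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the link above.
PARSER_ENDPOINT = "http://api.catalogue.life/parser/name"


def parser_url(name: str, authorship: str) -> str:
    """Build a parser query URL, percent-encoding spaces and ampersands."""
    # quote_via=quote encodes spaces as %20 rather than "+", matching the link above.
    query = urlencode({"name": name, "authorship": authorship}, quote_via=quote)
    return f"{PARSER_ENDPOINT}?{query}"


print(parser_url("Abies", "Maas & He"))
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen`.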

mdoering commented 3 years ago

The two-letter author error came into existence because of this: https://github.com/gbif/name-parser/issues/28. I will make sure that ampersands are excluded and dots are required for the gbif/name-parser#28 patch to apply.
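One possible reading of that guard, as a sketch only: skip the initials handling whenever the verbatim authorship contains an ampersand, and apply it only when the initials are actually written with dots. The name-parser itself is a Java library and its real code is not shown here; the function name `initials_patch_applies` and the regular expression are hypothetical.

```python
import re

# Hypothetical pattern: a dotted pair of initials such as "H.E." or "H. E."
DOTTED_INITIALS = re.compile(r"\b[A-Z]\.\s?[A-Za-z]\.")


def initials_patch_applies(verbatim_authorship: str) -> bool:
    """Decide whether the initials handling may touch this authorship (assumed rule)."""
    # An ampersand signals an author team ("Maas & He"), so leave the names alone.
    if "&" in verbatim_authorship:
        return False
    # Require explicit dots so that undotted two-letter names like "He" are not split.
    return DOTTED_INITIALS.search(verbatim_authorship) is not None


for example in ["Maas & He", "Yang & Wu, 1993", "H.E.Maas", "He Maas"]:
    print(example, "->", initials_patch_applies(example))
```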