CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

CLB: remove portion [sic] from the names and flag these names with status "orthographic variant" #1293

Open yroskov opened 7 months ago

yroskov commented 7 months ago

CoL currently has 2,588 names with a [sic] comment: https://www.catalogueoflife.org/data/search?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=extinct&facet=environment&limit=50&offset=0&q=sic&sortBy=taxonomic

The Latin comment sic ("so", "thus") is widely used in zoology to indicate a misspelled name.

Quite often, species names containing a [sic] portion are recognized in CLB as trinomials, and subspecies as quadrinomials (with the corresponding truncation). All of this creates problems.

It would be nice if CLB automatically removed the [sic] portion from the names and gave them the name status "orthographic variant" or "misspelling".

yroskov commented 7 months ago

It looks like the handling of [sic] names needs coordination with @gdower.

mdoering commented 7 months ago

See https://github.com/CatalogueOfLife/backend/issues/1059 and a few from the parser:

mdoering commented 7 months ago

For the vast majority of synonyms this seems reasonable to do. But there are also a few accepted names with [sic] - what to do with these?

mdoering commented 7 months ago

Interestingly, the first accepted one I checked is a WoRMS genus that is listed as a misspelling on the WoRMS site: https://www.molluscabase.org/aphia.php?p=taxdetails&id=536363

yroskov commented 7 months ago

> there are also a few accepted names with [sic] - what to do with these?

I would give the status "provisionally accepted" to all accepted names with a [sic] portion.

mdoering commented 7 months ago

When interpreting names, we actually already remove sic and instead keep an originalSpelling flag on the name: https://github.com/CatalogueOfLife/backend/issues/501

This flag is then rendered back as [sic] in the label.
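As a rough illustration of that flow (strip sic while interpreting, remember it as a flag, render it again in the label), here is a minimal Python sketch. It is not the actual ChecklistBank code, which is written in Java: `ParsedName`, `interpret`, and `label` are hypothetical names, the accepted sic spellings and the position of [sic] in the label are assumptions, and "Aus bus" is a placeholder name. Only the `originalSpelling` flag (written `original_spelling` here) comes from the comment above.

```python
import re
from dataclasses import dataclass

# Assumed variants of the marker: "[sic]", "(sic)", "sic", each optionally with "!".
SIC = re.compile(r"\s*(?:\[sic!?\]|\(sic!?\)|\bsic\b!?)", re.IGNORECASE)


@dataclass
class ParsedName:
    """Hypothetical, simplified stand-in for an interpreted name."""
    scientific_name: str
    authorship: str = ""
    original_spelling: bool = False  # True when a sic marker was found


def interpret(raw_name: str, authorship: str = "") -> ParsedName:
    """Remove sic markers from the name string and record them as a flag."""
    cleaned, hits = SIC.subn("", raw_name)
    # Collapse any whitespace left behind by the removal.
    return ParsedName(" ".join(cleaned.split()), authorship.strip(), hits > 0)


def label(name: ParsedName) -> str:
    """Render the display label, re-inserting [sic] from the flag."""
    parts = [name.scientific_name]
    if name.original_spelling:
        parts.append("[sic]")
    if name.authorship:
        parts.append(name.authorship)
    return " ".join(parts)


name = interpret("Aus bus [sic]", "Smith, 1900")
print(label(name))  # Aus bus [sic] Smith, 1900
```

Keeping the marker as a flag rather than as text means the stored name stays a clean binomial, so a later rank check would not mistake sic for an extra epithet.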

The examples you gave don't do that, though; they have [sic] as part of their authorship instead: https://api.checklistbank.org/dataset/286246/name/Z0ZOIgGBCF8_D_s4WIeCc

As this is from a January release and WoRMS datasets are updated pretty much every month, it must still be interpreting things wrongly. This does not happen in the tests, so it is difficult to reproduce.

mdoering commented 7 months ago

Ah, I can reproduce it when sic is supplied in the authorship and not in the scientificName!

mdoering commented 7 months ago

The name interpreter should now also look for sic and corrig statements inside the authorship. That means the originalSpelling flag is populated and sic is shown in the label, but it should not be considered an epithet or be part of the authorship string.
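A minimal Python sketch of what such an authorship check could look like is given below. This is an assumption-laden illustration, not the real interpreter (which is Java): `split_authorship` is a hypothetical helper, the marker spellings it recognizes are guesses, and the mapping of sic to `True` and corrig to `False` for the flag is an assumption made for the example.

```python
import re
from typing import Optional, Tuple

# Assumed spellings of the two markers; real data may contain other variants.
SIC = re.compile(r"(?:\[sic!?\]|\(sic!?\)|\bsic\b!?)", re.IGNORECASE)
CORRIG = re.compile(r"(?:\[corrig\.?\]|\(corrig\.?\)|\bcorrig\b\.?)", re.IGNORECASE)


def split_authorship(authorship: str) -> Tuple[str, Optional[bool]]:
    """Return (authorship without markers, original-spelling flag).

    Flag (assumed mapping): True for sic, False for corrig, None if neither.
    """
    flag: Optional[bool] = None

    cleaned, hits = SIC.subn(" ", authorship)
    if hits:
        flag = True

    cleaned, hits = CORRIG.subn(" ", cleaned)
    if hits and flag is None:
        flag = False

    # Collapse whitespace and drop separators left dangling by the removal.
    cleaned = " ".join(cleaned.split()).strip(" ,;")
    return cleaned, flag


print(split_authorship("[sic] Smith, 1900"))    # ('Smith, 1900', True)
print(split_authorship("corrig. Smith, 1900"))  # ('Smith, 1900', False)
print(split_authorship("Smith, 1900"))          # ('Smith, 1900', None)
```

A word-boundary match like this would also strip a genuine author token spelled "Sic", so a production version would need a stricter rule.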