gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

Parsing of `Tethea or` drops the epithet #99

Closed. djtfmartin closed this issue 1 month ago

djtfmartin commented 1 month ago

Parsing `Tethea or (Denis & Schiffermüller), 1776` drops the specific epithet "or" and interprets the name as a genus.

Parsed correctly in current GBIF API (older parser): https://api.gbif-uat.org/v1/species/match?name=Tethea or (Denis & Schiffermüller), 1776

CLB API (current parser): https://api.checklistbank.org/dataset/53147/match/nameusage?scientificName=Tethea%20or%20(Denis%20&%20Schifferm%C3%BCller),%201776

Debugging shows the issue is at the parser level, not in the web service API implementation.
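
To reproduce the two requests above from a script, the name has to be percent-encoded, because a literal `&` in a query string is normally read as a parameter separator and can cut the name off after "Denis". The following is a minimal sketch that only builds the URLs; the helper names `gbif_match_url`, `clb_match_url`, and `query_param` are made up for illustration, and the endpoints and the dataset key 53147 are copied from the links in this report.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

NAME = "Tethea or (Denis & Schiffermüller), 1776"


def gbif_match_url(name: str) -> str:
    """Build the GBIF UAT match URL from this report, with the name fully encoded."""
    query = urlencode({"name": name}, quote_via=quote)
    return f"https://api.gbif-uat.org/v1/species/match?{query}"


def clb_match_url(name: str, dataset_key: int = 53147) -> str:
    """Build the CLB name usage match URL from this report, with the name fully encoded."""
    query = urlencode({"scientificName": name}, quote_via=quote)
    return f"https://api.checklistbank.org/dataset/{dataset_key}/match/nameusage?{query}"


def query_param(url: str, key: str) -> str:
    """Decode one query parameter the way a standard query-string parser would."""
    return parse_qs(urlsplit(url).query)[key][0]


if __name__ == "__main__":
    print(gbif_match_url(NAME))
    print(clb_match_url(NAME))
```

Decoding the encoded URL with `query_param` returns the full name, whereas a URL with a raw `&` yields only the part before it. Whether either service tolerates the raw form is not something this sketch can tell.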

mdoering commented 1 month ago

Parsing short two-character epithets that resemble English words is hard to support. The year is also outside the brackets, which might cause trouble. The GBIF parser struggles with the authorship too: https://api.gbif.org/v1/parser/name?name=Tethea%20or%20(Denis%20&%20Schiffermüller),%201776

For special cases like this, the CLB parser can be configured manually. Basically, you can configure expected results for specific names and authorships: https://api.checklistbank.org/parser/name/config

I pushed a config for this name to CLB: https://api.checklistbank.org/parser/name/config/Tethea%20or|(Denis%20&%20Schiffermüller),%201776
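
Judging only from the URL above, a config entry appears to be addressed by the name and the authorship joined with `|`. The sketch below builds such a key and URL under that assumption; `config_key` and `config_url` are hypothetical helpers, not part of any CLB client, and encoding the whole key (including `|`) is a conservative choice that the server is assumed to decode.

```python
from urllib.parse import quote, unquote

CONFIG_BASE = "https://api.checklistbank.org/parser/name/config"


def config_key(scientific_name: str, authorship: str) -> str:
    """Join name and authorship with '|' (format inferred from the URL in this thread)."""
    return f"{scientific_name}|{authorship}"


def config_url(scientific_name: str, authorship: str) -> str:
    """Percent-encode the whole key so '&', '|' and spaces survive as one path segment."""
    return f"{CONFIG_BASE}/{quote(config_key(scientific_name, authorship), safe='')}"


def key_from_url(url: str) -> str:
    """Recover the key from the last path segment of a config URL."""
    return unquote(url.rsplit("/", 1)[1])


if __name__ == "__main__":
    print(config_url("Tethea or", "(Denis & Schiffermüller), 1776"))
```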

This works now:
https://api.checklistbank.org/parser/name?q=Tethea%20or%20(Denis%20%26%20Schiffermüller),%201776
https://api.checklistbank.org/parser/name?q=Tethea%20or%20(Denis%20%26%20Schiffermüller,%201776)
https://api.checklistbank.org/parser/name?q=Tethea%20or

mdoering commented 1 month ago

Well, if you use the parser outside of CLB, we need to make sure to load the configs from CLB, which stores them in the database. I haven't paid much attention to that really.
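
One way to picture "loading the configs" when the parser runs outside CLB is an override table that is consulted before the real parser. The sketch below is an assumption-laden illustration, not the name-parser library's API: `OverrideParser`, the result field names, and the `genus_only` stand-in (which merely mimics the reported behaviour) are all hypothetical, and how the overrides would actually be fetched from CLB is left to the injected `load_overrides` callable.

```python
from typing import Callable, Dict, Optional

ParsedName = Dict[str, Optional[str]]


class OverrideParser:
    """Return a configured result for known names, else defer to a fallback parser."""

    def __init__(
        self,
        fallback: Callable[[str], ParsedName],
        load_overrides: Callable[[], Dict[str, ParsedName]],
    ) -> None:
        self._fallback = fallback
        # Loaded once at construction; a real setup might refresh this periodically.
        self._overrides = load_overrides()

    def parse(self, name: str) -> ParsedName:
        key = " ".join(name.split())  # collapse repeated whitespace before lookup
        hit = self._overrides.get(key)
        return dict(hit) if hit is not None else self._fallback(name)


def genus_only(name: str) -> ParsedName:
    """Stand-in parser (hypothetical): keeps only the first word, like the reported bug."""
    return {"genus": name.split()[0], "specificEpithet": None, "authorship": None}


# Hypothetical override data; the field names are illustrative only.
OVERRIDES: Dict[str, ParsedName] = {
    "Tethea or (Denis & Schiffermüller), 1776": {
        "genus": "Tethea",
        "specificEpithet": "or",
        "authorship": "(Denis & Schiffermüller, 1776)",
    }
}

if __name__ == "__main__":
    parser = OverrideParser(genus_only, lambda: OVERRIDES)
    print(parser.parse("Tethea or (Denis & Schiffermüller), 1776"))
```

Injecting the loader keeps the lookup logic testable without network access; a database query or an HTTP call to CLB could be swapped in behind the same callable.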