gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

a taxon with only a single term for parsing #1

Closed sbodese closed 7 years ago

sbodese commented 9 years ago

A lowercase taxon consisting of a single term is not processed by the ECAT tool. ECAT returns an empty string for it. Examples (input = result):
"ammolagena" = ""
"Ammolagena" = Ammolagena
"heterostegina" = ""
"operculina", "epistomina" = ""
"neogloboquadrina pachyderma sinistral d13c standard deviation" = neogloboquadrina pachyderma sinistral
"Globigerina" = Globigerina
"globigerina" = ""
"Globigerinita" = Globigerinita
"Neogloboquadrina" = Neogloboquadrina
"neogloboquadrina" = ""

Taxa with more than one term are detected correctly.

Greetings from Bremen

mdoering commented 9 years ago

This looks correct. A lowercase species epithet on its own, as sometimes found in zoology, is not a real scientific name. We could try to parse it as a species epithet alone, but I am afraid this would open the door to parsing all sorts of rubbish.

sbodese commented 9 years ago

I don't mean only the genus part of a species name. I mean that, in general, no single-term lowercase taxon is detected or recognized by the ECAT tool. See the examples above, or this one: "chromista" => ECAT => "". For me, returning an empty string for a valid taxon is not correct behavior. Also, if I submit a lowercase string containing a taxon, such as "neogloboquadrina pachyderma sinistral d13c standard deviation", ECAT detects "neogloboquadrina pachyderma sinistral". So ECAT behaves inconsistently. In my view the tool should detect a possible taxon in every case. Lowercase alone should not be a reason for a failed detection, and most text-parsing and text-mining tools apply a lowercase step somewhere in their processing chain anyway.

mdoering commented 9 years ago

The trouble is that it would then recognize ANY single word as a taxon. I am not sure that is useful.
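
A toy illustration of that concern, in Python. The two patterns below are assumptions made up for this example and are not the parser's actual rules; they only show that once the capitalization requirement is dropped, ordinary words pass the same test as genus names.

```python
import re

# Toy pattern (hypothetical, not the GBIF parser's rule):
# a uninomial is a single capitalized, Latin-looking word.
STRICT_UNINOMIAL = re.compile(r"^[A-Z][a-z]+$")

# Relaxed variant that also accepts an all-lowercase word.
RELAXED_UNINOMIAL = re.compile(r"^[A-Za-z][a-z]+$")


def accepted(pattern, words):
    """Return the words that the given pattern would accept as a uninomial."""
    return [w for w in words if pattern.match(w)]


samples = ["Globigerina", "globigerina", "chromista", "standard", "deviation"]

strict = accepted(STRICT_UNINOMIAL, samples)    # only the capitalized name
relaxed = accepted(RELAXED_UNINOMIAL, samples)  # every word in the list

print(strict)
print(relaxed)
```

With the relaxed pattern, "standard" and "deviation" are indistinguishable from "globigerina" without consulting a name index.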

sbodese commented 9 years ago

I don't know the logic ECAT uses to detect a taxon. Can lowercase alone really cause a failed detection? My question was mostly about a use case where ECAT sits inside a string-processing chain that works with normalized strings. I have changed that chain now. Otherwise, I suggest either correcting the ECAT behavior or rejecting lowercase strings with a result message during bulk processing of a given string array. I also think that detecting a single term as a taxon is a regular use case. The ECAT web service also delivers a taxonomic rank for a taxon.
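
The idea of reporting lowercase single-term strings instead of silently returning an empty result could be sketched on the caller's side as a pre-check, shown here in Python. This is only a sketch under stated assumptions: `classify_input`, `prefilter`, and the status labels are hypothetical names invented for this example and are not part of ECAT or the GBIF name parser.

```python
def classify_input(raw):
    """Label one raw string before it is handed to a name parser.

    Returns a (status, text) tuple. The status labels are hypothetical.
    """
    text = raw.strip()
    if not text:
        return ("empty", text)
    terms = text.split()
    if len(terms) == 1 and text.islower():
        # A single all-lowercase term: flag it explicitly so the caller
        # sees a message rather than an unexplained empty string.
        return ("lowercase_single_term", text)
    return ("ok", text)


def prefilter(batch):
    """Classify every string in a batch, preserving input order."""
    return [classify_input(s) for s in batch]


batch = [
    "ammolagena",
    "Ammolagena",
    "neogloboquadrina pachyderma sinistral d13c standard deviation",
    "",
]
for status, text in prefilter(batch):
    print(status, repr(text))
```

Only the entries labeled "ok" would then be sent on to the parser; the flagged entries can be reported back or fixed upstream, for example by keeping the original capitalization in the normalization step.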

mdoering commented 7 years ago

We won't support all lowercase names for parsing. You might try the GNANameParser implementation wrapping gnparser, but it also fails for "chromista" for good reasons: http://parser.globalnames.org/?q=chromista