gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

Taxa detection #2

Closed: sbodese closed this issue 9 years ago

sbodese commented 9 years ago

Hi, it seems that the ECAT service also detects non-taxon terms, as in this example:

given: Amphipoda biomass as carbon

returns: Amphipoda biomass as

So "biomass" is not a scientific name or taxon, and neither is "as". What can we do? I thought ECAT extracts only the real taxon terms from a given string?

Greetings

Steffen

timrobertson100 commented 9 years ago

Can you please provide an example URL or provide information on how you are using it?

The species match service (the service that is used to locate taxa in the backbone taxonomy) will only find Amphipoda for that name: http://api.gbif.org/v1/species/match?name=Amphipoda%20biomass%20as%20carbon

There are other services that will try and atomize a name string into constituent parts, but they don't rely on any "complete" taxonomic knowledge other than blacklists and algorithms IIRC.

sbodese commented 9 years ago

To be clear: I mean only the name parser service! I use this URI: http://api.gbif.org/v1/parser/name with a REST client, like this: http://api.gbif.org/v1/parser/name?name=Amphipoda%20biomass%20as%20carbon

It is documented here: http://www.gbif.org/developer/species#parser

Another example:

given: all cis Eicosapentaenoic acid per unit mass total organic carbon

returns: All cis

http://api.gbif.org/v1/parser/name?name=all%20cis%20Eicosapentaenoic%20acid%20per%20unit%20mass%20total%20organic%20carbon

We are only interested in taxon term detection, because many of our taxa are not at species level.
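For reference, a minimal sketch of how the request URLs in this thread can be built with proper URL encoding. The helper name `service_url` is made up for illustration; the base URL and the two endpoint paths are the ones quoted in the comments above, and the one-`name`-parameter-per-string form follows the parser documentation quoted later in the thread.

```python
from urllib.parse import quote, urlencode

API_ROOT = "http://api.gbif.org/v1"  # base URL as it appears in this thread


def service_url(endpoint, *names):
    """Build a GET URL with one `name` parameter per input string.

    `endpoint` is a path such as "parser/name" or "species/match".
    This helper is hypothetical and not part of any GBIF client library.
    """
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use "+"
    query = urlencode([("name", n) for n in names], quote_via=quote)
    return f"{API_ROOT}/{endpoint}?{query}"


if __name__ == "__main__":
    print(service_url("parser/name", "Amphipoda biomass as carbon"))
    print(service_url("species/match", "Amphipoda biomass as carbon"))
```

Passing several strings to `service_url("parser/name", ...)` simply repeats the `name` parameter, one per string.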

mdoering commented 9 years ago

The name parser expects a reasonably well-formed scientific name and parses it into its parts. It does not check whether individual name parts like the genus, species or infraspecific epithet actually exist. If you pass it English words or any string that looks like a bi- or trinomial, it will parse it.

Basically, anything of the following form will be parsed; the parts in brackets are optional:

Genus [species] [Authorship] [rankMarker] [infraspecies] [Authorship]

The authorship can contain spaces and the year and is rather flexible, but it needs to start with an upper case letter or a name preposition like "von" or "van".

In addition to that, nomenclatural notes, hybrid names, taxon concept references, strain and cultivar names are also parsed.

The actual parsing is a lot more complex though.
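To illustrate why a purely shape-based parser accepts the example from the first comment, here is a deliberately tiny sketch. It is not the GBIF parser and ignores authorship, rank markers, hybrids and everything else mentioned above; the pattern and the function name `shape_parse` are assumptions made only for this illustration.

```python
import re

# Toy pattern: a capitalised "genus" followed by up to two lower-case words.
# No dictionary of real names is consulted, so any words of this shape match.
SHAPE = re.compile(r"([A-Z][a-z]+)(?:\s+([a-z]+))?(?:\s+([a-z]+))?")


def shape_parse(text):
    """Return the name-shaped prefix of `text` as a dict, or None if there is none."""
    m = SHAPE.match(text)  # match() anchors at the start of the string
    if m is None:
        return None
    genus, species, infraspecies = m.groups()
    return {"genus": genus, "species": species, "infraspecies": infraspecies}


if __name__ == "__main__":
    print(shape_parse("Amphipoda biomass as carbon"))
    print(shape_parse("per unit mass"))
```

With this toy pattern, "Amphipoda biomass as carbon" yields genus "Amphipoda", species "biomass" and infraspecies "as", which mirrors the "Amphipoda biomass as" result reported above, while a string that does not start with a capitalised word yields nothing.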

sbodese commented 9 years ago

Hey folks, what's going on here? Your documentation says: "Parses a scientific name string and returns the ParsedName version of it. Accepts multiple parameters each with a single name. Make sure you url encode the names properly."

So your name service does not work correctly, because there is no "invalid" status. As described, the name service should return nothing if the input is not a valid scientific name string. Otherwise your tool behaves strangely.

So Markus, I don't agree!

Parsing an already valid taxon string with this tool is a totally obsolete task; you should consider this as a developer. This ECAT tool expects a fixed term order to detect terms, so if we have data with another structure, it is not flexible.

timrobertson100 commented 9 years ago

The match service (the one I linked to in the first answer) is the service that does what you are expecting - it uses a taxonomic backbone to verify the content. The name parsing service is one that atomizes content as Markus describes, but makes no use of dictionaries of names.

If you are dealing with a dataset of scientific names and want to split them into atoms, then use the name parsing service. If you are dealing with content that is dirty - i.e. it contains things that are not correct scientific names - then you will likely have to make use of both services.
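A rough sketch of one way to combine the two services for dirty input: ask the match service first, return nothing when it finds no taxon, and only then send the matched name to the parser. The response field names `matchType` and `scientificName`, the `"NONE"` value, and the list-shaped parser response are assumptions about the JSON payloads that should be checked against the API documentation; `verified_parse` and `fetch_json` are hypothetical helpers.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_ROOT = "http://api.gbif.org/v1"  # base URL as it appears in this thread


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def verified_parse(text, fetch=fetch_json):
    """Return the parsed form of the taxon matched in `text`, or None if nothing matched."""
    query = urlencode({"name": text}, quote_via=quote)
    match = fetch(f"{API_ROOT}/species/match?{query}")
    # Assumed field: "matchType" is "NONE" when the backbone has no match
    if match.get("matchType", "NONE") == "NONE":
        return None
    matched_name = match.get("scientificName")  # assumed field name
    if not matched_name:
        return None
    # Parse the verified name rather than the raw, dirty input string
    query = urlencode({"name": matched_name}, quote_via=quote)
    parsed = fetch(f"{API_ROOT}/parser/name?{query}")
    return {"match": match, "parsed": parsed[0] if parsed else None}
```

The `fetch` argument exists so the logic can be exercised with a stub instead of live HTTP calls; calling `verified_parse("Amphipoda biomass as carbon")` without it performs two real requests.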

sbodese commented 9 years ago

OK, thanks. I suggest that you update the ECAT documentation along the lines of Markus's comment; you should define what you expect as a scientific name (biological nomenclature?). ECAT also looks like a taxon parser, meaning it extracts taxa from a given string. So if I understood Markus correctly, this ECAT tool should normally return nothing if the "scientific name" does not have the expected structure? We have parsed a big set of strings like this one with ECAT: Paratendipes sp., headcapsules per fresh sediment volume

http://api.gbif.org/v1/parser/name?name=Paratendipes%20sp.,%20headcapsules%20per%20fresh%20sediment%20volume

It works fine and gives us the taxon term. Finally, my point is that if ECAT detects no taxa, it should return null.