CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Issues with Nameusage search #1089

Open javiermerino-tracasa opened 2 years ago

javiermerino-tracasa commented 2 years ago

Hello,

As part of our update to the EUNIS2 database, we are retrieving species information from CoL by searching for taxonomy by name. About 80% of the time the search returns accurate results, but for the remaining 20% the results are seemingly random and wrong. Below is an example of a plant name that returns a bacterium as the result: <https://api.catalogue.life/dataset/3LR/nameusage/search?q=Morella rivas-martinezii&limit=300> The actual taxonomy for "Morella rivas-martinezii" does not appear until search result number 227.

Here are two more examples:

Centranthus amazonum -> kingdom:Animalia, phylum:Chordata, class:Amphibia, order:Anura, family:Bufonidae, genus:Bufotes, species:Bufotes boulengeri

Pastor roseus -> kingdom:Animalia, phylum:Nematoda, class:Chromadorea, subclass:Chromadoria, order:Chromadorida, suborder:Chromadorina, superfamily:Chromadoroidea, family:Chromadoridae, subfamily:Euchromadorinae, genus:Crestanema

Also, when the name ends in "all others", for instance "Periparus ater all others", the search always returns the same result: kingdom:Plantae, phylum:Tracheophyta, class:Magnoliopsida, order:Caryophyllales, family:Amaranthaceae, subfamily:Chenopodioideae, genus:Bassia

Is this a bug in the search API or do I need some additional filters? Thanks a lot in advance. Javier

thomasstjerne commented 2 years ago

Hi @javiermerino-tracasa, you can limit the search to a given higher taxon, e.g. Plantae, using the TAXON_ID parameter. The default match type is WHOLE_WORDS, which means that any name with "rivas" in the author string will match. If you change the match type to EXACT, you will only get matches that contain your query as a full substring.

Using TAXON_ID=P and type=EXACT will find your species here: https://api.catalogueoflife.org/dataset/3LR/nameusage/search?TAXON_ID=P&limit=50&offset=0&q=Morella%20rivas-martinezii&type=EXACT

You may however find that in some cases type=EXACT is too narrow when it comes to e.g. spelling variants and different author spellings.
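As a minimal sketch of how such a request URL could be assembled in a script, the snippet below builds the query string from the parameters mentioned above (`q`, `TAXON_ID`, `type`, `limit`). The helper name `build_search_url` is hypothetical and only for illustration; it does not call the API.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the link above; 3LR is the dataset key used in this thread.
BASE = "https://api.catalogueoflife.org/dataset/3LR/nameusage/search"


def build_search_url(q, **params):
    """Return a search URL for the name `q` plus any extra query parameters.

    Hypothetical helper: it only assembles the URL, it does not send a request.
    """
    query = {"q": q, **params}
    # quote_via=quote encodes spaces as %20, matching the links in this thread.
    return BASE + "?" + urlencode(query, quote_via=quote)


url = build_search_url("Morella rivas-martinezii", TAXON_ID="P", type="EXACT", limit=50)
print(url)
```

Dropping `TAXON_ID` from the call gives the unrestricted EXACT search for cases where the higher taxon is not known in advance.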

javiermerino-tracasa commented 2 years ago

Hello Thomas,

We are using the name search to get information for hundreds of names in an automated fashion. We cannot rely on the TAXON_ID suggestion, since we do not know the taxonomy for each name beforehand.

As for EXACT, we can try it first and fall back to WHOLE_WORDS for the names where it finds nothing. This will improve the results for names that match perfectly. However, the results will remain just as inaccurate for the examples I showed above.

Thanks. Javier

mdoering commented 2 years ago

Alternatively, you can use the PREFIX search type when you want the name to start with your query string. In your example, the query string Morella rivas-martinezii is taken as 3 tokens (Morella, rivas and martinezii), which will match the name or the authorship. You can also restrict the matching to the name alone, avoiding matches on the authorship, which in your example is where Rivas is hit almost all the time:

https://api.catalogueoflife.org/dataset/3LR/nameusage/search?q=Morella%20rivas-martinezii&content=scientificName

mdoering commented 2 years ago

This is the prefix search: https://api.catalogueoflife.org/dataset/3LR/nameusage/search?q=Morella%20rivas-martinezii&content=scientificName&type=PREFIX
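To illustrate why restricting the match to the name helps, here is a simplified local model of token matching. It is not the server's implementation: the tokenizer, the "any token matches" rule, and the two records are all assumptions made up for this sketch.

```python
import re


def tokenize(text):
    """Split on anything that is not a letter or digit and lower-case the parts."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def matches(record, query, content=("scientificName", "authorship")):
    """True if any query token equals a whole word in one of the `content` fields.

    Simplified stand-in for whole-word matching; the real search may differ.
    """
    tokens = set(tokenize(query))
    words = set()
    for field in content:
        words.update(tokenize(record.get(field, "")))
    return bool(tokens & words)


# Made-up records for illustration only; names and authors are placeholders.
records = [
    {"scientificName": "Morella rivas-martinezii", "authorship": "Placeholder"},
    {"scientificName": "Examplea fictitia", "authorship": "Rivas & Placeholder"},
]

query = "Morella rivas-martinezii"
name_or_author = [r["scientificName"] for r in records if matches(r, query)]
name_only = [
    r["scientificName"]
    for r in records
    if matches(r, query, content=("scientificName",))
]
print(tokenize(query))
print(name_or_author)
print(name_only)
```

In this toy model the unrelated record is picked up only because "Rivas" appears in its authorship; limiting the match to the name field drops it.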

javiermerino-tracasa commented 2 years ago

Hello Markus,

Thanks for the help. I am now using type=EXACT first and, when that fails, content=scientificName, and it is working much better for us.
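For anyone automating the same approach, a rough sketch of that fallback order might look like the following. The function `search_with_fallback` and the `fetch` callable are hypothetical: `fetch` is assumed to take a dict of query parameters and return a list of results, and a stub is used here in place of a real HTTP call.

```python
def search_with_fallback(name, fetch):
    """Try type=EXACT first, then content=scientificName if nothing is found.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a dict of query parameters and returns a list of results.
    Returns (extra_params_used, results), or (None, []) if every attempt is empty.
    """
    attempts = [
        {"type": "EXACT"},
        {"content": "scientificName"},
    ]
    for extra in attempts:
        results = fetch({"q": name, **extra})
        if results:
            return extra, results
    return None, []


def fake_fetch(params):
    """Stub standing in for the API: EXACT finds nothing, anything else finds one hit."""
    if params.get("type") == "EXACT":
        return []
    return [{"scientificName": params["q"]}]


used, hits = search_with_fallback("Morella rivas-martinezii", fake_fetch)
print(used, len(hits))
```

Returning the parameters that produced the hit makes it easy to log which names needed the looser search and review them separately.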