gbif / gbif-api

GBIF API
Apache License 2.0

wrong matchType in species/match api call #129

Open abubelinha opened 1 month ago

abubelinha commented 1 month ago

I've been trying api matches against the GBIF backbone taxonomy during the last few days and I have a question about one particular case. I was trying to match this name, which appears in several of our specimens: Brunella grandiflora (L.) Jacq. em Moench. http://api.gbif.org/v1/species/match?name=Brunella%20grandiflora%20(L.)%20Jacq.%20em%20Moench.

Brunella Mill. is an unaccepted genus in the backbone (a synonym and orthographic variant of Prunella L.).

So I was expecting this api call to return some kind of matchType:FUZZY

Or perhaps a matchType:HIGHERRANK ... but not like this:

  1. For some reason, the api is not taking that synonym relationship into account, nor the fact that both genera are in the same family (Lamiaceae). Instead, the api matches a species of the genus Brunellia (family Brunelliaceae, not even in the same taxonomic order or class). Shouldn't the genus synonym relationship and the higher taxonomic ranks help the api to give a taxonomically closer match here? Especially when the "Brunella" spelling is no closer to "Brunellia" than to "Prunella" (one letter difference in both cases). Why did the api take the worst taxonomic option here?

  2. I passed a species name, and the api returns a name match at species level too. But it is flagged as matchType: HIGHERRANK. I'd say that is a bug. If not, what's the logic behind it?

  3. Regarding the wrongly matched "species": in the web species interface it is called Brunellia grandiflora (no authorship, and no children taxa in the left panel). It also says:

    This record has been created during indexing and did not explicitly exist in the source data as such. Origin: Generated, missing genus or species for ‘orphaned’ lower name.

    The api response does not give exactly that information. I see these flags:

    "nameType": "SCIENTIFIC",
    "rank": "SPECIES",
    "origin": "IMPLICIT_NAME",
    "taxonomicStatus": "ACCEPTED"

    IMPLICIT_NAME is the relevant one, isn't it? But I am not sure I understand the implications. At first glance I'd say this name was created in the backbone because there are some infraspecific names which need to be put somewhere in the tree. But as I said, there are no children names on the species page, so I don't really understand what is going on.

Thanks in advance for possible explanations @abubelinha

mdoering commented 1 month ago

Matching is always a very delicate balancing act. In this case the species Brunella grandiflora is just one character off and matches quite well. It is a result that I would expect. Swaps in the first character are not handled by the fuzzy matching; the first character is always taken as it is. You can get some insight into the matching results by specifying verbose=true, which shows the alternative matches that were considered and how they scored:

http://api.gbif.org/v1/species/match?verbose=true&name=Brunella

In order to avoid matching to species further away in the classification you should provide some taxonomic context. For example if you give the family it matches to your expectations:

http://api.gbif.org/v1/species/match?verbose=true&name=Brunella%20grandiflora%20(L.)%20Jacq.%20em%20Moench&family=Lamiaceae
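For scripted use, requests like the one above can be assembled programmatically. This is only a minimal sketch: the helper name `build_match_url` is made up for illustration, and the only query parameters assumed are the ones already shown in this thread (`name`, `verbose`, `family`, `kingdom`).

```python
from urllib.parse import urlencode

MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name, verbose=True, **context):
    """Return a species/match URL for `name` plus optional context.

    `context` holds extra query parameters such as family="Lamiaceae"
    or kingdom="Plantae"; entries set to None are skipped.
    (Hypothetical helper, not part of any GBIF client library.)
    """
    params = {"name": name}
    if verbose:
        params["verbose"] = "true"
    # Only pass context values that are actually known.
    params.update({k: v for k, v in context.items() if v is not None})
    return MATCH_ENDPOINT + "?" + urlencode(params)


url = build_match_url("Brunella grandiflora (L.) Jacq.", family="Lamiaceae")
print(url)
```

Skipping `None` values means the same call works whether or not a family is known for a given record.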

In general, matching on the name alone should be avoided. Especially with genera there are too many homonyms and closely spelled alternatives.

It seems as if the emendation part of the authorship is considered by the matching to be an infraspecific epithet. Hence you get higher rank matches even to the species in your link above. You get more alternatives to be considered when the em part is removed:

http://api.gbif.org/v1/species/match?verbose=true&name=Brunella%20grandiflora%20(L.)%20Jacq.&family=Lamiaceae
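To compare the candidates that a verbose response reports, something like the following could be used. It is a sketch under assumptions: it takes the parsed JSON as a dict and assumes the alternatives arrive in an `alternatives` list whose entries carry `scientificName`, `confidence` and `matchType` fields. The sample data is hand-written for illustration, not real api output.

```python
def rank_candidates(response):
    """Return (confidence, scientificName, matchType) tuples, best first.

    The top-level match is listed together with the entries of the
    `alternatives` list (assumed field name) of a verbose response.
    """
    candidates = [response] + list(response.get("alternatives", []))
    rows = [
        (c.get("confidence", 0), c.get("scientificName", "?"), c.get("matchType", "?"))
        for c in candidates
    ]
    # Highest confidence first.
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical, hand-written response fragment (placeholder names).
sample = {
    "scientificName": "Name A",
    "confidence": 90,
    "matchType": "EXACT",
    "alternatives": [
        {"scientificName": "Name B", "confidence": 40, "matchType": "FUZZY"},
        {"scientificName": "Name C", "confidence": 20, "matchType": "HIGHERRANK"},
    ],
}

for confidence, name, match_type in rank_candidates(sample):
    print(confidence, name, match_type)
```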

mdoering commented 1 month ago

IMPLICIT NAMES are names that exist in the backbone because another accepted name exists that implicitly requires the existence of that name. E.g. a species implicitly requires the genus to exist, and a subspecies requires the species. In this case the subspecific autonym Brunellia grandiflora subsp. grandiflora from the Redlists has caused this:

https://www.gbif.org/species/176820144

This is a rather rare case when apparently only the autonym existed and no other subspecies. In that case we remove the autonym from the backbone and only keep the species.

abubelinha commented 1 month ago

Thanks for your answers, @mdoering. I had already thought about the convenience of passing a family in the request. But that only applies when you already have some previous information about the name string (i.e., it comes from some previous database, and not just from a collection label or a loan list with just species names).

But yes, in this particular case I was testing names from a database which also has some family information. The problem is that the family information is somewhat unreliable, old, and comes from several sources (depending mainly on the source from which the first specimen name of that genus was taken: most of the time a quite old local flora, sometimes Flora Europaea, Tropicos, IPNI or the Index Nominum Genericorum ...)

As a result, the family I can pass may or may not match the one in the backbone. In that case, wouldn't I be fooling the matching api and perhaps getting worse matches than if I provided nothing but name and kingdom? I can think of two possible confusions:

  1. Families with multiple names: is the matching api always smart enough to handle any of them? Lamiaceae / Labiatae, Apiaceae / Umbelliferae, Brassicaceae / Cruciferae, Poaceae / Gramineae ... and so on (I have no idea how many pairs like these exist).

  2. I am also particularly concerned about differences between APG taxonomy and classic taxonomic groups, which would make my requests fail if I pass an "old family" to the api.

Should I first try to generate an updated list of families? I wonder if the same species/match api could be used for this. To find out where in the GBIF backbone taxonomy a genus should be placed, would this be a right way of handling the problem?

  1. I generate a list of myfamily-mygenus pairs from my institution (to be corrected now). One by one ...
  2. I pass them to species/match (setting rank=GENUS&genus=mygenus&family=myfamily) and check the results:
    • If I get a matchType HIGHERRANK, then something went wrong and I should try again without family=myfamily? Not sure about this
    • When I get an EXACT match, should I store it together with its backbone-returned family, so I use this family in my future infrageneric searches of species within this genus?
    • For mygenus with non-EXACT matches, just don't use any myfamily in future infrageneric searches?
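The proposed workflow above could be sketched roughly as follows. This only illustrates the idea being asked about, not a recommended procedure. The function names are made up, the response fields `matchType`, `rank` and `family` are assumed to be present in the parsed JSON, and the sample responses are hand-written stand-ins rather than real api output.

```python
def family_hint(genus_match):
    """Return the backbone family to reuse for this genus, or None.

    Only an EXACT match at GENUS rank is trusted here; any other
    result (e.g. FUZZY or HIGHERRANK) yields no family hint.
    """
    if genus_match.get("matchType") == "EXACT" and genus_match.get("rank") == "GENUS":
        return genus_match.get("family")
    return None


def build_family_lookup(genus_matches):
    """Map each local genus name to its backbone family hint (or None)."""
    return {genus: family_hint(match) for genus, match in genus_matches.items()}


# Hypothetical parsed responses, keyed by the genus name that was sent.
responses = {
    "Veronica": {"matchType": "EXACT", "rank": "GENUS", "family": "Plantaginaceae"},
    "Mygenus": {"matchType": "HIGHERRANK", "rank": "FAMILY", "family": "Myfamily"},
}

lookup = build_family_lookup(responses)
print(lookup)
```

A genus mapped to `None` would then be matched later without any `family` parameter.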

So, in step 2, this would be the query for the wrong Brunella genus, extracted from the above species string: https://api.gbif.org/v1/species/match?name=Brunella&verbose=True&kingdom=Plantae&rank=GENUS&family=Labiatae In this case the response says backbone takes it as synonym of Prunella, and also places it in "Lamiaceae" (instead of Labiatae). A similar one could be Ionopsidium/Jonopsidium which belongs to Brassicaceae/Cruciferae.

An example with one of those APG-changed families would be Veronica, like many other genera formerly placed in the family Scrophulariaceae. If I pass the family in this request: https://api.gbif.org/v1/species/match?name=Veronica&verbose=True&kingdom=Plantae&rank=GENUS&family=Scrophulariaceae I am told that it matches, but that it is now in Plantaginaceae.

OK, so when I try to match any species of Brunella (according to our labels) I should pass in "Lamiaceae", and for Veronica species I should pass "Plantaginaceae".

Or is this (a bit overcomplicated for me) fully unnecessary, and you think I can safely pass the old families and the api will recognize them?

BTW, I am confused by these two api request parameters, genericName and genus. What's the difference between them, and is it useful to pass either of them in the above requests?

species/match api documentation doesn't say anything about the data type for the second one.

djtfmartin commented 1 month ago

If family is supplied in the web service request, then it's only used in a secondary step of the matching algorithm to choose between 2 or more candidate names that closely match. So if the supplied family name is outdated, it'll largely be ignored and won't have a big impact on matches. The algorithm tries to pick the best candidate, and comparing the higher taxonomy is just one part of this (ranks, authorships and name similarity are others).

For the genericName vs genus question: the presence of both relates to the two separate Darwin Core terms. Occurrence records may be published with one, both or neither of these fields. But the genericName and genus can be different for synonyms (as described in the Darwin Core terms).

The genericName is used to construct a full scientific name using the specificEpithet (and other name parts), which can be supplied in separate request parameters.

The genus field is only used to match to a higher taxon when matching with scientificName (or the name constructed with name parts) has been unsuccessful.
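As an illustration of the distinction, a query that passes the name parts separately might be built like this. It is a sketch only: the helper is hypothetical, and the parameter names are limited to the ones mentioned in this thread (`genericName`, `specificEpithet`, `genus`, `family`). The Brunella/Prunella values are taken from earlier in the thread as an example of a synonym where the two fields would differ.

```python
from urllib.parse import urlencode


def build_parts_query(generic_name, specific_epithet, genus=None, family=None):
    """Return a query string that passes the name parts as separate parameters.

    genericName + specificEpithet describe the name as written on the
    label; genus (optional) is the genus the taxon is classified in.
    (Hypothetical helper for illustration.)
    """
    params = {"genericName": generic_name, "specificEpithet": specific_epithet}
    if genus:
        params["genus"] = genus
    if family:
        params["family"] = family
    return urlencode(params)


# Name written with the synonym genus, classified under the accepted genus.
query = build_parts_query("Brunella", "grandiflora", genus="Prunella")
print("https://api.gbif.org/v1/species/match?" + query)
```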

Hope this helps.

We should add the data type for genus - good catch.

abubelinha commented 2 weeks ago

Thanks for all your helpful comments, @djtfmartin. I am a bit busy this month but hope to get back to this in a few weeks, so I appreciate seeing these things documented here. In the meantime, I caught another api response which looks a bit odd to me:

The name I tried to match is "Dryopteris borreri Newman": https://api.gbif.org/v1/species/match?name=Dryopteris%20borreri%20Newman The backbone accepted name I expected to get back is "Dryopteris borreri (Newman) Oberh. & Tavel", but the api returns a genus rank match instead: "Dryopteris Adans." (matchType = HIGHERRANK).

If I use a search without providing authorship, I get the expected result: https://api.gbif.org/v1/species/match?name=Dryopteris%20borreri

Isn't it a bit odd that the 1st request, which provides an "imperfect but not so bad" authorship, returns a "worse" response than the 2nd request, which only provides a canonical name?

When there is no good EXACT or FUZZY match using the provided authorship, wouldn't it be more appropriate to fall back to the canonical name before giving up and returning a HIGHERRANK match?

Maybe this is difficult to achieve for some technical reason? (i.e. having to make a 2nd database call which slows down things a lot, or whatever). Thanks in advance for providing explanations of this behaviour

mdoering commented 2 weeks ago

You can add the query parameter verbose=true to get more insight into what happens:

https://api.gbif.org/v1/species/match?name=Dryopteris%20borreri%20Newman&verbose=true

You now see the list of considered potential matches as alternatives. This list contains Dryopteris borreri Newm. twice as a pro parte synonym of different accepted names and scoring highest (126 vs 83 for Dryopteris borreri (Newman) Oberh. & Tavel).

Additionally there is also:

I agree this is unfortunate behavior though.
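Until the service behaves differently, the fallback suggested earlier in this thread could be approximated on the client side. The sketch below is an assumption-laden illustration, not how the api itself works: `fetch` is any callable that takes a name string and returns the parsed JSON response as a dict, and the canned responses mimic the Dryopteris case described above (the EXACT matchType for the canonical name is assumed, as the thread only calls it "the expected result").

```python
def match_with_fallback(full_name, canonical_name, fetch):
    """Match `full_name`; if the result is not EXACT/FUZZY, retry with `canonical_name`.

    Returns the first response when the retry does not improve on it.
    """
    good = ("EXACT", "FUZZY")
    first = fetch(full_name)
    if first.get("matchType") in good:
        return first
    second = fetch(canonical_name)
    if second.get("matchType") in good:
        return second
    return first


# Canned stand-ins for api responses (hand-written, following the thread).
CANNED = {
    "Dryopteris borreri Newman": {
        "scientificName": "Dryopteris Adans.",
        "rank": "GENUS",
        "matchType": "HIGHERRANK",
    },
    "Dryopteris borreri": {
        "scientificName": "Dryopteris borreri (Newman) Oberh. & Tavel",
        "rank": "SPECIES",
        "matchType": "EXACT",  # assumed value
    },
}

result = match_with_fallback("Dryopteris borreri Newman", "Dryopteris borreri", CANNED.get)
print(result["scientificName"], result["matchType"])
```

Keeping `fetch` injectable leaves the HTTP layer swappable and lets the logic be tested without network access. Note that the retry discards the authorship, so its result may need manual review.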

abubelinha commented 2 weeks ago

Thanks @mdoering, the pro parte explanation makes sense.

I caught a new possible issue, or at least I cannot explain why this happens. I am looking for the backbone accepted name for a grass subspecies name published here (but not yet included in the backbone with any status, I think): Festuca vasconcensis (Markgr.-Dannenb.) Auquier & Kerguélen subsp. actiophyta (M.I.Gut.) Mart.-Sagarra & Devesa

Why does this happen? The 2nd one is not an exact match, is it? I would expect to see a matchType=HIGHERRANK in both cases.

abubelinha commented 1 week ago

Another unfortunate match not being reported as HIGHERRANK. This is even worse than the previous one because now the requested subspecies name is not even shown among the alternatives. Why so? And why is matchType=EXACT?

Linaria polygalifolia Hoffmanns. & Link subsp. aguillonensis (García Mart.) Castrov. & Lago

I would expect it to match this backbone accepted subspecies (or at least rank it very high): https://www.gbif.org/species/7423643