gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Incorrect filtering by red list category #495

Open andrewrodrigues opened 3 years ago

andrewrodrigues commented 3 years ago

There are a number of examples where occurrences are being filtered by an incorrect Red List category.

For example, the Least Concern species Anaxyrus hemiophrys (synonym: Bufo hemiophrys) comes out as Extinct in the Wild (https://www.gbif-uat.org/occurrence/search?taxon_key=2422941&occurrence_status=present&iucn_red_list_category=EW). On the Red List website, it is Anaxyrus baxteri that is Extinct in the Wild (https://www.iucnredlist.org/search?query=Bufo%20hemiophrys&searchType=species).

Another example is the Least Concern species Betula pubescens, which comes out as Critically Endangered (https://www.gbif-uat.org/occurrence/search?occurrence_status=present&iucn_red_list_category=CR). In the IUCN Red List, it is Betula klokovii that is Critically Endangered (https://www.iucnredlist.org/search?query=Betula%20pubescens&searchType=species).

dschigel commented 3 years ago

This looks very important to fix, to the extent that GBIF backbone matching might not be welcome here at all. The Betula case is very striking, as B. pubescens must be one of the commonest species in the world. It used to be common in the FSU to describe species (including species of Betula) that were declared rare and restricted but later did not hold their species status internationally. So GBIF may be correctly "suspicious" that B. klokovii is in fact just a form of B. pubescens, but it is incorrect to assume that a senior (and very common) synonym (pubescens) inherits its rarity status from its rare child (klokovii). I suggest that the IUCN category be firmly glued to the name as provided by IUCN, irrespective of the name's status in the GBIF backbone. How does that sound to you, @andrewrodrigues?
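A minimal sketch of this proposal, using toy data modelled on the Betula example in this thread. The record layout, the category codes attached to each row, and the function names are illustrative assumptions, not the actual pipelines code or the real IUCN checklist structure:

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class IucnRecord:
    """One row of a hypothetical, simplified IUCN checklist."""
    name: str
    category: str
    # Set only when the checklist lists `name` as a synonym of another name;
    # in that case `category` is the category of the accepted name.
    accepted_name: Optional[str] = None


# Toy rows (assumption): B. pubescens appears once as an assessed name and
# once as a synonym of B. klokovii, as described in the thread.
CHECKLIST = [
    IucnRecord("Betula pubescens", "LC"),
    IucnRecord("Betula klokovii", "CR"),
    IucnRecord("Betula pubescens", "CR", accepted_name="Betula klokovii"),
]


def categories_including_synonyms(name: str) -> Set[str]:
    """Every category reachable for `name`, including via synonym rows."""
    return {r.category for r in CHECKLIST if r.name == name}


def categories_assessed_name_only(name: str) -> Set[str]:
    """Only categories attached to `name` as an assessed (non-synonym) name."""
    return {
        r.category
        for r in CHECKLIST
        if r.name == name and r.accepted_name is None
    }


print(categories_including_synonyms("Betula pubescens"))
print(categories_assessed_name_only("Betula pubescens"))
```

With synonym rows included, the common species picks up the category of the rare one; restricting the lookup to assessed names keeps each category attached to the name IUCN actually evaluated.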

andrewrodrigues commented 3 years ago

From my understanding, for Red List categories the matching is done against the Red List taxonomy and not the GBIF backbone, as we had to find a way to get around the different taxonomies. This is one reason I was surprised to see the B. pubescens example.

mdoering commented 2 years ago

Betula pubescens exists twice in the IUCN checklist: https://www.gbif.org/species/search?q=Betula%20pubescens&dataset_key=19491596-35ae-4a91-9a98-85cf505f1bd3&origin=SOURCE&qField=SCIENTIFIC&advanced=1

Once as Least Concern, and once as a synonym of an Endangered species, Betula klokovii.

mdoering commented 2 years ago

So does the IUCN site: https://www.iucnredlist.org/search?query=Betula%20pubescens&searchType=species

ManonGros commented 2 years ago

Aside from what is already mentioned in the thread, here are some other species that come up in the filter although they probably shouldn't from what I could see (it might be worth double checking):

In most cases, this seems to be an issue with a synonym being labelled by the IUCN. Given that the taxonomy used by the IUCN is different from ours, could there be a way to label only the species cited by the IUCN, without adding the accepted species to the filter?

ManonGros commented 2 years ago

In the case of E. minima, the original species assessed by the IUCN is Euphrasia mendoncae: https://www.iucnredlist.org/species/162307/5571714 (which I believe we don't have in the GBIF taxonomy), but I guess we use the synonymy relationship given by the IUCN to find a match in our taxonomy? Perhaps we should not do that? How much information would be lost if we only labelled species names that are in the GBIF backbone?

mdoering commented 2 years ago

There was a lengthy discussion in the original thread concluding that we should include synonyms: https://github.com/gbif/pipelines/issues/257#issuecomment-776571632

It will always be problematic to match two different taxonomies without the use of proper taxon concepts. But this is also true of ALL other occurrence identifications; it usually just isn't that obvious. All GBIF occurrences have weak identifications and are only roughly linked to our backbone. We don't really know the original taxon concept that was used when they were determined.

Even if we exclude all synonyms and only use exactly matching names, the matching might still be wrong concept-wise. And we would lose lots of matches, though I can't say how many.

ManonGros commented 2 years ago

I understand, @mdoering, but at the same time, some of the IUCN taxonomy doesn't seem to follow the consensus and doesn't always cite references. Would it make sense to exclude synonyms that aren't documented elsewhere? Or would that be too complicated?

mdoering commented 2 years ago

It's not complicated, but I am not sure if that is what we really want. We should explore various cases to determine the impact and desired outcome for each of them.
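One way to start exploring the cases is to bucket each name by how it receives a category. The sketch below is only an illustration under assumptions: the `(name, category, is_synonym)` row format, the bucket labels, and the name "Examplea vetus" are hypothetical, and only the Betula rows are modelled on this thread.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, bool]  # (name, category, is_synonym), hypothetical layout

ROWS = [
    ("Betula pubescens", "LC", False),  # assessed directly
    ("Betula pubescens", "CR", True),   # also listed as a synonym of B. klokovii
    ("Betula klokovii", "CR", False),   # assessed directly
    ("Examplea vetus", "VU", True),     # made-up name, labelled only via synonymy
]


def classify(rows: Iterable[Row]) -> Dict[str, str]:
    """Bucket each name as 'direct', 'synonym_only' or 'conflict'."""
    direct = defaultdict(set)
    via_synonym = defaultdict(set)
    for name, category, is_synonym in rows:
        (via_synonym if is_synonym else direct)[name].add(category)

    result = {}
    for name in set(direct) | set(via_synonym):
        d = direct.get(name, set())
        s = via_synonym.get(name, set())
        if d and (s - d):
            # A synonym row adds a category the assessed name does not have.
            result[name] = "conflict"
        elif d:
            result[name] = "direct"
        else:
            # Dropping synonyms would remove this name's label entirely.
            result[name] = "synonym_only"
    return result


buckets = classify(ROWS)
print(Counter(buckets.values()))
```

Names in the "conflict" bucket are the ones reported in this issue, while "synonym_only" names are the matches that would be lost if synonyms were excluded altogether.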