Acanthiza / envClean

Clean biological data from large unstructured dataset(s)
https://acanthiza.github.io/envClean/

Override for gbif taxonomy in make_taxonomy #18

Closed Calamanthus closed 1 month ago

Calamanthus commented 6 months ago

We need some sort of override to deal with cases where the gbif taxonomic backbone is allocating the wrong taxonomy.

This includes the Hooded Plover (old name) being allocated to Red-necked Phalarope:

rgbif::name_backbone_checklist(c("Thinornis rubricollis","Phalaropus lobatus"))

A tibble: 2 × 25
  usageKey acceptedUsageKey scientificName        canonicalName rank  status confidence matchType kingdom phylum order family genus species kingdomKey phylumKey
1  4849941          5739290 Thinornis rubricolli… Thinornis ru… SPEC… SYNON…         98 EXACT     Animal… Chord… Char… Scolo… Phal… Phalar…          1        44
2  5739290               NA Phalaropus lobatus (… Phalaropus l… SPEC… ACCEP…         99 EXACT     Animal… Chord… Char… Scolo… Phal… Phalar…          1        44

Also, another example I found recently, where Cassia nemophila var. coriacea was being allocated to a Brazilian species in a different genus (Chamaecrista coriacea) instead of to Senna artemisioides:

rgbif::name_backbone_checklist(c("Chamaecrista coriacea","Cassia nemophila","Cassia nemophila var. coriacea"))

A tibble: 3 × 25
  usageKey scientificName       canonicalName rank  status confidence matchType kingdom phylum order family genus species kingdomKey phylumKey classKey orderKey
1  2950087 Chamaecrista coriac… Chamaecrista… SPEC… ACCEP…         99 EXACT     Plantae Trach… Faba… Fabac… Cham… Chamae…          6   7707728      220     1370
2  5357114 Cassia nemophila A.… Cassia nemop… SPEC… SYNON…         97 EXACT     Plantae Trach… Faba… Fabac… Senna Senna …          6   7707728      220     1370
3  5357116 Cassia nemophila va… Cassia nemop… VARI… SYNON…         98 EXACT     Plantae Trach… Faba… Fabac… Cham… Chamae…          6   7707728      220     1370

The above example came out as having 100% of its national AOO and EOO in the current study area because of the erroneous attribution to a Brazilian species, when it should be an ultra-common plant in Senna artemisioides. Interestingly, as seen above, without the variety (i.e. just 'Cassia nemophila') gbif attributes the taxonomy correctly.
Calamanthus commented 6 months ago

Another slightly different example... Gypsophila australis was identified by red as having 100% of its range in the current study area. It is accepted by the gbif taxonomic backbone, but Flora of Australia and ALA regard it as a synonym of Gypsophila tubulosa, which is identified in bdbsa as a weed. GBIF does not recognise G. tubulosa and assigns it only to family level, so it cannot be a straight taxonomy fix. It may be easiest to just attribute this to genus (which is accepted) and effectively remove it from the data, as I can think of lots of problems with other potential solutions. In this case it won't matter if the taxon is lost, as it is a weed, but I just hope there isn't a similar example for a species that is indigenous. I hate this stuff! I'll start compiling these examples into a fixes table.

Acanthiza commented 6 months ago

I've added a taxonomy_overrides argument to make_taxonomy. It follows a similar form to taxonomy_fixes but is implemented differently (via left_join) because the string-to-find and the string-to-replace are in different columns. However, another tack-on fix at this point highlights the precarious nature of our current taxonomy workflow. I've changed the appropriate code in envPIA, but envClean::taxonomy_overrides currently only deals with the Hooded Plover issues. I'll leave this open until it has been tested more thoroughly.
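
For readers following along, here is a minimal sketch of the left_join idea described above. It is not the code in make_taxonomy. The column names (original_name, taxa, taxa_override) and both data frames are hypothetical, and the replacement names are illustrative values only. The point it shows is that the name to find lives in one column and the name to replace lives in another, so a plain string replacement on taxa would also rewrite genuine Phalaropus lobatus records.

```r
library(dplyr)

# Hypothetical slice of a lutaxa-style lookup:
# the name as recorded, and the taxa the backbone matched it to.
lutaxa <- data.frame(
  original_name = c("Thinornis rubricollis",
                    "Phalaropus lobatus",
                    "Cassia nemophila var. coriacea"),
  taxa = c("Phalaropus lobatus",
           "Phalaropus lobatus",
           "Chamaecrista coriacea"),
  stringsAsFactors = FALSE
)

# Hypothetical overrides table: keyed on the original name,
# carrying the taxa that should be used instead (illustrative values).
overrides <- data.frame(
  original_name = c("Thinornis rubricollis",
                    "Cassia nemophila var. coriacea"),
  taxa_override = c("Thinornis cucullatus",
                    "Senna artemisioides"),
  stringsAsFactors = FALSE
)

fixed <- lutaxa %>%
  # Match on the string-to-find column ...
  left_join(overrides, by = "original_name") %>%
  # ... and replace in the other column only where an override exists.
  mutate(taxa = coalesce(taxa_override, taxa)) %>%
  select(-taxa_override)

print(fixed)
```

Rows without an override keep the taxa the backbone assigned, so the genuine Phalaropus lobatus record is left alone.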

Calamanthus commented 6 months ago

Great, thanks. I'll add the other cases above. Where is envClean::taxonomy_overrides? I can't see it under ~/packages/envClean...

Acanthiza commented 6 months ago

It's in the taxonomy_fixes.R file (in data-raw).

Calamanthus commented 6 months ago

Ok, I was thinking it was its own script

Calamanthus commented 6 months ago

Just looking at the code... if this is just changing the taxa in the lutaxa result, then the best key will presumably not be relevant to the updated taxa. Do we want an override field in lutaxa to flag that the gbif taxonomy has been changed and the best key is no longer relevant?

Acanthiza commented 6 months ago

The link to the taxonomic hierarchy to use (in taxa$taxonomy) is via the taxa column. The key wasn't being used. I've now removed the key from lutaxa output. I've tried to think of a way to make these fixes earlier so that the correct other attributes (e.g. the status, matchType and rank) come through, but haven't been able to work out how to implement it. So yes, perhaps worth flagging that those fields may be incorrect in the lutaxa results if a 'fix' has been made.
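
One possible shape for such a flag, as a rough sketch only: it assumes the same hypothetical column names as an overrides table keyed on the original name (original_name, taxa, taxa_override), and a logical override column that is not part of the current lutaxa output. The replacement name is an illustrative value.

```r
library(dplyr)

# Hypothetical lutaxa-style result with attributes from the backbone match.
lutaxa <- data.frame(
  original_name = c("Thinornis rubricollis", "Phalaropus lobatus"),
  taxa = c("Phalaropus lobatus", "Phalaropus lobatus"),
  status = c("SYNONYM", "ACCEPTED"),
  stringsAsFactors = FALSE
)

# Hypothetical overrides table (illustrative replacement name).
overrides <- data.frame(
  original_name = "Thinornis rubricollis",
  taxa_override = "Thinornis cucullatus",
  stringsAsFactors = FALSE
)

flagged <- lutaxa %>%
  left_join(overrides, by = "original_name") %>%
  mutate(
    # TRUE where the backbone result was replaced by an override.
    override = !is.na(taxa_override),
    taxa = coalesce(taxa_override, taxa)
  ) %>%
  select(-taxa_override)

# status (and any matchType or rank columns) still describe the original
# backbone match, so treat them as unreliable wherever override is TRUE.
unreliable <- filter(flagged, override)
print(unreliable)
```

Downstream code could then filter on override before trusting status, matchType or rank.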

Acanthiza commented 1 month ago

Closing as this has moved on and is implemented (differently) in the current version of make_taxonomy, which calls galah::search_taxa instead of rgbif::name_backbone_checklist.

Calamanthus commented 1 month ago

I went to close this last week for the same reason, but left it open, as the overrides for galah are still not working properly. I will start a new issue for that when I have a chance.