AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

`select_taxa` returns incorrect IDs for more specific search terms #96

Closed T-LeB closed 2 years ago

T-LeB commented 2 years ago

Describe the bug When I run the ala_occurrences and select_taxa functions with a list of 360 target species, they return a collection of occurrences with issues including:

I think the values that select_taxa passes to ala_occurrences may be slightly wrong or generalised. If I run select_taxa over the list on its own, every species has an exactly matched ScientificName, but the species column excludes subspecies and species not formally described.

galah version 1.31

To Reproduce Steps to reproduce the behaviour:

  1. Read in target_species.csv (attached) target_species.csv

  2. Run the following code chunk using galah package ala_recs <- ala_occurrences(taxa = select_taxa(target_species$ScientificName))

Expected behaviour What I would expect to happen is that the ala_occurrences function returns all available records of the species in the target_species data frame and only these species.

Additional context All species in this list are currently taxonomically valid, even though some are not formally described. All of them can be, and have been, searched for manually in ALA online exactly as they are spelt in this list, and they return results correctly. So the records are there; they just don't seem to be returned to me for some species, while in other cases additional species not on the list are returned.

mjwestgate commented 2 years ago

Hi! It sounds like the issue here is that select_taxa is not matching the species you want, e.g.

spp <- read.csv("target_species.csv")
taxa <- select_taxa(spp[[1]])

# how many 'incorrect' matches?
length(which(taxa$search_term != taxa$scientific_name)) # n = 15

Looking at these in detail, Diuris pedunculata gives a 'misappliedName' error and defaults to the genus Diuris instead. This might explain some of the issues you describe with unnamed species or subspecies that you don't want; anything in that genus is being returned by your search, regardless of whether it is assigned to a species or subspecies.

Otherwise, the search has returned what the ALA believes is the accepted name for each taxon. As another example, Eucalyptus cannonii returns Eucalyptus macrorhyncha subsp. cannonii (and therefore the species name Eucalyptus macrorhyncha); so the results are different from your input, but are not 'wrong' according to our taxonomic information. It is possible that our taxonomic information is incorrect or out of date; but that isn't a problem with galah per se.

This is important because ala_occurrences indexes records using the taxon_concept_id column from select_taxa, so if the results from select_taxa aren't what you want, the occurrence records won't be either. So my advice would be to inspect the results from select_taxa before downloading your records.
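
A minimal sketch of that inspection step, using a small mock data frame in place of a real select_taxa() result so that it runs offline. The search_term and scientific_name columns are the ones used above; the rows are illustrative values from this thread, and flag_mismatches() is a hypothetical helper, not part of galah.

```r
# Mock of a select_taxa() result (illustrative rows only)
taxa <- data.frame(
  search_term = c("Diuris pedunculata", "Eucalyptus cannonii", "Diuris venosa"),
  scientific_name = c("Diuris",
                      "Eucalyptus macrorhyncha subsp. cannonii",
                      "Diuris venosa"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the matched name is missing
# or differs from the term that was searched for
flag_mismatches <- function(taxa) {
  differs <- is.na(taxa$scientific_name) |
    taxa$search_term != taxa$scientific_name
  taxa[differs, , drop = FALSE]
}

mismatches <- flag_mismatches(taxa)
print(mismatches)
```

Rows flagged this way are not necessarily wrong (an accepted-name substitution is flagged too), so the output is a list to review by eye rather than a list to discard.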

T-LeB commented 2 years ago

Hi Martin,

Thanks for having a look at this. That makes sense and should be fine for the species that select_taxa is correctly identifying even if they are including more than just the subspecies for instance.

However, does that mean I would have to manually download the data for any species incorrectly recognised by select_taxa?

And for the species which are not formally described, which seem to have the correct taxon concept ID (usually in the format ALA_Typhonium_sp_aff_brownii rather than https://id.biodiversity.org.au/node/apni/2918275) but return records with NAs in place of a species name, is there any way to get the original name to carry through the process so I can assign the records to the correct species? Or would this have to be done manually as well?

Thanks for your time and help with this and apologies if these questions go beyond the normal bounds of a github issue.

Cheers

Tom

daxkellie commented 2 years ago

Hi Tom,

Unfortunately, yes, I think you might need to download the data for all species recognised by select_taxa() and do some filtering and data cleaning to make sure you are returning the records you want after you run ala_occurrences().

However, I think there might be a misunderstanding with how NAs in place of species names after running select_taxa() affect results from ala_occurrences(). I'll give you an example with the species in the genus Diuris that you mentioned were problematic in your email to ALA support.

First I'll create a target_species tibble including Diuris species names. Then I'll search using select_taxa()

# packages
library(tidyverse)
library(galah)

# For reproducibility, I only used the species from target_species.csv within genus Diuris
target_species <- tibble(ScientificName = c("Diuris aequalis", "Diuris arenaria",
                                            "Diuris bracteata", "Diuris byronensis",
                                            "Diuris disposita", "Diuris eborensis",
                                            "Diuris flavescens", "Diuris pedunculata",
                                            "Diuris praecox", 
                                            "Diuris sp. (Oaklands, D.L. Jones 5380)",
                                            "Diuris venosa"))

# Use select_taxa() to search for species on ALA
taxa <- select_taxa(target_species$ScientificName)

Now if I check to see whether all species names match between target_species and the results returned from select_taxa(), there are 2 that don't match because they have NAs under their species name.

# Are the species in target_species the same as in taxa?

# How many species names from taxa match with target_species?
missing_taxa_in_taxa <- taxa %>% 
  as_tibble() %>% 
  filter(!species %in% target_species$ScientificName)

missing_taxa_in_taxa %>% count() # 2 missing
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1     2

missing_taxa_in_taxa %>% select(species)
#> # A tibble: 2 x 1
#>   species
#>   <chr>  
#> 1 <NA>   
#> 2 <NA>

But when I look at the scientific_name column, the first row seems to more broadly refer to Diuris (which might include the species we want), and the second row is a very specific Diuris sp. that we specified in our original target_species list.

In other words, just because there is no species name in the species column doesn't mean the "wrong" result is being returned. The taxon_concept_id is probably still correct for the species, and this can be viewed more clearly in the scientific_name column.

missing_taxa_in_taxa %>% select(scientific_name, species)
#> # A tibble: 2 x 2
#>   scientific_name                        species
#>   <chr>                                  <chr>  
#> 1 Diuris                                 <NA>   
#> 2 Diuris sp. (Oaklands, D.L. Jones 5380) <NA>

Now, I can download the Diuris records using ala_occurrences(). Running the following code returns 149 scientific names.

# Get records
ala_recs_diuris <- ala_occurrences(taxa = taxa)
ala_recs_diuris %>% distinct(scientificName) %>% count() # 149 returned

A quick solution to only include the records you originally searched for is to filter ala_recs_diuris to only return names within target_species.

# filter
ala_recs_filtered <- ala_recs_diuris %>% 
  filter(scientificName %in% target_species$ScientificName)

You might need to double check that you are getting everything you want, though, as this method might be prone to mismatch - you might miss some species if the scientific name in the ALA doesn't match the one you provided. Alternatively, if you are happy with the taxon_concept_ids returned by select_taxa, you could filter by id's instead.
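
A rough sketch of the ID-based filter, on mock data so it runs without querying the ALA. It assumes the occurrence download includes a taxonConceptID column and that each record carries the same ID that select_taxa() returned; records assigned to a child taxon (for example a subspecies) may carry a different ID and would be dropped by this filter. The IDs and rows below are made up for illustration.

```r
# Mock select_taxa() output: original search term plus matched ID
taxa <- data.frame(
  search_term = c("Diuris venosa", "Diuris sp. (Oaklands, D.L. Jones 5380)"),
  taxon_concept_id = c("id-A", "id-B"),
  stringsAsFactors = FALSE
)

# Mock ala_occurrences() output with an ID column requested
occs <- data.frame(
  scientificName = c("Diuris venosa", NA, "Diuris alba"),
  taxonConceptID = c("id-A", "id-B", "id-C"),
  stringsAsFactors = FALSE
)

# Keep only records whose ID was returned by select_taxa()
kept <- occs[occs$taxonConceptID %in% taxa$taxon_concept_id, , drop = FALSE]

# Attach the original search term to each kept record, so records
# with a missing name can still be assigned to a target species
labelled <- merge(kept, taxa,
                  by.x = "taxonConceptID", by.y = "taxon_concept_id",
                  all.x = TRUE)
print(labelled)
```

Joining on the ID rather than the name is what lets the original search term carry through for records whose name comes back as NA.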

T-LeB commented 2 years ago

Hi Dax

Thanks for that run-through; it certainly makes more sense how the select_taxa function is working now. It doesn't quite address the problem I was referring to in my last response, though I might not have understood where the issue was well enough to explain myself.

From what you've said here, the select_taxa function is working as it should, and the species column for whatever reason only includes formally described species. However, I think something goes wrong when it passes the list of matched species to the ala_occurrences function, and these unnamed species (Bertya sp. (Chambigne NR, M. Fatemi 24), Eucalyptus sp. cattai etc.) are getting lost in the process even though exact matches are being found and they have some value in the taxon_concept_id column.

I wanted to make sure I wasn't just missing something, so I ran that missing-species code chunk on the full target_species list provided above, modified so that I'd only get the species from my original list that were missing from the output, rather than the species that had been added in error like all the extra Diuris. For reference:

target_taxa_in_taxa <- ala_recs_names %>% as_tibble() %>% filter(species %in% target_species$ScientificName)

missing_taxa_in_taxa <- target_species %>% as_tibble() %>% filter(!ScientificName %in% target_taxa_in_taxa$species)

There are 50 taxa on the original list that aren't in the ala_occurrences output. Some are instances where they are there but under another name that the ALA's taxonomy has corrected (e.g. Commersonia procumbens -> Androcalva procumbens). Some are subspecies where the ALA has ignored the subspecies (e.g. Boronia inflexa subsp. torringtonensis -> Boronia inflexa). But most are the undescribed species that end up as NA in the species column of the select_taxa output.

To check your filter by taxon_concept_id workaround I looked through all the possible fields that could be included in the output and included every one that sounds relevant to species identification and that you could use in combination with the select_taxa output dataframe to match records to species.

ala_recs <- ala_occurrences(
  taxa = select_taxa(target_species$ScientificName),
  filters = select_filters(
    stateProvince = "New South Wales",
    basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  ),
  columns = select_columns(
    "raw_scientificName", "scientificName", "species",
    "scientificNameID", "taxonConceptID"
  )
)

If you filter any of these columns by the original search terms from the target_species list, or by the matched taxon_concept_id from the select_taxa output, most of the 50 missing species that are unnamed still do not appear.

I think this is because when the select_taxa output is passed to ala_occurrences, whichever column is sent to the ALA to compile the records has the wrong data in it. I think this must be the taxon_concept_id column and/or the species column, because it won't recognise the taxon_concept_id for some of these taxa when it's in the format ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24, but it will for others, e.g. Eucalyptus sp. Cattai. This appears to depend on whether or not their respective ALA pages have these identifiers linked (e.g. https://bie.ala.org.au/species/https://id.biodiversity.org.au/node/apni/2892151#names). I don't understand the subspecies cases, because their taxon_concept_id is correct, but I guess it might relate to the species column excluding subspecies.

I understand this probably falls outside the scope of the actual workings of the galah package and that there's ongoing work to improve and streamline the ALA. But from a user standpoint it's pretty opaque and unintuitive when the same search terms return the correct results online. I only happened to catch this issue by chance, and it has then taken your help and a few hours over a couple of days to understand the depth of it, and I still have to figure out a workaround that fits into a repeatable workflow.

I appreciate your time and help in working through this though and in general think the package is a massive help in using ALA data.

daxkellie commented 2 years ago

This was an exceptional explanation, Tom. As a result, I was able to get more of an idea about the source of the error and I appreciate you going to the effort of checking this in more detail (and for saying nice things about our package).

First, I'll reproduce the error you're referring to more clearly for documentation.

The error occurs when select_taxa can't find a taxon_concept_id that matches a search term. select_taxa will still return a taxon_concept_id, but it is just a conglomeration of the original term. This, however, is not a valid taxon_concept_id that ala_occurrences() recognises, and will throw an error when run alone.

library(galah)

# galah_config(email = "your-email@email.com")

# search for species using select_taxa
taxa <- select_taxa("Bertya sp. (Chambigne NR, M. Fatemi 24)")

# Get taxon_concept_id
taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"

# Pass this id into ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> Error in check_count(count): This query does not match any records.
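
As a rough pre-check before calling ala_occurrences(), IDs that do not look like resolvable URLs can be set aside for review. This is only a heuristic sketch: it assumes that matched IDs take the https://id.biodiversity.org.au/... form seen in this thread, and, as noted earlier in the discussion, some ALA_-style IDs are still recognised, so the result is a review list rather than a rejection list.

```r
# Example IDs copied from this thread
ids <- c("ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24",
         "https://id.biodiversity.org.au/node/apni/2892151")

# Heuristic (an assumption, not a galah rule): URL-shaped IDs are
# treated as matched, everything else is flagged for a manual check
looks_like_url <- grepl("^https?://", ids)
needs_review <- ids[!looks_like_url]

print(needs_review)
```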

I've been looking more closely at where the errors are happening in the ala_occurrences results, and it's not always for the same reason. Some are caused by how galah handles syntax in the search terms specified in target_species.

For example, I noticed by accident that running the same search terms without parentheses made select_taxa return the correct taxon_concept_id. This worked for a few other species in target_species that were not returned by ala_occurrences.

# with parentheses
taxa <- select_taxa(c("Bertya sp. (Chambigne NR, M. Fatemi 24)",
                      "Bertya sp. (Clouds Creek, M. Fatemi 4)",
                      "Diuris sp. (Oaklands, D.L. Jones 5380)"))

taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"
#> [2] "ALA_Bertya_sp_Clouds_Creek_M_Fatemi_4" 
#> [3] "ALA_Diuris_sp_Oaklands_D_L_Jones_5380"

# without parentheses
taxa <- select_taxa(c("Bertya sp. Chambigne NR, M. Fatemi 24",
                      "Bertya sp. Clouds Creek, M. Fatemi 4",
                      "Diuris sp. Oaklands, D.L. Jones 5380"))

taxa$taxon_concept_id
#> [1] "https://id.biodiversity.org.au/node/apni/2892151"  
#> [2] "https://id.biodiversity.org.au/node/apni/2907136"  
#> [3] "https://id.biodiversity.org.au/taxon/apni/51290527"

# Pass this id into ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> This query will return 49062 records

To try this with all the species in target_species, you can strip the parentheses with the following code. This should be a temporary fix for some incorrect taxon_concept_ids until we can correct this in galah.

library(tidyverse)
target_species_edited <- target_species %>%
  mutate(
    ScientificName = str_remove_all(ScientificName, "[()]")
  )

However, the %in% filtering method that I gave previously won't let you check whether this worked. As I mentioned, it matches exact terms, and the ALA's species names do not exactly match the ones in target_species. You might have to check these species manually using a string search function like stringr::str_detect() or grepl().

Other missing results from target_species were due to there being no records in the Atlas to find, so nothing was returned by ala_occurrences. Here is one example I noticed: https://bie.ala.org.au/species/ALA_Hibbertia_sp_Bankstown
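
One way such a string search could look in base R, sketched on mock vectors so it runs offline. The normalise() helper and the example values are assumptions for illustration: names are lower-cased and stripped of parentheses on both sides, then each target is searched for as a literal substring of the returned names.

```r
# Target names (parentheses already removed) and mock returned names
targets <- c("Diuris sp. Oaklands, D.L. Jones 5380",
             "Diuris venosa",
             "Hibbertia sp. Bankstown")
returned <- c("Diuris sp. (Oaklands, D.L. Jones 5380)",
              "Diuris venosa",
              "Diuris")

# Hypothetical helper: drop parentheses and ignore case
normalise <- function(x) tolower(gsub("[()]", "", x))

# For each target, is it a literal substring of any returned name?
found <- vapply(
  normalise(targets),
  function(term) any(grepl(term, normalise(returned), fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

print(data.frame(target = targets, found = found))
```

Because this is substring matching, a short target such as a bare genus name would match every species in that genus, so short terms still need checking by eye.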

We'll need to run more tests to find all of the sources of these select_taxa problems. Until then, I'm not certain we can say how to solve all of the problems that are causing your issue, but hopefully this helps temporarily until we can release an update.

T-LeB commented 2 years ago

Thanks for that breakdown and all your time and help on this. I think between what you've explained here and a little manual wrangling I should be able to come up with a workaround now.

daxkellie commented 2 years ago

Great news!

For easier reference, here is a summary of the identified issue outside of our discussion:

select_taxa() sometimes does not find the correct taxon_concept_id when a search term corresponds to an existing species in the ALA but contains parentheses. Fixing this will improve the flexibility of select_taxa().

# search terms with parentheses
taxa <- select_taxa(c("Bertya sp. (Chambigne NR, M. Fatemi 24)",
                      "Bertya sp. (Clouds Creek, M. Fatemi 4)",
                      "Diuris sp. (Oaklands, D.L. Jones 5380)"))

taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"
#> [2] "ALA_Bertya_sp_Clouds_Creek_M_Fatemi_4" 
#> [3] "ALA_Diuris_sp_Oaklands_D_L_Jones_5380"

# Pass IDs to ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> Error in check_count(count): This query does not match any records.

# search terms without parentheses
taxa <- select_taxa(c("Bertya sp. Chambigne NR, M. Fatemi 24",
                      "Bertya sp. Clouds Creek, M. Fatemi 4",
                      "Diuris sp. Oaklands, D.L. Jones 5380"))

taxa$taxon_concept_id
#> [1] "https://id.biodiversity.org.au/node/apni/2892151"  
#> [2] "https://id.biodiversity.org.au/node/apni/2907136"  
#> [3] "https://id.biodiversity.org.au/taxon/apni/51290527"

# Pass IDs to ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> This query will return 49062 records