AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

`select_taxa()` breaks when any of the names in `query` are 32 or 36 characters long #23

Closed mjwestgate closed 3 years ago

mjwestgate commented 3 years ago

I found a bug when searching for a large number of taxonomic names at once. First, the error messages were misleading, taking the form: `No match found for identifier [supplied name here]`

More importantly, the resulting data.frame only contains information on issues, i.e.:

'data.frame':   1175 obs. of  1 variable:
 $ issues: chr  "noMatch" "noMatch" "noMatch" "noMatch" ...

Digging into the code, it appears that lines 190-193 of `select_taxa()` automatically interpret strings of length 32 or 36 as IDs rather than taxonomic names:

    # Aus Fungi
    any(nchar(query) == 36) ||
    # CoL
    any(nchar(query) == 32) ||

From the annotation, it appears that particular databases have identifiers of these lengths. However, something more specific is needed here, as this behaviour can't be changed by the user. Also, in any vector of sufficient length it is likely that at least one string will be 32 or 36 characters long (I found two in a vector of length 1175).
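To make the failure mode concrete, here is a minimal sketch in base R. It first reproduces the whole-vector length test quoted above on two ordinary names, then shows one possible stricter per-element check. `looks_like_id()` is a hypothetical helper, not part of galah, and the two patterns assume that the 36-character identifiers are hyphenated UUIDs and the 32-character ones are plain hexadecimal strings. I haven't verified that assumption against the databases themselves.

```r
# Two ordinary taxonomic names; the second happens to be 32 characters long
query <- c(
  "Eucalyptus camaldulensis",
  "Acacia dealbata subsp. subalpina"
)

# The length test from the excerpt above, applied to the whole vector:
# a single 32- or 36-character name is enough to make this TRUE
length_test <- any(nchar(query) == 36) || any(nchar(query) == 32)

# Hypothetical helper (not part of galah): classify each element separately.
# Assumption: 36-character IDs are hyphenated UUIDs and 32-character IDs
# are hexadecimal strings.
looks_like_id <- function(x) {
  uuid_36 <- "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
  hex_32 <- "^[0-9a-fA-F]{32}$"
  grepl(uuid_36, x) | grepl(hex_32, x)
}

# Returns one logical value per element, so names and IDs can be mixed
pattern_test <- looks_like_id(query)

# An illustrative UUID-shaped string (made up, not a real taxon identifier)
example_id <- "123e4567-e89b-12d3-a456-426614174000"
id_test <- looks_like_id(example_id)
```

Here `length_test` is `TRUE` even though neither entry is an identifier, whereas `pattern_test` is `FALSE` for both names and `id_test` is `TRUE`. Because the helper is vectorised, one unusual string no longer changes how the rest of `query` is interpreted.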

matildastevenson commented 3 years ago

Ah yes, I agree this test should be taken out. Could it be replaced with a check for numeric characters? It is probably also worth ensuring that, in general, one name can't corrupt the whole vector. I see two options here: