AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Should we implement fuzzy matching or case-insensitivity for `galah_` functions? #134

Closed by mjwestgate 8 months ago

mjwestgate commented 2 years ago

`search_` functions in galah are deliberately case-insensitive. Internally we use tolower() on both the query and the text being searched before running grepl(), so any combination of upper and lower case succeeds:

search_fields("WoRlDcLiM")
# A tibble: 38 × 4
   id      description                                                                     type   link                  
   <chr>   <chr>                                                                           <chr>  <chr>                 
 1 el10982 WorldClim 2.1: Temperature - warmest month max Max Temperature of Warmest Month layers https://www.worldclim…
 2 el10981 WorldClim 2.1: Temperature - seasonality Temperature Seasonality                layers https://www.worldclim…
 3 el10980 WorldClim 2.1: Temperature - isothermality Isothermality                        layers https://www.worldclim…
 4 el10990 WorldClim 2.1: Precipitation - wettest month Precipitation of Wettest Month     layers https://www.worldclim…
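
For readers unfamiliar with the pattern, here is a minimal sketch of lowercasing both sides before grepl(). It is not galah's actual source: `search_ci()` and the `labels` vector are hypothetical, and `fixed = TRUE` is an assumption made here so the query is treated as a literal string rather than a regular expression.

```r
# Hypothetical helper, not part of galah: case-insensitive substring search.
search_ci <- function(query, labels) {
  # Lowercase both the query and the candidate strings, then match literally.
  hits <- grepl(tolower(query), tolower(labels), fixed = TRUE)
  labels[hits]
}

# Made-up labels for illustration only.
labels <- c(
  "WorldClim 2.1: Temperature - seasonality",
  "Precipitation - annual",
  "worldclim legacy layer"
)

# Returns the first and third labels, whatever the case of the query.
search_ci("WoRlDcLiM", labels)
```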

In contrast, field names are case sensitive in galah_ functions, which can be challenging when field names are camel case, e.g.

galah_call() %>%
  galah_identify(Heleioporus) %>%
  galah_group_by(taxonConceptId) %>%   # actual field name is "taxonConceptID"
  atlas_counts()
# A tibble: 1 × 1
  count
  <int>
1  6683
Warning message:
Invalid field(s) detected.
ℹ See a listing of all valid fields with `show_all_fields()`.
ℹ Search for the valid name of a desired field with `search_fields()`.
✖ Invalid field(s): taxonConceptId. 

It would be fairly straightforward to support case-insensitivity within galah_filter, galah_select and galah_group_by; or even to go a step further and use agrep() for fuzzy matching (with some care needed).
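
A rough sketch of what that could look like, using only base R. Everything here is an assumption for discussion rather than galah code: `resolve_field()` is a hypothetical helper, `valid` stands in for the field names returned by show_all_fields(), and the `max_distance` default is arbitrary. The "care needed" part is that agrep() does approximate substring matching, so a short name can match several longer ones; the sketch only accepts a fuzzy hit when it is unique.

```r
# Hypothetical helper, not part of galah: map a user-supplied field name
# onto a valid one, or return NA if no unambiguous match exists.
resolve_field <- function(name, valid, fuzzy = FALSE, max_distance = 1L) {
  # 1. Exact match: nothing to do.
  if (name %in% valid) {
    return(name)
  }
  # 2. Case-insensitive match: accept only if exactly one field qualifies.
  ci <- valid[tolower(valid) == tolower(name)]
  if (length(ci) == 1L) {
    return(ci)
  }
  # 3. Optional fuzzy match via agrep(); again require a unique hit,
  #    because agrep() matches approximate substrings.
  if (fuzzy && length(ci) == 0L) {
    approx <- agrep(name, valid, max.distance = max_distance,
                    ignore.case = TRUE, value = TRUE)
    if (length(approx) == 1L) {
      return(approx)
    }
  }
  NA_character_
}

# Stand-in for the real list of valid fields.
valid <- c("taxonConceptID", "basisOfRecord", "scientificName", "year")

resolve_field("taxonConceptId", valid)                # case-insensitive hit
resolve_field("taxonConcptID", valid, fuzzy = TRUE)   # typo, fuzzy lookup
resolve_field("notAField", valid, fuzzy = TRUE)       # no match
```

Returning NA rather than guessing keeps the existing "Invalid field(s)" warning as the fallback whenever the match is ambiguous.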

daxkellie commented 1 year ago

I think this could be useful, but the error message already makes it pretty clear to the user that they should double-check the column name. In general, tidy syntax is consistent in that fuzzy matching takes place when names are in quotes (as in many {stringr} string-detection functions), whereas column names outside of quotes are case-sensitive (as in dplyr::select() or dplyr::filter()). So I think this can be left without action for the time being.

mjwestgate commented 8 months ago

Agreed, this is more risky than useful.