AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

`select_filters()` cannot create queries in more complex loops #90

Closed daxkellie closed 2 years ago

daxkellie commented 2 years ago

select_filters() builds the filters for queries to the API and is a central part of the ala_ group of functions. However, it causes problems inside more complex loops that combine specific taxa, filters and categories.

For example, say I want to extract counts of occurrence records by year using ala_counts(). We can create a filter for a sequence of years from 1999 to 2001. Notice that select_filters() creates a data.frame to build the query.

# packages
library(galah)
library(tidyverse)
#> Warning: package 'readr' was built under R version 4.1.1
library(purrr)

# select sequence of years as filter
select_filters(year = seq(1999, 2001)) # NOTE: this creates a data.frame
#>   variable logical          value                                       query
#> 1     year       = 1999,2000,2001 (year:"1999" OR year:"2000" OR year:"2001")
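The query column in the output above joins one field:"value" term per year with OR. A minimal base-R sketch of that string construction follows; build_or_query() is a hypothetical helper written for illustration and is not part of galah, so the package may build its query differently.

```r
# Hypothetical helper (not a galah function): combine several values of one
# field into a single OR query string like the one in the `query` column.
build_or_query <- function(field, values) {
  # One term per value, e.g. year:"1999"
  terms <- paste0(field, ":\"", values, "\"")
  # Join the terms with OR and wrap the result in parentheses
  paste0("(", paste(terms, collapse = " OR "), ")")
}

years <- seq(1999, 2001)
query <- build_or_query("year", years)
print(query)
```

Because the helper only receives evaluated values, it behaves the same whether the years are typed inline or stored in an object first.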

Now let's try assigning the sequence of years to an object first

years <- seq(1999, 2001)
select_filters(year = years)
#> Error in switch(df$logical, `=` = {: EXPR must be a length 1 vector

This doesn't seem to work.

Building the full request as a string with paste0() doesn't work either

years_short_command <- paste0("year = ", years)
years_short_command
#> [1] "year = 1999" "year = 2000" "year = 2001"

select_filters(years_short_command)
#> Error in if (filter_name %in% search_fields(type = "assertions")$id) {: argument is of length zero
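A character string such as "year = 1999" reaches a function as one unnamed value, not as an argument called year. The sketch below shows the difference in base R using capture_args(), a hypothetical stand-in that simply returns its arguments; it is not select_filters(), and whether do.call() would work with select_filters() itself is an assumption that has not been tested here.

```r
# Hypothetical stand-in for a function that takes named filter arguments
capture_args <- function(...) list(...)

# Passing a string: a single unnamed character value
as_string <- capture_args("year = 1999")

# Passing a named list through do.call(): a real `year` argument
as_named <- do.call(capture_args, list(year = 1999))

print(names(as_string))  # no argument names
print(names(as_named))   # the argument is named `year`
```

Building a named list and handing it to do.call() is the usual base-R way to assemble arguments programmatically.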

This limitation of select_filters() becomes a problem when running more complex loops with functions like ala_counts(). If I want to get counts of occurrence records by kingdom and by year, I can run a loop using purrr::map() like so

# Extract kingdom names
kingdoms <- ala_counts(group_by = "kingdom", limit = 10)
kingdom_names <- pull(kingdoms, kingdom)

# loop counts for each kingdom and year
kingdom_names %>%
  map(~ala_counts(
    taxa = select_taxa(list(kingdom = .x)),
    filters = select_filters(year = seq(1999, 2001)),
    group_by = "year")) %>% 
  tibble(
    kingdom = kingdom_names,
    y = .) %>% 
  unnest(y) %>%
  select(-name)
#> # A tibble: 22 x 3
#>    kingdom   year    count
#>    <chr>     <chr>   <dbl>
#>  1 Animalia  2000  2126159
#>  2 Animalia  2001  2023781
#>  3 Animalia  1999  2001295
#>  4 Plantae   2001   642087
#>  5 Plantae   1999   621074
#>  6 Plantae   2000   463696
#>  7 Fungi     2001    18140
#>  8 Fungi     1999    14846
#>  9 Fungi     2000    12739
#> 10 Chromista 2000    34163
#> # ... with 12 more rows

Nice! But I can't use the same technique I used above with select_taxa() to loop within select_filters(). For example, say I want to get the counts of observations from different data providers by year

years <- seq(1999, 2001)
years %>%
  map(~ala_counts(
    filters = select_filters(list(year = .x)),
    group_by = "dataResourceName",
    limit = 10))
#> Error in if (filter_name %in% search_fields(type = "assertions")$id) {: argument is of length zero

No dice

Building the full argument including the select_filters() call, however, does work

years_long_command <- paste0("select_filters(year = ", years, ")")
years_long_command %>%
  map(~ala_counts(
    list(filters = .x),
    group_by = "dataResourceName",
    limit = 10)) %>%
  tibble(
    year = years,
    y = .) %>%
  unnest(y)
#> Warning in ala_counts(list(filters = .x), group_by = "dataResourceName", : This
#> field has 288 values. 10 will be returned. Change `limit` to return more values.
#> Warning in ala_counts(list(filters = .x), group_by = "dataResourceName", : This
#> field has 277 values. 10 will be returned. Change `limit` to return more values.
#> Warning in ala_counts(list(filters = .x), group_by = "dataResourceName", : This
#> field has 258 values. 10 will be returned. Change `limit` to return more values.
#> # A tibble: 30 x 3
#>     year dataResourceName                                                  count
#>    <int> <chr>                                                             <int>
#>  1  1999 Pelagic Fish Observations 1968-1999                               54059
#>  2  1999 Salinity Action Plan Flora Survey                                 14831
#>  3  1999 Western Australia Bird Surveys (1987-2015)                        12894
#>  4  1999 APIS - Antarctic Pack Ice Seals 1994-1999, plus historical data ~  9271
#>  5  1999 Western Australian Museum provider for OZCAM                       3187
#>  6  1999 Australian Museum provider for OZCAM                               3088
#>  7  1999 eBird Australia                                                    2421
#>  8  1999 ARGOS Satellite Tracking of animals                                1720
#>  9  1999 Seabird observations during long-line fisheries operations in wa~  1713
#> 10  1999 Museums Victoria provider for OZCAM                                1577
#> # ... with 20 more rows

But what if I want to extract counts by kingdoms and by year, and then group by a third category (in this case data providers)?

# data.frame of all combinations
kingdom_and_years <- crossing(kingdom_names, years_long_command)
kingdom_and_years %>% slice(1:7)
#> # A tibble: 7 x 2
#>   kingdom_names years_long_command         
#>   <chr>         <chr>                      
#> 1 Animalia      select_filters(year = 1999)
#> 2 Animalia      select_filters(year = 2000)
#> 3 Animalia      select_filters(year = 2001)
#> 4 Archaea       select_filters(year = 1999)
#> 5 Archaea       select_filters(year = 2000)
#> 6 Archaea       select_filters(year = 2001)
#> 7 Bacteria      select_filters(year = 1999)

# function to get counts
extract_yearly_counts <- function(x, y){ 
  ala_counts(
    taxa = select_taxa(list(kingdom = x)),
    list(data.frame(filters = y)),
    group_by = "dataResourceName",
    limit = 10)
}

# loop function with purrr::map2()
kingdom_and_years %>% 
  mutate(x = map2(.$kingdom_names, 
                  .$years_long_command, 
                  extract_yearly_counts))
#> Error: Problem with `mutate()` column `x`.
#> i `x = map2(.$kingdom_names, .$years_long_command, extract_yearly_counts)`.
#> x filters is not a data frame

Yet again, no dice.

Notice that the error says the filters are not a data frame. It seems that when looped in this way, select_filters() cannot create the query data.frame for each iteration and fails.
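One possible reading of the error is that a pasted command is still just text until R evaluates it. The base-R sketch below illustrates this with make_filter(), a hypothetical stand-in that returns a small data.frame; it is not the galah implementation, and eval(parse()) is shown only to make the distinction visible, not as a recommended fix.

```r
# Hypothetical stand-in that returns a filter-like data.frame
make_filter <- function(year) {
  data.frame(variable = "year", logical = "=", value = as.character(year))
}

# A pasted command is a character string, not a data.frame
command <- "make_filter(year = 1999)"
string_is_df <- is.data.frame(command)

# Only after parsing and evaluating does it become a data.frame
evaluated <- eval(parse(text = command))
evaluated_is_df <- is.data.frame(evaluated)

print(c(string = string_is_df, evaluated = evaluated_is_df))
```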

daxkellie commented 2 years ago

After discussion, we noticed that this issue could be solved by extending the group_by argument in ala_counts() to accept multiple fields. This might be a better solution than changing the behaviour of select_filters().

/cc @mjwestgate

daxkellie commented 2 years ago

The galah_group_by() function accepts multiple arguments, which removes the need for looping to achieve the same results as above.
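The idea behind grouping by several fields at once can be sketched with base R on a small made-up data frame. The records below are invented for illustration and are not atlas data, and aggregate() stands in for the grouped count; the sketch does not call galah.

```r
# Made-up records for illustration only (not real occurrence data)
records <- data.frame(
  kingdom = c("Animalia", "Animalia", "Plantae", "Plantae", "Plantae", "Fungi"),
  year    = c(1999, 2000, 1999, 1999, 2001, 2000)
)

# Looping approach: one count per kingdom/year combination
combos <- unique(records[, c("kingdom", "year")])
looped <- mapply(
  function(k, y) sum(records$kingdom == k & records$year == y),
  combos$kingdom, combos$year
)

# Grouped approach: a single call counts every combination
grouped <- aggregate(
  count ~ kingdom + year,
  data = transform(records, count = 1L),
  FUN = sum
)
print(grouped)
```

Both approaches give the same counts, but the grouped version needs no loop and no per-iteration filter.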