AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Wrapping around galah functions #207

Closed fontikar closed 8 months ago

fontikar commented 9 months ago

Describe the bug

Hi {galah} team 👋 Firstly I want to say, I love {galah} as an interface to GBIF nodes, and I use it all the time for my work 😄 So much so that I wanted to build my own wrapper function around {galah} functions, so I don't have to type out the same query every time I download an update of the data.

I have listed galah in my DESCRIPTION file so that I can import various functions from it. My function looks like this:

#' Default ALA query
#'
#' @param taxa Taxon name(s) passed to galah::galah_identify()
#' @param years Year(s) used in the `year` filter
query <- function(taxa, years){
  galah::galah_call() |> 
  galah::galah_identify(taxa) |> 
  galah::galah_filter(
    spatiallyValid == TRUE, 
    species != "",
    decimalLatitude != "",
    year == years,
    basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  ) |> 
  galah::galah_select(
    recordID, species, genus, family, decimalLatitude, decimalLongitude, 
    coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
    references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
  )
}

Unfortunately, my query() function returns a strange error 😞

Error in `FUN()`:
! Can't subset columns with `galah::galah_filter(...)`.
✖ `galah::galah_filter(...)` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
Run `rlang::last_trace()` to see where the error occurred.

I've tried to do some digging in my reprex below:

galah version

galah_1.5.3

To Reproduce

# Calling galah functions via namespace as you would in writing a wrapper function for an R package

query <- function(taxa, years){
  galah::galah_call() |> 
    galah::galah_identify(taxa) |> 
    galah::galah_filter(
      spatiallyValid == TRUE, 
      species != "",
      decimalLatitude != "",
      year == years,
      basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
    ) |> 
    galah::galah_select(
      recordID, species, genus, family, decimalLatitude, decimalLongitude, 
      coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
      references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
    )
}

# Set inputs
taxa = "Orthoptera"
years = seq(1923, 2023)

query(taxa, years)
#> Error in `FUN()`:
#> ! Can't subset columns with `galah::galah_filter(...)`.
#> ✖ `galah::galah_filter(...)` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
#> Backtrace:
#>      ▆
#>   1. ├─global query(taxa, years)
#>   2. │ └─galah::galah_select(...)
#>   3. │   └─galah:::parse_select(dots, group_chosen)
#>   4. │     ├─base::unlist(...)
#>   5. │     └─base::lapply(...)
#>   6. │       └─galah (local) FUN(X[[i]], ...)
#>   7. │         └─tidyselect::eval_select(a, data = df)
#>   8. │           └─tidyselect:::eval_select_impl(...)
#>   9. │             ├─tidyselect:::with_subscript_errors(...)
#>  10. │             │ └─rlang::try_fetch(...)
#>  11. │             │   └─base::withCallingHandlers(...)
#>  12. │             └─tidyselect:::vars_select_eval(...)
#>  13. │               └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
#>  14. │                 └─tidyselect:::as_indices_sel_impl(...)
#>  15. │                   └─tidyselect:::as_indices_impl(...)
#>  16. │                     └─vctrs::vec_as_subscript(x, logical = "error", call = call, arg = arg)
#>  17. └─rlang::cnd_signal(x)

# Break up the galah query, each working independently
# Call
galah::galah_call() |> 
  galah::galah_identify(taxa) 
#> An object of type `data_request` containing:
#> 
#> $identify
#> # A tibble: 1 × 1
#>   identifier                                                               
#>   <chr>                                                                    
#> 1 https://biodiversity.org.au/afd/taxa/0192736e-0955-4830-9977-61e07c843b28

# Filter
  galah::galah_filter(
    spatiallyValid == TRUE, 
    species != "",
    decimalLatitude != "",
    year == years,
    basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  ) 
#> # A tibble: 5 × 4
#>   variable        logical value                                            query
#>   <chr>           <chr>   <chr>                                            <chr>
#> 1 spatiallyValid  ==      "TRUE"                                           "(sp…
#> 2 species         !=      "\"\""                                           "(sp…
#> 3 decimalLatitude !=      "\"\""                                           "(de…
#> 4 year            ==      "c(\\\"1923\\\", \\\"1924\\\", \\\"1925\\\", \\… "(ye…
#> 5 basisOfRecord   ==      "c(\\\"HUMAN_OBSERVATION\\\", \\\"PRESERVED_SPE… "(ba…

# Select
  galah::galah_select(
    recordID, species, genus, family, decimalLatitude, decimalLongitude, 
    coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
    references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
  )
#> # A tibble: 16 × 2
#>    name                          type 
#>    <chr>                         <chr>
#>  1 recordID                      field
#>  2 species                       field
#>  3 genus                         field
#>  4 family                        field
#>  5 decimalLatitude               field
#>  6 decimalLongitude              field
#>  7 coordinateUncertaintyInMeters field
#>  8 eventDate                     field
#>  9 datasetName                   field
#> 10 basisOfRecord                 field
#> 11 references                    field
#> 12 institutionCode               field
#> 13 recordedBy                    field
#> 14 outlierLayerCount             field
#> 15 isDuplicateOf                 field
#> 16 sounds                        field

# Start joining the different parts together
# Identify + filter
# The result is missing the identifier info and the data_request structure
galah::galah_call() |> 
  galah::galah_identify(taxa)  |> 
  galah::galah_filter(
    spatiallyValid == TRUE, 
    species != "",
    decimalLatitude != "",
    year == years,
    basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  )  
#> # A tibble: 5 × 4
#>   variable        logical value                                            query
#>   <chr>           <chr>   <chr>                                            <chr>
#> 1 spatiallyValid  ==      "TRUE"                                           "(sp…
#> 2 species         !=      "\"\""                                           "(sp…
#> 3 decimalLatitude !=      "\"\""                                           "(de…
#> 4 year            ==      "c(\\\"1923\\\", \\\"1924\\\", \\\"1925\\\", \\… "(ye…
#> 5 basisOfRecord   ==      "c(\\\"HUMAN_OBSERVATION\\\", \\\"PRESERVED_SPE… "(ba…

# Filter + select
    galah::galah_filter(
      spatiallyValid == TRUE, 
      species != "",
      decimalLatitude != "",
      year == years,
      basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
    )  |> 
      galah::galah_select(
        recordID, species, genus, family, decimalLatitude, decimalLongitude, 
        coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
        references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
      )
#> Error in `FUN()`:
#> ! Can't subset columns with `galah::galah_filter(...)`.
#> ✖ `galah::galah_filter(...)` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
#> Backtrace:
#>      ▆
#>   1. ├─galah::galah_select(...)
#>   2. │ └─galah:::parse_select(dots, group_chosen)
#>   3. │   ├─base::unlist(...)
#>   4. │   └─base::lapply(...)
#>   5. │     └─galah (local) FUN(X[[i]], ...)
#>   6. │       └─tidyselect::eval_select(a, data = df)
#>   7. │         └─tidyselect:::eval_select_impl(...)
#>   8. │           ├─tidyselect:::with_subscript_errors(...)
#>   9. │           │ └─rlang::try_fetch(...)
#>  10. │           │   └─base::withCallingHandlers(...)
#>  11. │           └─tidyselect:::vars_select_eval(...)
#>  12. │             └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
#>  13. │               └─tidyselect:::as_indices_sel_impl(...)
#>  14. │                 └─tidyselect:::as_indices_impl(...)
#>  15. │                   └─vctrs::vec_as_subscript(x, logical = "error", call = call, arg = arg)
#>  16. └─rlang::cnd_signal(x)

Created on 2023-09-21 with reprex v2.0.2

Expected behaviour

library(galah)
#> 
#> Attaching package: 'galah'
#> The following object is masked from 'package:stats':
#> 
#>     filter

# The same query, run directly at the top level rather than inside a function
galah_call() |>                               
  galah_identify("Orthoptera") |>   
  galah_filter(
    spatiallyValid == TRUE,
    species != "",
    decimalLatitude != "",
    year == seq(1923, 2023),
    basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  ) |> 
  galah_select(
    recordID, species, genus, family, decimalLatitude, decimalLongitude, 
    coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
    references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
  )
#> An object of type `data_request` containing:
#> 
#> $identify
#> # A tibble: 1 × 1
#>   identifier                                                               
#>   <chr>                                                                    
#> 1 https://biodiversity.org.au/afd/taxa/0192736e-0955-4830-9977-61e07c843b28
#> 
#> $select
#> # A tibble: 16 × 2
#>    name                          type 
#>    <chr>                         <chr>
#>  1 recordID                      field
#>  2 species                       field
#>  3 genus                         field
#>  4 family                        field
#>  5 decimalLatitude               field
#>  6 decimalLongitude              field
#>  7 coordinateUncertaintyInMeters field
#>  8 eventDate                     field
#>  9 datasetName                   field
#> 10 basisOfRecord                 field
#> 11 references                    field
#> 12 institutionCode               field
#> 13 recordedBy                    field
#> 14 outlierLayerCount             field
#> 15 isDuplicateOf                 field
#> 16 sounds                        field
#> 
#> $filter
#> # A tibble: 5 × 4
#>   variable        logical value                                            query
#>   <chr>           <chr>   <chr>                                            <chr>
#> 1 spatiallyValid  ==      "TRUE"                                           "(sp…
#> 2 species         !=      "\"\""                                           "(sp…
#> 3 decimalLatitude !=      "\"\""                                           "(de…
#> 4 year            ==      "c(\\\"1923\\\", \\\"1924\\\", \\\"1925\\\", \\… "(ye…
#> 5 basisOfRecord   ==      "c(\\\"HUMAN_OBSERVATION\\\", \\\"PRESERVED_SPE… "(ba…

Additional context

It seems like the data_request object is not being passed from galah_identify to galah_filter and galah_select. I've cross-posted this issue in my own repo.
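One way to pin down where the data_request gets lost might be a small guard placed between pipeline steps. The sketch below is only illustrative: check_request() is a hypothetical helper, not part of galah, and the class name "data_request" is assumed from the printed output above. It uses stand-in objects so it runs without galah or network access.

```r
# Hypothetical helper, not part of galah: stop early if a pipeline step
# did not return a data_request. The class name "data_request" is an
# assumption taken from the printed output ("An object of type `data_request`").
check_request <- function(x, step = "previous step") {
  if (!inherits(x, "data_request")) {
    stop(
      sprintf(
        "Output of %s has class <%s>, expected <data_request>.",
        step, paste(class(x), collapse = "/")
      ),
      call. = FALSE
    )
  }
  invisible(x)
}

# Stand-ins so the sketch runs without galah installed
mock_request <- structure(
  list(identify = data.frame(identifier = "id")),
  class = "data_request"
)
mock_filter <- data.frame(variable = "year", logical = "==", value = "2023")

# A real data_request passes through silently
check_request(mock_request, "galah_identify()")

# A bare data frame (like the filter output above) trips the guard
result <- try(check_request(mock_filter, "galah_filter()"), silent = TRUE)
inherits(result, "try-error")
```

Inside the wrapper, the guard could sit directly after the galah::galah_filter() step, where it would flag the tibble before galah::galah_select() fails with the less informative tidyselect error.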

fontikar commented 9 months ago

My current work-around (inspired by @shandiya) is to manually join the galah_ query "chunks" as a list:

# Create my own data_request object
create_data_request <- function(taxa, years){
  identify <- galah::galah_call() |> 
    galah::galah_identify(taxa)

  filter <- galah::galah_filter(
    spatiallyValid == TRUE, 
    species != "",
    decimalLatitude != "",
    year == years,
    basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  )

  select <- galah::galah_select(
    recordID, species, genus, family, decimalLatitude, decimalLongitude, 
    coordinateUncertaintyInMeters, eventDate, datasetName, basisOfRecord, 
    references, institutionCode, recordedBy, outlierLayerCount, isDuplicateOf,sounds
  )

  identify$filter <- filter
  identify$select <- select

  return(identify)
}

# Set inputs
taxa = "Orthoptera"
years = seq(1923, 2023)

create_data_request(taxa, years) |> galah::atlas_counts()
#> # A tibble: 1 × 1
#>   count
#>   <int>
#> 1 55936

Created on 2023-09-21 with reprex v2.0.2
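Because this work-around assigns the slots by hand, it may be worth confirming that nothing was left out before sending the request to the atlas. A minimal sketch, assuming the slot names shown in the printed data_request (identify, filter, select); missing_slots() is a hypothetical helper, not part of galah, and the object below is a stand-in rather than a real request.

```r
# Hypothetical helper, not part of galah: return the names of any expected
# slots that are absent from a list-like request object.
missing_slots <- function(request,
                          expected = c("identify", "filter", "select")) {
  setdiff(expected, names(request))
}

# Stand-in for the object built by create_data_request()
mock <- structure(
  list(identify = data.frame(identifier = "id")),
  class = "data_request"
)
mock$filter <- data.frame(variable = "year")

# Lists the slots that still need to be assigned
missing_slots(mock)
```

An empty result means all three slots are present; anything else names the slot that was not assigned.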

mjwestgate commented 9 months ago

Hi Fonti! Thanks heaps for this, and good to hear that galah is proving useful! We're working on the next release right now, and I think this has been solved already. That said, the next version isn't ready yet, so I don't have a fix that I can point you to immediately. Instead I'll put this on our work list and ping here with a solution once it's ready.

re: timelines, we should have something for you to try next week, and are aiming for release in a month or so. M