AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Better support for filtering out records with assertions #199

Closed daxkellie closed 2 months ago

daxkellie commented 1 year ago

At the moment, you can return records tagged with a given type of assertion (i.e. a data quality check) using galah_filter(), e.g.:

galah_call() |>
  galah_identify("passeriformes") |>
  galah_filter(assertions == "RECORDED_DATE_INVALID") |>
  atlas_counts()

However, you cannot filter out these records using galah_filter(assertions != "RECORDED_DATE_INVALID"), because a Solr query on assertions has to be built slightly differently from how galah_filter() normally builds Solr queries.

For example, this is the correct Solr query to filter out records with INVALID_SCIENTIFIC_NAME:

"-assertions:INVALID_SCIENTIFIC_NAME"

But this is how galah_filter() currently builds the query:

galah_filter(assertions != "INVALID_SCIENTIFIC_NAME")$query
#> [1] "-(assertions:\"INVALID_SCIENTIFIC_NAME\")"

This slight difference in parentheses and quoting is enough to stop these queries from working correctly.

I think it might be possible to support filtering out assertions by checking whether the assertions field has been used in galah_filter() and, if so, using a separate, bespoke method to build the correct assertions Solr queries.
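A minimal sketch of that idea, for illustration only. build_filter_query() is a hypothetical helper, not a galah function, and it only handles a single value with == or !=. The fallback branch copies the format of the galah_filter() output shown above; the form it produces for == is an assumption.

```r
# Hypothetical helper (not part of galah): build one Solr filter clause,
# with a bespoke branch for the `assertions` field.
build_filter_query <- function(field, op, value) {
  stopifnot(
    op %in% c("==", "!="),
    is.character(field), length(field) == 1,
    is.character(value), length(value) == 1
  )
  if (field == "assertions") {
    # Assertions: bare field:value, negated with a leading "-"
    prefix <- if (op == "!=") "-" else ""
    return(paste0(prefix, "assertions:", value))
  }
  # Other fields: parenthesised, quoted clause, as in the output shown above.
  # The un-negated form is assumed, not taken from galah.
  clause <- paste0("(", field, ":\"", value, "\")")
  if (op == "!=") paste0("-", clause) else clause
}

build_filter_query("assertions", "!=", "INVALID_SCIENTIFIC_NAME")
#> [1] "-assertions:INVALID_SCIENTIFIC_NAME"

build_filter_query("order", "!=", "psittaciformes")
#> [1] "-(order:\"psittaciformes\")"
```

How several assertions should be combined in one query is left out here, since that depends on how the API handles them.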

daxkellie commented 1 year ago

This seems to relate to issue #146

daxkellie commented 1 year ago

This is actually weirder than I originally thought. It looks like it works sometimes, but not all the time: you can't specify a taxon and add an assertion filter in the same query. So I'm not certain whether the issue is with galah_filter() or somewhere else.

library(galah)
#> 
#> Attaching package: 'galah'
#> The following object is masked from 'package:stats':
#> 
#>     filter

galah_call() |>
  galah_filter(assertions != "INVALID_SCIENTIFIC_NAME") |>
  galah_group_by(family) |>
  atlas_counts()
#> # A tibble: 6,563 × 2
#>    family         count
#>    <chr>          <int>
#>  1 Meliphagidae 8475943
#>  2 Artamidae    4467487
#>  3 Psittacidae  4263647
#>  4 Anatidae     3944557
#>  5 Columbidae   3235593
#>  6 Acanthizidae 3199128
#>  7 Cacatuidae   3035366
#>  8 Poaceae      2537315
#>  9 Rhipiduridae 2304547
#> 10 Fabaceae     2245147
#> # ℹ 6,553 more rows

galah_call() |>
  galah_identify("psittaciformes") |>
  galah_filter(assertions != "INVALID_SCIENTIFIC_NAME") |>
  galah_group_by(family) |>
  atlas_counts()
#> # A tibble: 1 × 1
#>   count
#>   <dbl>
#> 1     0

galah_call() |>
  galah_filter(order == "psittaciformes",
               assertions != "INVALID_SCIENTIFIC_NAME") |>
  galah_group_by(family) |>
  atlas_counts()
#> # A tibble: 1 × 1
#>   count
#>   <dbl>
#> 1     0

Created on 2023-06-29 with reprex v2.0.2

fontikar commented 9 months ago

I love {galah} and would love to see this feature implemented!

I had some old code where galah_filter(assertions != "INVALID_SCIENTIFIC_NAME") was possible, and even galah_filter(assertions != c("INVALID_SCIENTIFIC_NAME", "COORDINATE_INVALID")).

Just wanted to upvote this one :D

daxkellie commented 2 months ago

As of version 2.0.2, this feature works experimentally when querying the ALA. There is still some nuance to work out, because the API doesn't handle all assertions in exactly the same way, but it's a start!

library(galah)
#> galah: version 2.0.2
#> ℹ Default node set to ALA (ala.org.au).
#> ℹ See all supported GBIF nodes with `show_all(atlases)`.
#> ℹ To change nodes, use e.g. `galah_config(atlas = "GBIF")`.
#> Attaching package: 'galah'
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter

galah_call() |>
  identify("psittaciformes") |>
  galah_filter(assertions == "INVALID_SCIENTIFIC_NAME") |>
  galah_group_by(family) |>
  atlas_counts()
#> # A tibble: 2 × 2
#>   family      count
#>   <chr>       <int>
#> 1 Cacatuidae  10527
#> 2 Psittacidae  8023

galah_call() |>
  identify("psittaciformes") |>
  galah_filter(assertions != "INVALID_SCIENTIFIC_NAME") |>
  galah_group_by(family) |>
  atlas_counts()
#> # A tibble: 3 × 2
#>   family        count
#>   <chr>         <int>
#> 1 Psittacidae 4362018
#> 2 Cacatuidae  3102361
#> 3 Nestoridae       94

Created on 2024-04-12 with reprex v2.0.2