AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Select returning too few columns or empty tibbles #239

Closed T-LeB closed 1 month ago

T-LeB commented 4 months ago

Describe the bug
I have a brief code chunk (step 3) for downloading species occurrence data in full Darwin Core format. I wrote it and was using it without issue about 6 weeks ago, and in that time I have not updated R or made any other meaningful changes.

When I run it now, it returns a tibble with just 8 columns (recordID, scientificName, taxonConceptID, decimalLatitude, decimalLongitude, eventDate, occurrenceStatus, dataResourceName). I figured the default output of atlas_occurrences() must have changed, so I checked the documentation but couldn't find any information on it.

I tried again, specifying the columns I was after in atlas_occurrences() (see step 4).

This returns an empty tibble. I tried tweaking the galah_select() request and found that it works if I exclude the first three fields (datasetName, catalogNumber, recordedBy), as in step 5. It also comes back empty if any combination of those three is used together (e.g. step 6), but if I request any one of the three on its own it works fine (e.g. step 7).

No errors or warnings get thrown either.
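
Because the failure is silent, a small guard after each download could make it visible. This is a minimal sketch, not part of galah: `check_columns()` is a hypothetical helper that only assumes the result is a data frame or tibble, and the two data frames are made-up stand-ins for downloads.

```r
# Hypothetical helper (not part of galah): stop if a download came back
# empty or without some of the requested columns.
check_columns <- function(result, requested) {
  missing_cols <- setdiff(requested, names(result))
  if (ncol(result) == 0 || length(missing_cols) > 0) {
    stop(
      "Download is missing requested columns: ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(result)
}

# Made-up stand-ins for a successful and an empty download
ok    <- data.frame(datasetName = "a", catalogNumber = "b")
empty <- data.frame()

check_columns(ok, c("datasetName", "catalogNumber"))  # passes silently
res <- try(check_columns(empty, c("datasetName", "catalogNumber")), silent = TRUE)
inherits(res, "try-error")  # the empty result is caught
```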

I can work around this by running the different working combinations and joining the data, but I can't wrap my head around this behaviour.
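
One way the join could look, assuming each partial download also requests recordID so there is a key to join on (the post does not say how the pieces were joined). The two data frames below are made-up stand-ins for separate atlas_occurrences() results, and base R `merge()` is used only to keep the sketch dependency-free.

```r
# Made-up stand-ins for two separate downloads; in practice each would be
# the result of an atlas_occurrences() call that also selects recordID.
part_a <- data.frame(
  recordID = c("id-1", "id-2", "id-3"),
  datasetName = c("Herbarium A", NA, "Herbarium B")
)
part_b <- data.frame(
  recordID = c("id-3", "id-1", "id-2"),
  catalogNumber = c("C003", "C001", "C002")
)

# Join on the shared key; all = TRUE keeps records present in either part
combined <- merge(part_a, part_b, by = "recordID", all = TRUE)
combined
```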

galah version
Initially 2.0.1; the problem continued once I updated to 2.0.2.

To Reproduce
Steps to reproduce:

  1. library(galah)

  2. galah_config(email = "", verbose = FALSE)

  3. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences()

  4. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName, catalogNumber, recordedBy, scientificName, eventDate, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, verbatimLocality, occurrenceRemarks, habitat, eventRemarks, fieldNotes, basisOfRecord))

  5. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(scientificName, eventDate, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, verbatimLocality, occurrenceRemarks, habitat, eventRemarks, fieldNotes, basisOfRecord))

  6. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName, catalogNumber))

  7. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName))

Expected behaviour
I expect the initial call to return the full Darwin Core format data for the species, or galah_select() to return a full tibble with all the requested columns.

Additional context
I have double- and triple-checked that I haven't missed anything and that this isn't covered by existing issues, so sorry if I have missed something basic.

daxkellie commented 4 months ago

Hi @T-LeB, Thanks for reaching out about this issue. This is some serious super-sleuthing and I was able to replicate the issue that you described, so there is definitely something weird going on here.

We'll have to investigate this further to know exactly what's going on, but I suspect something is going wrong when parsing the 3 fields you omitted between steps 4 and 5 (datasetName, catalogNumber, recordedBy) when they come first in galah_select(). When I moved those three fields to the end of the list, things seemed to work correctly. What's odd is that (like you mentioned) moving other fields before scientificName didn't result in the same issue.

This isn't a solution to the underlying problem, but to get you up and running again, shifting those fields to the end should allow you to get all the columns you requested.

# Add datasetName, catalogNumber, recordedBy at the end of `select()`
sp_rec <- galah_call() |>
  identify("Swainsona viridis") |>
  select(scientificName, eventDate,
         decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters,
         verbatimLocality, occurrenceRemarks, habitat, eventRemarks,
         fieldNotes, basisOfRecord, datasetName, catalogNumber, recordedBy) |>
  atlas_occurrences()
#> Request for 96 occurrences placed in queue
#> Current queue length: 1
#> Downloading

sp_rec
#> # A tibble: 96 × 14
#>    scientificName    eventDate           decimalLatitude decimalLongitude
#>    <chr>             <dttm>                        <dbl>            <dbl>
#>  1 Swainsona viridis 2010-04-01 00:00:00           -31.6             140.
#>  2 Swainsona viridis 1925-10-24 00:00:00            NA                NA 
#>  3 Swainsona viridis NA                            -31.9             141.
#>  4 Swainsona viridis 1967-08-15 00:00:00           -32.6             140.
#>  5 Swainsona viridis NA                            -31.9             141.
#>  6 Swainsona viridis 2010-03-30 00:00:00           -31.7             140.
#>  7 Swainsona viridis NA                            -31.2             140.
#>  8 Swainsona viridis 1938-04-06 00:00:00           -30.7             139.
#>  9 Swainsona viridis 1923-08-22 00:00:00           -31.7             140.
#> 10 Swainsona viridis 1973-09-24 00:00:00            NA                NA 
#> # ℹ 86 more rows
#> # ℹ 10 more variables: coordinateUncertaintyInMeters <dbl>,
#> #   verbatimLocality <chr>, occurrenceRemarks <chr>, habitat <chr>,
#> #   eventRemarks <chr>, fieldNotes <chr>, basisOfRecord <chr>,
#> #   datasetName <chr>, catalogNumber <chr>, recordedBy <chr>
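
If the field list lives in a character vector, the reordering could be done programmatically. This is only a sketch of the workaround above, not a fix: `move_to_end()` is a hypothetical helper, and the field list is shortened for illustration.

```r
# Hypothetical helper: keep the original order, but move the given fields last
move_to_end <- function(fields, last) {
  c(setdiff(fields, last), intersect(last, fields))
}

fields <- c("datasetName", "catalogNumber", "recordedBy",
            "scientificName", "eventDate", "basisOfRecord")
reordered <- move_to_end(fields, c("datasetName", "catalogNumber", "recordedBy"))
reordered
```

The reordered vector could then be passed to the query with `select(all_of(reordered))`.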

T-LeB commented 4 months ago

Awesome, thanks Dax, that does sound even more confusing. But I really appreciate the workaround.

Acanthiza commented 1 month ago

Hi @daxkellie,

Have you had any luck with this issue? I'm experiencing a very similar problem, and I'm having no luck with the suggested workaround.

library("galah")
#> galah: version 2.0.2
#> ℹ Default node set to ALA (ala.org.au).
#> ℹ See all supported GBIF nodes with `show_all(atlases)`.
#> ℹ To change nodes, use e.g. `galah_config(atlas = "GBIF")`.
#> Attaching package: 'galah'
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter

galah_config(email = Sys.getenv("ALA_email")
             , download_reason_id = 10 # testing
             )

example_cols <- c("eventDate", "scientificName")

# example_cols are in 'fields'
example_cols %in% show_all("fields")$id
#> [1] TRUE TRUE

# Initiate a query
qry <- galah_call() |>
  identify("Swainsona viridis")

# Returns a tibble with ~ 100 records
qry |>
  select(group = "basic") |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> --
#> Downloading
#> # A tibble: 97 × 8
#>    recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
#>    <chr>          <chr>          <chr>                    <dbl>            <dbl>
#>  1 00b4e07f-66f4… Swainsona vir… https://id.bi…           -30.7             140.
#>  2 011eeca6-c623… Swainsona vir… https://id.bi…           -31.7             140.
#>  3 01842ded-071c… Swainsona vir… https://id.bi…            NA                NA 
#>  4 0490897c-a446… Swainsona vir… https://id.bi…            NA                NA 
#>  5 0b753479-40a1… Swainsona vir… https://id.bi…           -31.9             139.
#>  6 0c3a1c3d-dbda… Swainsona vir… https://id.bi…           -31.7             140.
#>  7 10d8312d-b99b… Swainsona vir… https://id.bi…           -30.8             143.
#>  8 116dd8ad-23a1… Swainsona vir… https://id.bi…           -31.6             139.
#>  9 13441a3b-4327… Swainsona vir… https://id.bi…           -31.8             139.
#> 10 19058b8c-a406… Swainsona vir… https://id.bi…           -31               139.
#> # ℹ 87 more rows
#> # ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#> #   dataResourceName <chr>

# No results when directly selecting names
qry |>
  select(c("eventDate", "scientificName")) |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0

# No results selecting names with all_of
qry |>
  dplyr::select(all_of(example_cols)) |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0

# No results via request_data
request_data(type = "occurrences") |>
  identify("Swainsona viridis") |>
  select(all_of(example_cols)) |>
  collect()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0

# Try swapping order as per https://github.com/AtlasOfLivingAustralia/galah-R/issues/239
example_cols <- example_cols[c(2, 1)]

# Still no results
qry |>
  dplyr::select(all_of(example_cols)) |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • scientificName
#> • eventDate
#> # A tibble: 0 × 0

# Try directly providing cols - still nothing
qry |>
  select(c("scientificName", "eventDate")) |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • scientificName
#> • eventDate
#> # A tibble: 0 × 0

Created on 2024-08-08 with reprex v2.0.2

daxkellie commented 1 month ago

Thanks for the reprex @Acanthiza. We haven't had the chance to investigate the source of this error yet. What I've noticed does fix the problems above, however, is including recordID in the query. I'm not entirely sure why, though...

library(galah)
galah_config(email = "your-email-here")

galah_call() |>
  galah::identify("Swainsona viridis") |>
  galah::select(recordID, eventDate, scientificName) |>
  atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> # A tibble: 97 × 3
#>    recordID                             eventDate           scientificName   
#>    <chr>                                <dttm>              <chr>            
#>  1 00b4e07f-66f4-4f60-88dd-e0ffec964678 1997-07-12 00:00:00 Swainsona viridis
#>  2 011eeca6-c623-4bb7-9447-6af513b43032 1923-08-22 00:00:00 Swainsona viridis
#>  3 01842ded-071c-4d05-93e8-46d0f29344bf NA                  Swainsona viridis
#>  4 0490897c-a446-4802-a8ce-c12230f3aee1 1973-09-29 00:00:00 Swainsona viridis
#>  5 0b753479-40a1-44ff-bf6b-ce815b0a062b 1930-08-26 00:00:00 Swainsona viridis
#>  6 0c3a1c3d-dbda-46ea-b7fa-a49150ebe3e4 2010-03-30 00:00:00 Swainsona viridis
#>  7 10d8312d-b99b-4e6a-b532-cb16f4ff22f5 2001-10-23 00:00:00 Swainsona viridis
#>  8 116dd8ad-23a1-4f86-a40f-0f0edf771749 1963-10-22 00:00:00 Swainsona viridis
#>  9 13441a3b-4327-444e-bedf-663dac3b3af0 2010-03-11 00:00:00 Swainsona viridis
#> 10 19058b8c-a406-4bd5-80e7-838197e2f716 NA                  Swainsona viridis
#> # ℹ 87 more rows

Created on 2024-08-08 with reprex v2.0.2

For now, hopefully this is a workaround. I'll try to get to this issue this week as it seems like it's easy to trigger but difficult to know how to fix.
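
For field lists held in a character vector (like `example_cols` in the reprex above), the same workaround could be wrapped in a small helper. A minimal sketch, assuming nothing about galah internals: `with_record_id()` is hypothetical and only manipulates the vector of field names.

```r
# Hypothetical helper: put recordID first unless it is already requested
with_record_id <- function(fields) {
  union("recordID", fields)
}

example_cols <- c("eventDate", "scientificName")
with_record_id(example_cols)
with_record_id(c("scientificName", "recordID"))  # no duplicate is added

# Intended use (needs network access and a configured email, so not run here):
# galah_call() |>
#   identify("Swainsona viridis") |>
#   select(all_of(with_record_id(example_cols))) |>
#   atlas_occurrences()
```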

daxkellie commented 1 month ago

@ZacPentecost Would you mind trying the same fix in Python? The two packages work very similarly, so I'm curious whether this solves your issue as well. If so, I think we'll have to update galah-python with the same fix at the same time.

Acanthiza commented 1 month ago

Thanks @daxkellie, including recordID in the select is working for me now, including as:

qry |>
  galah::select(recordID, example_cols)

ZacPentecost commented 1 month ago

Hi @daxkellie. I've had a play around and it looks like the error is still present in Python when applying a similar fix. I've run your fix in R by including recordID in the select and it looks to be working. Thank you.

daxkellie commented 1 month ago

Thanks for testing that out @ZacPentecost

Here's a bug that might need investigating in galah-python @acbuyan

daxkellie commented 1 month ago

The most recent commit appends recordID to a query if recordID is missing (as proposed in this thread), along with a more helpful message.

daxkellie commented 1 month ago

I actually realised that this error does not happen on dev regardless of my change, so I reverted to the previous behaviour (which does not append an additional column).