Closed T-LeB closed 1 month ago
Hi @T-LeB, Thanks for reaching out about this issue. This is some serious super-sleuthing and I was able to replicate the issue that you described, so there is definitely something weird going on here.
We'll have to investigate this further to know exactly what's going on, but I suspect there is something weird happening when parsing the 3 fields you omitted between steps 4 and 5 (datasetName, catalogNumber, recordedBy) when they come first in galah_select(). When I moved those three fields to the end of the list, things seemed to work correctly. What's weird is that (like you mentioned) moving other fields before scientificName didn't result in the same issue.
This isn't a solution to the underlying problem, but to get you up and running again, shifting those fields to the end should allow you to get all the columns you requested.
# Add datasetName, catalogNumber, recordedBy at the end of `galah_select()`
sp_rec <- galah_call() |>
  identify("Swainsona viridis") |>
  select(scientificName, eventDate,
         decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters,
         verbatimLocality, occurrenceRemarks, habitat, eventRemarks,
         fieldNotes, basisOfRecord, datasetName, catalogNumber, recordedBy) |>
  atlas_occurrences()
#> Request for 96 occurrences placed in queue
#> Current queue length: 1
#> Downloading
sp_rec
#> # A tibble: 96 × 14
#> scientificName eventDate decimalLatitude decimalLongitude
#> <chr> <dttm> <dbl> <dbl>
#> 1 Swainsona viridis 2010-04-01 00:00:00 -31.6 140.
#> 2 Swainsona viridis 1925-10-24 00:00:00 NA NA
#> 3 Swainsona viridis NA -31.9 141.
#> 4 Swainsona viridis 1967-08-15 00:00:00 -32.6 140.
#> 5 Swainsona viridis NA -31.9 141.
#> 6 Swainsona viridis 2010-03-30 00:00:00 -31.7 140.
#> 7 Swainsona viridis NA -31.2 140.
#> 8 Swainsona viridis 1938-04-06 00:00:00 -30.7 139.
#> 9 Swainsona viridis 1923-08-22 00:00:00 -31.7 140.
#> 10 Swainsona viridis 1973-09-24 00:00:00 NA NA
#> # ℹ 86 more rows
#> # ℹ 10 more variables: coordinateUncertaintyInMeters <dbl>,
#> # verbatimLocality <chr>, occurrenceRemarks <chr>, habitat <chr>,
#> # eventRemarks <chr>, fieldNotes <chr>, basisOfRecord <chr>,
#> # datasetName <chr>, catalogNumber <chr>, recordedBy <chr>
Awesome, thanks Dax, that does sound even more confusing. But I really appreciate the workaround.
Hi @daxkellie,
Have you had any luck with this issue? I'm experiencing a very similar problem, and I'm having no luck with the suggested workaround.
library("galah")
#> galah: version 2.0.2
#> ℹ Default node set to ALA (ala.org.au).
#> ℹ See all supported GBIF nodes with `show_all(atlases)`.
#> ℹ To change nodes, use e.g. `galah_config(atlas = "GBIF")`.
#> Attaching package: 'galah'
#>
#> The following object is masked from 'package:stats':
#>
#> filter
galah_config(email = Sys.getenv("ALA_email")
, download_reason_id = 10 # testing
)
example_cols <- c("eventDate", "scientificName")
# example_cols are in 'fields'
example_cols %in% show_all("fields")$id
#> [1] TRUE TRUE
# Initiate a query
qry <- galah_call() |>
identify("Swainsona viridis")
# Returns a tibble with ~ 100 records
qry |>
select(group = "basic") |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> --
#> Downloading
#> # A tibble: 97 × 8
#> recordID scientificName taxonConceptID decimalLatitude decimalLongitude
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 00b4e07f-66f4… Swainsona vir… https://id.bi… -30.7 140.
#> 2 011eeca6-c623… Swainsona vir… https://id.bi… -31.7 140.
#> 3 01842ded-071c… Swainsona vir… https://id.bi… NA NA
#> 4 0490897c-a446… Swainsona vir… https://id.bi… NA NA
#> 5 0b753479-40a1… Swainsona vir… https://id.bi… -31.9 139.
#> 6 0c3a1c3d-dbda… Swainsona vir… https://id.bi… -31.7 140.
#> 7 10d8312d-b99b… Swainsona vir… https://id.bi… -30.8 143.
#> 8 116dd8ad-23a1… Swainsona vir… https://id.bi… -31.6 139.
#> 9 13441a3b-4327… Swainsona vir… https://id.bi… -31.8 139.
#> 10 19058b8c-a406… Swainsona vir… https://id.bi… -31 139.
#> # ℹ 87 more rows
#> # ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#> # dataResourceName <chr>
# No results when directly selecting names
qry |>
select(c("eventDate", "scientificName")) |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0
# No results selecting names with all_of
qry |>
dplyr::select(all_of(example_cols)) |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0
# No results via request_data
request_data(type = "occurrences") |>
identify("Swainsona viridis") |>
select(all_of(example_cols)) |>
collect()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • eventDate
#> • scientificName
#> # A tibble: 0 × 0
# Try swapping order as per https://github.com/AtlasOfLivingAustralia/galah-R/issues/239
example_cols <- example_cols[c(2, 1)]
# Still no results
qry |>
dplyr::select(all_of(example_cols)) |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • scientificName
#> • eventDate
#> # A tibble: 0 × 0
# Try directly providing cols - still nothing
qry |>
select(c("scientificName", "eventDate")) |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> Warning: The following fields, requested in your query, were not downloaded:
#> • scientificName
#> • eventDate
#> # A tibble: 0 × 0
Created on 2024-08-08 with reprex v2.0.2
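Since the symptom in this thread is an empty or incomplete tibble, a small defensive check after each download may help catch it early. The sketch below uses only base R. `missing_fields()` is a hypothetical helper (it is not part of galah), and the empty data frame is a stand-in for a result returned by atlas_occurrences().

```r
# Hypothetical helper, not part of galah: report which requested
# fields are absent from a downloaded result.
missing_fields <- function(result, requested) {
  setdiff(requested, names(result))
}

# Stand-in for a download that came back with no columns
empty_result <- data.frame()
requested <- c("eventDate", "scientificName")

# Lists the requested fields that did not come back
missing_fields(empty_result, requested)
```

A non-empty return value means at least one requested column is missing, which could be turned into a stop() or warning() in a longer script.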
Thanks for the reprex @Acanthiza. We haven't had the chance to investigate the source of this error yet. What I've noticed does fix the problems above, however, is adding recordID to the query. I'm not entirely sure why, though...
library(galah)
galah_config(email = "your-email-here")
galah_call() |>
galah::identify("Swainsona viridis") |>
galah::select(recordID, eventDate, scientificName) |>
atlas_occurrences()
#> Request for 97 occurrences placed in queue
#> Current queue length: 1
#> Downloading
#> # A tibble: 97 × 3
#> recordID eventDate scientificName
#> <chr> <dttm> <chr>
#> 1 00b4e07f-66f4-4f60-88dd-e0ffec964678 1997-07-12 00:00:00 Swainsona viridis
#> 2 011eeca6-c623-4bb7-9447-6af513b43032 1923-08-22 00:00:00 Swainsona viridis
#> 3 01842ded-071c-4d05-93e8-46d0f29344bf NA Swainsona viridis
#> 4 0490897c-a446-4802-a8ce-c12230f3aee1 1973-09-29 00:00:00 Swainsona viridis
#> 5 0b753479-40a1-44ff-bf6b-ce815b0a062b 1930-08-26 00:00:00 Swainsona viridis
#> 6 0c3a1c3d-dbda-46ea-b7fa-a49150ebe3e4 2010-03-30 00:00:00 Swainsona viridis
#> 7 10d8312d-b99b-4e6a-b532-cb16f4ff22f5 2001-10-23 00:00:00 Swainsona viridis
#> 8 116dd8ad-23a1-4f86-a40f-0f0edf771749 1963-10-22 00:00:00 Swainsona viridis
#> 9 13441a3b-4327-444e-bedf-663dac3b3af0 2010-03-11 00:00:00 Swainsona viridis
#> 10 19058b8c-a406-4bd5-80e7-838197e2f716 NA Swainsona viridis
#> # ℹ 87 more rows
Created on 2024-08-08 with reprex v2.0.2
For now, hopefully this is a workaround. I'll try to get to this issue this week as it seems like it's easy to trigger but difficult to know how to fix.
@ZacPentecost Would you mind trying the same fix in Python? The two packages work very similarly, so I'm curious whether this solves your issue as well. If so, I think we'll have to update galah-python with the same fix at the same time.
Thanks @daxkellie , including recordID in the select is working for me now, including as
qry |>
galah::select(recordID, example_cols)
Hi @daxkellie. I've had a play around and it looks like the error is still present using python when applying a similar fix. I've run your fix in R by including recordID in the select and it looks to be working - thank you.
Thanks for testing that out @ZacPentecost
Here's a bug that might need investigating in galah-python @acbuyan
The most recent commit appends recordID to a query if recordID is missing (as proposed as a solution in this thread), along with a more helpful message.
I actually realised that this error does not happen on dev regardless of my change, so I reverted to the previous behaviour (which does not append an additional column).
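For anyone still on an affected release, the behaviour described for that commit (add recordID when it is missing, and say so) could be approximated on the user side. This is a rough sketch only, not the actual galah code: `ensure_record_id()` is a hypothetical name, it works on a plain character vector of field names, and placing recordID first is an assumption based on the workaround above.

```r
# Hypothetical sketch, not the galah implementation: make sure
# "recordID" is among the requested fields, and tell the user
# if it had to be added.
ensure_record_id <- function(fields) {
  if ("recordID" %in% fields) {
    return(fields)
  }
  message("Adding recordID to the requested fields.")
  c("recordID", fields)
}

example_cols <- c("eventDate", "scientificName")
cols <- ensure_record_id(example_cols)
```

The resulting vector can then be passed to select(all_of(cols)), in the same way as in the reprex earlier in this thread.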
Describe the bug
I have a brief code chunk (step 3) for downloading species occurrence data in full Darwin Core format. I wrote it and was using it without issue about 6 weeks ago, and in that time I have not updated R or made any other meaningful changes.
When I run it now it returns a tibble with just 8 columns (recordID, scientificName, taxonConceptID, decimalLatitude, decimalLongitude, eventDate, occurrenceStatus, dataResourceName). I figured the default output of atlas_occurrences() must have changed and checked the documentation, but couldn't find any info on it.
I tried again, specifying the columns I was after in atlas_occurrences() (see step 4).
This returns an empty tibble. I tried tweaking the galah_select() request and found it would work if I excluded the first three column headers (datasetName, catalogNumber, recordedBy) (step 5). It also comes back empty if any combination of those three is used together (e.g. step 6), but if I run it on any of those three headers individually it works fine (e.g. step 7).
No errors or warnings get thrown either.
I can work around this by running the different functional combinations and joining the data, but I can't wrap my head around this behaviour.
galah version
Initially in 2.0.1; continued once I updated to 2.0.2.
To Reproduce
Steps to reproduce:
1. library(galah)
2. galah_config(email = "", verbose = FALSE)
3. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences()
4. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName, catalogNumber, recordedBy, scientificName, eventDate, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, verbatimLocality, occurrenceRemarks, habitat, eventRemarks, fieldNotes, basisOfRecord))
5. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(scientificName, eventDate, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, verbatimLocality, occurrenceRemarks, habitat, eventRemarks, fieldNotes, basisOfRecord))
6. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName, catalogNumber))
7. sp_rec <- galah_call() |> galah_identify("Swainsona viridis") |> atlas_occurrences(select = galah_select(datasetName))
Expected behaviour
I expect the initial call to return the full Darwin Core format data for the species, or I expect galah_select() to return a full tibble with all the columns requested.
Additional context
I have tried to double- and triple-check that I've not missed anything and that it's not covered by existing issues, so sorry if I have missed something basic.
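The join-based workaround mentioned in the report could be sketched as follows. It assumes each partial download also includes recordID so that there is a key to join on (the calls in the steps above do not select it). The two data frames are made-up stand-ins for partial results, and base R merge() is used to avoid extra dependencies.

```r
# Made-up stand-ins for two partial downloads that share recordID
part_a <- data.frame(
  recordID = c("a1", "b2", "c3"),
  scientificName = "Swainsona viridis"
)
part_b <- data.frame(
  recordID = c("a1", "b2", "c3"),
  datasetName = c("Dataset X", NA, "Dataset Y")
)

# Join the partial results back into one table, keeping every record
# even if it appears in only one of the two downloads
sp_rec <- merge(part_a, part_b, by = "recordID", all = TRUE)
```

Using all = TRUE keeps records that appear in only one of the partial downloads, which seems the safer default when the downloads themselves are behaving unpredictably.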