Open tpoisot opened 11 months ago
I dropped an example in https://github.com/tpoisot/InterpretableSDMWithJulia/pull/13 but maybe would be easier to paste in the thread.
A few extra notes:
duckdb
can do this remote access on http urls as well, not just S3 (arrow
does some similar stuff but without full SQL and only for S3). It also works with CSV files, not just parquet, and the read method can take a vector of file paths. This is nice, for instance, because with NEON on Google Cloud Storage now, we can have a remote connection to all the csv files in some data product without downloading a single one. And of course you can do authenticated S3 access when the data can't be re-distributed (looking at you, eBird)
using DuckDB
# create a new in-memory database
con = DBInterface.connect(DuckDB.DB, ":memory:")
# create a table
DBInterface.execute(con, "INSTALL httpfs;")
DBInterface.execute(con, "LOAD httpfs;")
DBInterface.execute(con,
"CREATE OR REPLACE TEMPORARY VIEW gbif
AS SELECT * FROM
read_parquet('s3://gbif-open-data-us-east-1/occurrence/2023-10-01/occurrence.parquet/**')
;")
# Note that we can do aribtrary SQL operations, including
# geoparquet spatial queries, with minimal RAM footprint,
# over the full gbif
results = DBInterface.execute(con,
"SELECT *
FROM gbif
WHERE
(scientificname = 'Procyon lotor') AND
(decimallatitude BETWEEN -80.0 AND -58.501) AND
(decimallongitude BETWEEN 41.0 AND 51.999)")
I definitely need to read more about this, but if it is a viable alternative to the streaming API, I'm all for it. Do you know if there's a similar endpoint for the taxonomy?
Looks like that's the only one they post to AWS open data, but we could ask the GBIF team about that. The current taxonomic backbone (and other snapshots) can be downloaded as a tsv
already from https://hosted-datasets.gbif.org/datasets/backbone/current/ -- if it weren't in a zip archive we could hit it directly from duckdb (gz compression would have been fine).
For now I just dropped that copy on my MINIO server so I think this should work:
DBInterface.execute(con,
"CREATE OR REPLACE TEMPORARY VIEW taxon
AS SELECT * FROM
read_csv('https://minio.carlboettiger.info/gbif/Taxon.tsv',
delim='\\t', header=true, quote='', auto_detect=true);")
taxon = DBInterface.execute(con, "SELECT * FROM Taxon")
Tagging @cboettig - would you be open to sharing the code snippet you showed me after the workshop?