PoisotLab / SpeciesDistributionToolkit.jl

Work with species distributions in Julia
https://poisotlab.github.io/SpeciesDistributionToolkit.jl/
Other
13 stars 2 forks source link

DuckDB #205

Open tpoisot opened 11 months ago

tpoisot commented 11 months ago

Tagging @cboettig - would you be open to sharing the code snippet you showed me after the workshop?

cboettig commented 11 months ago

I dropped an example in https://github.com/tpoisot/InterpretableSDMWithJulia/pull/13 but maybe would be easier to paste in the thread.

A few extra notes:


using DuckDB
# create a new in-memory database
con = DBInterface.connect(DuckDB.DB, ":memory:")

# create a table
DBInterface.execute(con, "INSTALL httpfs;")
DBInterface.execute(con, "LOAD httpfs;")

DBInterface.execute(con, 
"CREATE OR REPLACE TEMPORARY VIEW gbif 
 AS SELECT * FROM 
 read_parquet('s3://gbif-open-data-us-east-1/occurrence/2023-10-01/occurrence.parquet/**')
 ;")

# Note that we can do aribtrary SQL operations, including
# geoparquet spatial queries, with minimal RAM footprint,
# over the full gbif

results = DBInterface.execute(con, 
"SELECT *
FROM gbif
WHERE
  (scientificname = 'Procyon lotor') AND
  (decimallatitude BETWEEN -80.0 AND -58.501) AND
  (decimallongitude BETWEEN 41.0 AND 51.999)")
tpoisot commented 11 months ago

I definitely need to read more about this, but if it is a viable alternative to the streaming API, I'm all for it. Do you know if there's a similar endpoint for the taxonomy?

cboettig commented 11 months ago

Looks like that's the only one they post to AWS open data, but we could ask the GBIF team about that. The current taxonomic backbone (and other snapshots) can be downloaded as a tsv already from https://hosted-datasets.gbif.org/datasets/backbone/current/ -- if it weren't in a zip archive we could hit it directly from duckdb (gz compression would have been fine).

For now I just dropped that copy on my MINIO server so I think this should work:


DBInterface.execute(con, 
"CREATE OR REPLACE TEMPORARY VIEW taxon 
AS SELECT * FROM 
read_csv('https://minio.carlboettiger.info/gbif/Taxon.tsv', 
         delim='\\t', header=true, quote='',  auto_detect=true);")

taxon = DBInterface.execute(con, "SELECT * FROM Taxon")