This repo is an example of using DuckDB and dbt to build a dataset. The code is organized into three stages to show a progression from simple queries to a modern data pipeline.
A toy dataset called Farm To Market was created for this repo: a dataset about local agriculture. Farm To Market captures where local food is sold, and where it might be grown, in New York City. It combines the locations of farmers markets and potential farms (community gardens) to highlight the availability, and the potential local suppliers, of healthy produce in NYC.
> [!WARNING]
> This dataset's design is in progress, and the data dictionary below may not represent the final columns of a build.
### `farm_to_market` (table)

Each row is a market and farm pair.

- `market_name`
- `farm_name`
- `suitability_score`
- `distance_ft`
- `line_geometry_wgs84`
### `markets` (table)

Each row is a farmers market.

- `market_name`
- `accepts_ebt`
- `distributes_health_bucks`
- `open_year_round`
- `address`
- `borough`
- `latitude`
- `longitude`
- `geometry_is_in_nyc`
- `point_geometry_wgs84`
### `farms` (table)

Each row is a potential farm.

- `type`
- `farm_name`
- `area_sqft`
- `whole_lot`
- `address`
- `borough`
- `bbl`
- `latitude`
- `longitude`
- `point_geometry_wgs84`
- `polygon_geometry_wgs84`
Source data:
Set up a Python virtual environment named `.venv`, either with

```bash
python -m venv .venv
```

or with the VS Code command `Python: Create Environment`.

Activate the virtual environment:

```bash
source .venv/Scripts/activate
```

Install packages and confirm the setup:

```bash
python -m pip install --requirement setup/requirements.txt
pip list
```
> [!IMPORTANT]
> For NYC government employees: most work computers access the internet through a firewall, which causes issues when writing and running code (e.g. `pip install` fails). Follow the instructions below to allow code to access the internet on a city-issued PC.
>
> 1. Search for and select `Edit environment variables for your account` in the Start menu.
> 2. Add two environment variables, `http_proxy` and `https_proxy`, each with the value `http://bcpxy.nycnet:8080`.
Run a python script to confirm everything is set up:

```bash
python -m stage_0.sanity_check
```
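A sanity check like this typically just confirms the interpreter version and that the required packages are importable. A hypothetical sketch (the actual checks in `stage_0/sanity_check.py` may differ, and the package names in the `__main__` block are assumptions):

```python
# Hypothetical sketch of an environment sanity check; the real
# stage_0.sanity_check script's checks may differ.
import importlib.util
import sys


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


def missing_packages(required: list[str]) -> list[str]:
    """Return the subset of `required` packages that are not installed."""
    return [pkg for pkg in required if not is_installed(pkg)]


if __name__ == "__main__":
    # "duckdb" and "dbt" are assumptions based on this repo's description
    missing = missing_packages(["duckdb", "dbt"])
    print(f"python {sys.version_info.major}.{sys.version_info.minor}")
    print("missing packages:", missing or "none")
```

If the script prints any missing packages, re-run the `pip install` step above inside the activated virtual environment.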
## Load and explore data from various sources

These datasets were chosen to show some of the ways source data can be imported with DuckDB.
Download PLUTO from NYC Open Data by navigating to `Actions` -> `API` -> `Download file` -> `Export format: CSV`. Rename the downloaded CSV file to `pluto.csv` and move it to `data/source_data/`.
Run a python script to download the other two datasets programmatically and load all source data into a database:

```bash
python -m stage_1.load
```
Use the Jupyter notebook `stage_1/explore.ipynb` to explore the data.
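The load pattern itself is simple: read a CSV and create a database table from it. As a hypothetical sketch, here it is with the stdlib `sqlite3` and `csv` modules standing in for DuckDB (the real `stage_1.load` uses DuckDB, whose Python API can also ingest CSVs directly; all file, table, and column names below are made up for illustration):

```python
# Hypothetical sketch of the CSV-to-table load pattern, using stdlib
# sqlite3 as a stand-in for DuckDB. The real stage_1.load script and
# its filenames/tables will differ.
import csv
import sqlite3
from pathlib import Path


def load_csv(con: sqlite3.Connection, csv_path: Path, table: str) -> int:
    """Create `table` from a CSV file and return the number of rows loaded."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE "{table}" ({cols})')
        rows = list(reader)
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)
```

With DuckDB the same step collapses to a single SQL statement over the file, which is part of why it is convenient for this kind of pipeline.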
## Use a data pipeline to build Farm To Market
Download all source data from their Open Data pages by navigating to `Actions` -> `API` -> `Download file` -> `Export format: CSV`.

> [!TIP]
> NYC Borough Boundaries must be downloaded as a GeoJSON file by navigating to `Export` -> `Download Geospatial Data` -> `GeoJSON`.
Move the downloaded CSV files to `data/source_data/`, then run a python script to load all source data into a database:

```bash
python -m stage_2.load
```
(Optional) Use the Jupyter notebook `stage_2/explore.ipynb` to explore the source data.
Run python scripts to transform and export data:

```bash
python -m stage_2.transform
python -m stage_2.export
```
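One transform the data dictionary implies is computing `distance_ft` between a market point and a farm point. The real transform may well do this in SQL with DuckDB's spatial functions, but as a hypothetical illustration, a great-circle distance in feet from latitude/longitude pairs looks like:

```python
# Hypothetical illustration of deriving a distance_ft value from two
# WGS84 points; the repo's actual transform logic may differ.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_FT = 20_902_231  # mean Earth radius (~6371 km) expressed in feet


def distance_ft(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance in feet between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_FT * asin(sqrt(a))
```

For the short market-to-farm distances in one city, a great-circle approximation like this is more than accurate enough.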
Use the Jupyter notebook `stage_2/analyze.ipynb` to review and analyze the dataset.
## Use dbt to build Farm To Market
Install dbt packages and confirm setup:

```bash
dbt deps
dbt debug
```
Test source data:

```bash
dbt test --select "source:*"
```
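This command runs the tests declared on sources in the project's YAML properties files. As a hypothetical illustration only (the repo's actual source names, tables, and tests will differ), a sources file might look like:

```yaml
# Hypothetical example of dbt source tests; not this repo's actual config.
version: 2

sources:
  - name: source_data
    tables:
      - name: markets
        columns:
          - name: market_name
            tests:
              - not_null
              - unique
```

Declaring tests on sources lets you fail fast on bad input data before any models are built.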
Build the dataset:

```bash
dbt build
```