jaanli opened 1 year ago:

Hi @jwills ! Thank you for this amazing work!! I've used it to teach datathinking.org at the University of Tartu and @PrincetonUniversity.

However, I'm getting stuck switching from a local CSV file to a remote one; here is an example model file: https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/new_york_city_311_calls.sql

This duckdb code works in notebooks such as this one: https://nbviewer.org/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb

It takes a large CSV file of NYC 311 phone calls (https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD) and exports it to a parquet file (https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet) before visualization.

I added the output of dbt run using the above model here: https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/dbt_output/new_york_city_311_calls.duckdb

This .duckdb database is empty. I'm not sure how to debug this -- any ideas?

Thank you so much!! (This course is free & open access for anyone to learn data thinking with the help of GPT and the latest AI tools; we work with a variety of community organizations as well, so any help goes a long way :)
Hey @jaanli, thanks for the nice note and for the teaching you do! So I cloned and ran the project you linked to locally without any issues afaict:
jwills@Joshs-MBP dev % git clone git@github.com:onefact/data_build_tool_for_datathinking.org.git
Cloning into 'data_build_tool_for_datathinking.org'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 41 (delta 9), reused 33 (delta 5), pack-reused 0
Receiving objects: 100% (41/41), 17.93 KiB | 1.99 MiB/s, done.
Resolving deltas: 100% (9/9), done.
jwills@Joshs-MBP dev % cd data_build_tool_for_datathinking.org
jwills@Joshs-MBP data_build_tool_for_datathinking.org % ls
LICENSE README.md datathinking dbt_output
jwills@Joshs-MBP data_build_tool_for_datathinking.org % cd datathinking
jwills@Joshs-MBP datathinking % ls
README.md analyses dbt_project.yml macros models profiles.yml seeds snapshots tests
jwills@Joshs-MBP datathinking % dbt debug
16:51:47 Running with dbt=1.5.0
16:51:47 dbt version: 1.5.0
16:51:47 python version: 3.11.3
16:51:47 python path: /opt/homebrew/opt/python@3.11/bin/python3.11
16:51:47 os info: macOS-13.4-arm64-arm-64bit
16:51:47 Using profiles.yml file at /Users/jwills/dev/data_build_tool_for_datathinking.org/datathinking/profiles.yml
16:51:47 Using dbt_project.yml file at /Users/jwills/dev/data_build_tool_for_datathinking.org/datathinking/dbt_project.yml
16:51:47 Configuration:
16:51:47 profiles.yml file [OK found and valid]
16:51:47 dbt_project.yml file [OK found and valid]
16:51:47 Required dependencies:
16:51:47 - git [OK found]
16:51:47 Connection:
16:51:47 database: new_york_city_311_calls
16:51:47 schema: main
16:51:47 path: /tmp/new_york_city_311_calls.duckdb
16:51:52 Connection test: [OK connection ok]
16:51:52 All checks passed!
jwills@Joshs-MBP datathinking % dbt run
16:51:57 Running with dbt=1.5.0
16:51:57 Unable to do partial parsing because saved manifest not found. Starting full parse.
16:51:57 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.datathinking.example
16:51:57 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 315 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
16:51:57
16:51:57 Concurrency: 1 threads (target='dev')
16:51:57
16:51:57 1 of 1 START sql view model main.new_york_city_311_calls ....................... [RUN]
16:51:57 1 of 1 OK created sql view model main.new_york_city_311_calls .................. [OK in 0.04s]
16:51:57
16:51:57 Finished running 1 view model in 0 hours 0 minutes and 0.13 seconds (0.13s).
16:51:57
16:51:57 Completed successfully
16:51:57
16:51:57 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
jwills@Joshs-MBP datathinking % ls -ltr /tmp
lrwxr-xr-x@ 1 root wheel 11 May 12 15:29 /tmp -> private/tmp
jwills@Joshs-MBP datathinking % ls -ltr /tmp/new_york_city_311_calls.duckdb
-rw-r--r-- 1 jwills wheel 274432 Jun 23 09:51 /tmp/new_york_city_311_calls.duckdb
jwills@Joshs-MBP datathinking % duckdb /tmp/new_york_city_311_calls.duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────┐
│ name │
│ varchar │
├─────────────────────────┤
│ new_york_city_311_calls │
└─────────────────────────┘
Note that the DuckDB database file I am loading is the one at /tmp/new_york_city_311_calls.duckdb, which is the path specified in profiles.yml on the master branch right now, so the first thing to check is whether the view of that CSV file you created ended up in that file instead. Assuming that isn't the issue (i.e., you have locally updated your profiles.yml to point at the Dropbox directory you showed in your message), I would want to know a couple of things:
1) What version of DuckDB you're running with (both for the CLI and for your Python installation, since they aren't necessarily the same),
2) Whether the dbt run call works correctly if the path is set to /tmp/new_york_city_311_calls.duckdb instead of (what I have to assume is) a Dropbox-managed directory.
I'm not aware of any issues with writing to DuckDB database files in Dropbox-managed directories, but it's totally possible there is some conflict there that is messing things up for you.
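By the way, a quick way to confirm which database file a DuckDB CLI session is actually attached to (this is plain DuckDB, nothing project-specific):

D PRAGMA database_list;

That prints the name and file path of each attached database, which makes it easy to verify you're querying the file your profiles.yml points at.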
Oh interesting, thank you so much @jwills !
Yup I see what you mean, my bad on being unclear in the original description.
I think I got to that stage too.
The specific issue, now that you've helped clarify what may be going on: very little of the data from https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD seems to be present in the /tmp/new_york_city_311_calls.duckdb database file that appears after executing dbt run.
For example, I see the same output you do for show tables;, but I can't select any rows (this hangs indefinitely, at least 10-20 minutes of waiting so far!):
me@laptop ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking main duckdb /tmp/new_york_city_311_calls.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────┐
│ name │
│ varchar │
├─────────────────────────┤
│ new_york_city_311_calls │
└─────────────────────────┘
D SELECT * FROM new_york_city_311_calls LIMIT 10;
Further, https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD is 18 gigabytes, but /tmp/new_york_city_311_calls.duckdb is 268 kilobytes, so something seems to be off here (or, more likely, I'm not understanding how dbt or duckdb work, after having taught them and related concepts for over a semester!! lol :S):
me@laptop ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking main ls -lh /tmp/new_york_city_311_calls.duckdb
-rw-r--r-- 1 me wheel 268K Jun 24 12:48 /tmp/new_york_city_311_calls.duckdb
Is that helpful to maybe try pinpointing the issue a little further? Appreciate your help!!! Feels like we're getting closer 🍡
Yeah, I noticed that your default materialization in the project was set to the dbt default, which is a view. That explains the small size of the DuckDB file you're creating in the project, as it is only storing the query in the DB, not the actual data. If you want it to pull all of the data over to your local instance, you need to configure the model to be materialized as a table.
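One minimal way to do that is a config block at the top of the model's .sql file (a sketch; the select is just a stand-in for whatever your model already does):

-- materialize this model as a physical table instead of a view
{{ config(materialized='table') }}

select * from read_csv_auto('path/or/url/to/rows.csv')  -- your existing query

Setting materialized: table for the model in dbt_project.yml has the same effect.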
The indefinite hang doesn't shock me either-- I just copy-pasted that nyc url into a browser, and it kicked off a download process that needs to pull down the 18GB before the query will be able to run against it. You'd almost certainly be better off pulling that data down once and storing it using another external source (like S3) which can start streaming the data to you immediately when a query is run.
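For example, a one-time conversion with the DuckDB CLI looks something like this (a sketch; the filenames are placeholders for wherever you put the download):

D COPY (SELECT * FROM read_csv_auto('311_rows.csv')) TO '311_rows.parquet' (FORMAT PARQUET);

After that, the model can select from the Parquet file (locally, or from S3 via the httpfs extension) instead of re-parsing the 18GB CSV on every run.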
Hope that helps a bit!
Thank you so much @jwills! That helps and I'm getting closer.
Changing the line in dbt_project.yml from materialized: view to materialized: table makes it run longer:
me@laptop ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking main dbt run
15:55:41 Running with dbt=1.5.0
15:55:41 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 313 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
15:55:41
15:55:41 Concurrency: 4 threads (target='dev')
15:55:41
15:55:41 1 of 1 START sql table model main.service_requests ............................. [RUN]
16:44:46 1 of 1 OK created sql table model main.service_requests ........................ [OK in 2944.05s]
16:44:46
16:44:46 Finished running 1 table model in 0 hours 49 minutes and 4.12 seconds (2944.12s).
16:44:46
16:44:46 Completed successfully
16:44:46
16:44:46 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
However, it is missing over 90% of the rows: there should be 33M rows instead of 2.5M:
me@laptop ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking main duckdb /tmp/service_requests.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D SELECT COUNT(*) FROM service_requests;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 2538812 │
└──────────────┘
D
The .duckdb file is also much smaller, as expected given the missing rows:
me@laptop ~/dropbox/projects/data_build_tool_for_datathinking.org/datathinking main ls -lh /tmp/service_requests.duckdb
-rw-r--r-- 1 me wheel 189M Jul 7 12:44 /tmp/service_requests.duckdb
If I run wget https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.csv then it downloads the entire 18GB file, and if I execute this duckdb query I get a 2.1GB parquet file instead of the 189M file above (https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet, created with https://nbviewer.org/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb).
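As another sanity check, counting rows straight off the downloaded CSV with DuckDB (a sketch, using the filename from the wget above) should show whether all ~33M rows survive parsing:

D SELECT COUNT(*) FROM read_csv_auto('cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.csv');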
Any idea how to debug this? I might be missing something basic again -- really grateful for your help! (Context: we're working with 4 nonprofits, some national, all will be using this stack so this infra is already going a long way!!)
Mmm, okay-- still guessing here, but I'd be suspicious of the ignore_errors=True line in the read_csv config in this model (assuming that this is upstream of whatever the service_requests model is doing): a setting like that makes me think that if DuckDB couldn't parse one of the lines of the CSV file for some reason (most likely a typing issue), then it would simply ignore it, leading to a smaller resulting data set.
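One way to test that theory (a sketch, not your exact model): read every column as plain text so nothing can fail type inference, and compare counts:

D SELECT COUNT(*) FROM read_csv_auto('rows.csv', all_varchar=true);

If that count is much closer to 33M than what you get with ignore_errors=True plus inferred types, the gap is rows DuckDB is silently skipping, and the fix is to declare explicit column types (or read as varchar and cast downstream) rather than ignoring errors.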
Yay! That was it, thank you @jwills !
Here's an end-to-end example: https://github.com/onefact/data_build_tool_for_datathinking.org/blob/main/datathinking/models/cityofnewyork.us/service_requests.sql leads to https://public.datathinking.org/cityofnewyork.us%2Fservice_requests.parquet
Code example of using this downstream for data analysis, visualization (and soon AI/ML :):
Really grateful for the help!!