coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx
Other
41 stars 11 forks source link

Postgres error when trying to do a window function on a staging table #253

Closed jrlegrand closed 5 months ago

jrlegrand commented 5 months ago

Problem Statement

In my branch jrlegrand/ndc-descriptions, I'm trying to apply a window function to a staging table.

Link to staging model on branch: https://github.com/coderxio/sagerx/blob/jrlegrand/ndc-descriptions/dbt/sagerx/models/staging/rxnorm_historical/stg_rxnorm_historical__ndcs.sql

The staging table converts JSON to columns and I think it has nearly 1 million rows.

The thing I'm adding is a window function to try to flag the most recent NDC->RXCUI mapping since there could be multiple associations over time.

I get an error saying there's no space left on device...

SELECT * FROM staging.stg_rxnorm_historical__ndcs

Error

ERROR:  could not write to file "base/pgsql_tmp/pgsql_tmp559408.9": No space left on device
SQL state: 53100

image

Criteria for Success

The query above runs to completion.

Additional Information

I tried putting the window function in an intermediate model initially but also ran into this issue.

This is part of some user-requested work to get NDC descriptions for as may NDCs as possible.

jrlegrand commented 5 months ago

here's this in case it helps image

jrlegrand commented 5 months ago

OK so I ran the query and kept doing df -h in the terminal while it was running. Looks like this query is chewing through like 26GB of space for some reason...

image

jrlegrand commented 5 months ago

I'm guessing it's related to this... storing as JSONB vs JSON https://stackoverflow.com/questions/74826517/is-it-true-that-jsonb-may-take-more-disk-space-than-json-in-postgresql-and-why

@Bridg109 - thoughts?

jrlegrand commented 5 months ago

Changing to JSONB didn't seem to help... Though all I did was change the column type in pgAdmin. I'm starting from scratch but changing the data time in the Airflow DAG that gets saved to the database initially and seeing if that helps.

jrlegrand commented 5 months ago

So I figured it out. I split this into two staging tables (one for the JSON parsing and one for the window function) as opposed to doing the JSON parsing in a cte and the window function after the cte all in the same staging model. It went from taking 15 minutes and eating up like 25+ GB of space for whatever it was doing - to taking like a second.