Closed jrlegrand closed 5 months ago
here's this in case it helps
OK so I ran the query and kept doing df -h
in the terminal while it was running. Looks like this query is chewing through like 26GB of space for some reason...
I'm guessing it's related to this... storing as JSONB vs JSON https://stackoverflow.com/questions/74826517/is-it-true-that-jsonb-may-take-more-disk-space-than-json-in-postgresql-and-why
@Bridg109 - thoughts?
Changing to JSONB didn't seem to help... Though all I did was change the column type in pgAdmin. I'm starting from scratch but changing the data time in the Airflow DAG that gets saved to the database initially and seeing if that helps.
So I figured it out. I split this into two staging tables (one for the JSON parsing and one for the window function) as opposed to doing the JSON parsing in a cte and the window function after the cte all in the same staging model. It went from taking 15 minutes and eating up like 25+ GB of space for whatever it was doing - to taking like a second.
Problem Statement
In my branch
jrlegrand/ndc-descriptions
, I'm trying to apply a window function to a staging table.Link to staging model on branch: https://github.com/coderxio/sagerx/blob/jrlegrand/ndc-descriptions/dbt/sagerx/models/staging/rxnorm_historical/stg_rxnorm_historical__ndcs.sql
The staging table converts JSON to columns and I think it has nearly 1 million rows.
The thing I'm adding is a window function to try to flag the most recent NDC->RXCUI mapping since there could be multiple associations over time.
I get an error saying there's no space left on device...
SELECT * FROM staging.stg_rxnorm_historical__ndcs
Error
Criteria for Success
The query above runs to completion.
Additional Information
I tried putting the window function in an intermediate model initially but also ran into this issue.
This is part of some user-requested work to get NDC descriptions for as may NDCs as possible.