Pipeline performance as described by medicare data coverage

NSAPH-Data-Processing / zip2county_master_xwalk

Pipeline to create master crosswalk from ZIP codes to counties, using crosswalk tables from HUD


Pipeline performance as described by medicare data coverage #23

Closed: jckitch closed this issue 1 month ago

audiracmichelle commented 1 month ago

Here is the information about the data you will need for this analysis:

path: /n/dominici_nsaph_l3/Lab/parquet_dw/dorieh/medicare_schema
files: enroll_yyyy.parquet

columns: bene_id, year, state, zip, curec, hmo, hmo_indicators, hmo_cvg_count, buyin_indicators, dual_indicators, buyin_cvg_count, dual_cvg_count, buyin, dual

audiracmichelle commented 1 month ago

@jckitch since the data is stored in parquet, reading it should be pretty fast (whereas csv would be extremely slow). Let me know if that is not the case.

You can read parquet in R using the arrow package (pyarrow is the Python equivalent). There is also duckdb for R. With duckdb you can query just the columns you need directly from all the files, without looping a read_parquet call over each one. For example: "SELECT zip FROM 'enroll*.parquet'"
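As a rough illustration of the duckdb approach, here is a minimal sketch in Python (the same SQL string should also work from the R duckdb client). It assumes the files follow the enroll_yyyy.parquet naming above and that the goal is a count of distinct ZIP codes per year; the function names `build_zip_count_query` and `zip_counts_by_year` are hypothetical, and the aggregation is just one plausible query, not necessarily what the analysis needs.

```python
from pathlib import PurePosixPath

# Directory quoted in the thread; the glob below assumes files are
# named enroll_<year>.parquet inside it.
DATA_DIR = "/n/dominici_nsaph_l3/Lab/parquet_dw/dorieh/medicare_schema"


def build_zip_count_query(data_dir: str) -> str:
    """Build one SQL statement that scans every enrollment file at once.

    Only the `year` and `zip` columns are referenced, so duckdb does not
    need to read the other columns from the parquet files.
    """
    pattern = PurePosixPath(data_dir) / "enroll_*.parquet"
    return (
        "SELECT year, COUNT(DISTINCT zip) AS n_zips "
        f"FROM read_parquet('{pattern}') "
        "GROUP BY year ORDER BY year"
    )


def zip_counts_by_year(data_dir: str = DATA_DIR):
    """Run the query and return a list of (year, n_zips) tuples."""
    # Imported here so the query builder works without duckdb installed.
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()  # in-memory database, nothing is written to disk
    try:
        return con.execute(build_zip_count_query(data_dir)).fetchall()
    finally:
        con.close()
```

Calling `zip_counts_by_year()` on a machine with access to the data directory would return one row per year. Because the glob is resolved inside duckdb, there is no explicit loop over files.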

audiracmichelle commented 1 month ago

The notebook was added to a private repo because it uses Medicare data: https://github.com/NSAPH-Data-Processing/mbsf_mortality_denom/blob/main/notes/medicare_xwalk_stats.ipynb