jckitch closed this issue 1 month ago
@jckitch since the data is stored in parquet, reading should be fast (whereas CSV would be extremely slow). Let me know if that is not the case.
You can read parquet in R using the `arrow` package (the Python equivalent is `pyarrow`). There is also a `duckdb` package for R. With duckdb you can query just the columns you need across all the files at once, without looping over `read_parquet` calls, e.g. `SELECT zip FROM 'enroll*.parquet'`.
The notebook was moved into a private repo since it uses Medicare data: https://github.com/NSAPH-Data-Processing/mbsf_mortality_denom/blob/main/notes/medicare_xwalk_stats.ipynb
Here is the information about the data that you will need for this analysis:

columns: `bene_id`, `year`, `state`, `zip`, `curec`, `hmo`, `hmo_indicators`, `hmo_cvg_count`, `buyin_indicators`, `dual_indicators`, `buyin_cvg_count`, `dual_cvg_count`, `buyin`, `dual`