CMSgov / price-transparency-guide

The technical implementation guide for the tri-departmental price transparency rule.

Questions re: in_network schema #699

Closed felix-hh closed 10 months ago

felix-hh commented 11 months ago

Hi!

I have been developing a data pipeline to process in_network MRF files over the past month. After a lot of trouble, I have managed to build something that ingests a file in ~5h on my Mac (the file is 4GB compressed, 150GB uncompressed). For your reference, this is the Aetna file that I used.

The high compression ratio (150GB -> 4GB, roughly 37x) indicates that there is a lot of unnecessary redundancy in the original data. This translates into higher (maybe 30x?) processing time, since the data needs to be decompressed and instantiated as objects in the language of your choice before it can be ingested. A more economical representation would lead to smaller objects and much better processing time.
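For anyone who wants to reproduce the ratio on their own files, here is a minimal sketch that streams a gzip file in chunks and compares the byte counts without loading it into memory. It assumes the MRF is gzip-compressed. The function names and the sample record are hypothetical and only for illustration; they are not part of the schema or of my pipeline.

```python
import gzip
import json
import os


def compression_ratio(path, chunk_size=1 << 20):
    """Return (compressed_bytes, uncompressed_bytes, ratio) for a gzip file.

    The file is decompressed in fixed-size chunks, so memory use stays
    flat even when the uncompressed size is far larger than RAM.
    """
    compressed = os.path.getsize(path)
    uncompressed = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            uncompressed += len(chunk)
    return compressed, uncompressed, uncompressed / compressed


def write_redundant_sample(path, n_records=10_000):
    """Write a toy gzip file made of one record repeated n_records times.

    The record is made up for illustration: it is a flat stand-in, not a
    valid in_network object. Returns the uncompressed size in bytes.
    """
    record = {
        "billing_code": "99213",
        "billing_class": "professional",
        "negotiated_rate": 100.0,
    }
    payload = json.dumps([record] * n_records).encode("utf-8")
    with gzip.open(path, "wb") as fh:
        fh.write(payload)
    return len(payload)
```

Running `compression_ratio` on the toy file written by `write_redundant_sample` gives a very high ratio, which is the same effect, in miniature, as repeated objects in a real file.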

During the process, some questions came up about choices in the current schema that, IMO, are driving this inefficiency:

Based on #244, it looks like you have already optimized to reduce redundancy in the past by nesting prices on rates. These questions are about further optimization.

I have implemented these changes in my “clean” version of the dataset for downstream use. If I am wrong, I look forward to learning the reasons for these design choices. I imagine this cannot be changed without a new major version of the schema.

Also, I thoroughly admire the work that has gone into putting this schema and the transparency tools together and getting insurers to comply. If there is anything I can do to help, or if I can elaborate on these issues, let me know.