Hi! I have been developing a data pipeline to process in_network MRF files over the last month. After a lot of trouble I have managed to develop something that ingests a file in ~5h on my Mac (the file is 4GB compressed, 150GB uncompressed). For your reference, this is the Aetna file that I used.
The high compression ratio (150GB -> 4GB) indicates that there is a lot of unnecessary redundancy in the original data. This translates into higher (maybe 30x?) processing time, since the data needs to be decompressed and instantiated as objects in the language of your choice before being ingested. Data economy would lead to smaller objects and much better processing time.
During the process, some questions came up about choices in the current schema that, IMO, are driving this inefficiency:
Why is service data part of the negotiated_price object? This causes a lot of redundancy.
By service data, I mean service_code, modifiers, etc.
In this file there are 538,296,072 unique service-provider_group-price combinations, but only 835,189 unique services: the same service data is repeated many times over.
We could instead have an array of “service” objects nested in the in_network object, defining service codes and other service metadata, and then, nested under each service, have rate objects (like the current ones, but where negotiated_price is just a number with no other metadata) defining the rate and the provider_group_ids that correspond to it. It is true that this makes it harder to link one provider group to multiple services at once, but it allows linking one service to many provider groups at once, which is a better tradeoff: there are only about 8 services associated with an in_network object, but there are many more provider groups!
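To make this concrete, here is a rough sketch of the shape I have in mind, written as a Python literal (the field names services, rates, and provider_group_ids are placeholders of my own, not existing schema fields):

```python
# Sketch only: service data is defined once per in_network item, and each rate
# carries just a number plus the provider group ids it applies to.
proposed_in_network_item = {
    "negotiation_arrangement": "ffs",
    "billing_code": "12345",
    "services": [
        {
            "service_code": [22, 23, 24],            # defined once, not repeated per price
            "billing_code_modifier": ["26"],
            "rates": [
                {
                    "negotiated_rate": 123.45,        # just a number, no service metadata
                    "provider_group_ids": [1, 2, 3],  # every group that shares this rate
                },
            ],
        },
    ],
}
```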
Why is service_code an array of strings?
Service codes are defined here and range from 0 to 100, where 0 is ‘CSTM-00’ in this schema. First, this could be an array of ints, already saving space. But secondly, it could instead be replaced by two 64-bit integers functioning as a bitmask, where the ith bit is set if service code i is in the array.
Imagine replacing the redundant arrays of service codes with just two integers amenable to bitwise operations. This dataset is for a technical audience that would appreciate going from slow-to-process arrays of ~500 bits to a 128-bit pair of integers.
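As a rough sketch of what I mean (helper names are mine and purely illustrative; this assumes the codes have already been mapped to integers 0-100):

```python
# Sketch: packing a service_code array into two 64-bit words used as a bitmask.

def pack_service_codes(codes: list[int]) -> tuple[int, int]:
    """Return (low, high): bit i of the 128-bit pair is set iff code i is present."""
    mask = 0
    for code in codes:
        mask |= 1 << code            # codes are 0..100, so the mask fits in 128 bits
    return mask & (2**64 - 1), mask >> 64

def has_service_code(low: int, high: int, code: int) -> bool:
    """Membership test with a single bitwise AND instead of scanning an array."""
    mask = (high << 64) | low
    return bool(mask & (1 << code))

low, high = pack_service_codes([0, 22, 81])
assert has_service_code(low, high, 22) and not has_service_code(low, high, 23)
```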
(minor) Why use floating point for negotiated_rate?
In finance, integers are usually used for dollar amounts, expressed in cents (unofficial reference), since floats i) can be imprecise, ii) are slower to process, and iii) have some gotchas, for example when parsing. In this dataset negotiated_rate can also be a percentage, but the point stands: the percentage can be reported as basis points by multiplying by 100, to a good level of precision (I would assume!).
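For illustration, a minimal sketch of what parsing into integers could look like (function names are mine; this assumes the raw value arrives as a decimal string):

```python
from decimal import Decimal

# Sketch: negotiated_rate parsed into integers instead of floats.
# Dollar amounts become cents; percentages become basis points.

def dollars_to_cents(raw: str) -> int:
    """'123.45' -> 12345, avoiding float round-off while parsing."""
    return int(Decimal(raw) * 100)

def percent_to_basis_points(raw: str) -> int:
    """'12.5' (percent) -> 1250 basis points."""
    return int(Decimal(raw) * 100)

assert dollars_to_cents("19.99") == 1999      # float("19.99") is 19.989999..., a parsing gotcha
assert percent_to_basis_points("12.5") == 1250
```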
Based on #244, it looks like you have already optimized to reduce redundancy in the past by nesting prices on rates. These questions are about further optimization.
I have implemented these changes in my “clean” version of the dataset for downstream use. I look forward to learning the reasons for these design choices if I am wrong. I imagine this cannot be changed without a new major version of the schema.
Also, I thoroughly admire the work that has gone into putting this schema and these transparency tools together and getting insurers to comply. If there is anything I can do to help, or if I can elaborate on these issues, let me know.