Medical-Event-Data-Standard / meds

Schema definitions and Python types for Medical Event Data Standard, a standard for medical event data such as EHR and claims data
Apache License 2.0

[Proposal] Usability: Change from double -> single nesting #20

Closed: Miking98 closed this issue 2 months ago

Miking98 commented 3 months ago

The current structure of MEDS has Patient => Event => Measurements. I would propose changing it to just Patient => Events

Arguments in favor:

  1. Writing labeling functions is cleaner -- you just loop through all events, rather than doing a double for loop (see the sketch after this list).
  2. It's simpler to keep all the information about a clinical event in a single object, rather than dispersed across two objects.
  3. Single nesting speeds up GROUP BY / AGG in Polars by a significant amount (~50x)
  4. It's conceptually a lot simpler to just have one level of hierarchy rather than two -- I'm not sure what the double nesting adds conceptually?
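
To make the first point concrete, here is a rough sketch of what a simple labeling check looks like under each layout (the field names follow the code examples later in this thread and are illustrative, not a finalized schema):

  # Double nested (current MEDS): Patient => Event => Measurement
  def has_code_double(patient, target_code):
      for event in patient['events']:
          for measurement in event['measurements']:
              if measurement['code'] == target_code:
                  return True
      return False

  # Single nested (proposed): Patient => Measurement, each carrying its own timestamp
  def has_code_single(patient, target_code):
      for measurement in patient['measurements']:
          if measurement['code'] == target_code:
              return True
      return False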
EthanSteinberg commented 3 months ago

I would 100% agree, if only for the conceptual simplicity.

The initial reason for double nesting was to improve performance.

However, this seems to not matter much anymore. It only seems to be a 4% improvement for double nesting.

Here are benchmark numbers for double nesting vs single nesting when counting the number of codes in the MIMIC-IV demo dataset

Single nested: 131 MB RAM, 6.7 MB disk, 38.74 seconds

Double nested: 125 MB RAM, 7.1 MB disk, 37.21 seconds

EthanSteinberg commented 3 months ago

The benchmarking code for this was quite simple:

  import datasets  # huggingface datasets

  dataset = datasets.Dataset.from_parquet("mimiciv_demo/data/*")

  num_codes = 0
  for patient in dataset:
      for event in patient['events']:
          for measurement in event['measurements']:
              num_codes += 1

I ran this 3 times on an isolated machine and took the median time.

EthanSteinberg commented 2 months ago

https://github.com/Medical-Event-Data-Standard/meds/issues/23 is an example of how the double nesting results in more convoluted code than otherwise

mmcdermott commented 2 months ago

Here are some counterarguments to consider as we come to a final decision:

  1. In my opinion, the double nesting is actually conceptually simpler. I think this is more of a personal thing, and I don't use this to argue that only one perspective on complexity is right, but rather to point out that there are different notions of complexity and there will not be one right answer here. To illustrate why the double nesting is conceptually simpler for me: my mental model of these healthcare processes is that a patient interacts with the health system at some point in time (an event), at which time a collection of observations are made (measurements), which more naturally aligns with a double nesting structure.
  2. I think the double nesting is easier to convert to single-nesting in practice than vice versa. In particular, if someone wants to use the data in a hierarchical format (as, for example, ESGPT does, and many historical models used a hierarchical format as well to reduce maximum sequence length), then if the data starts in a single-nested format, the user needs to do a nested group-by within the patient events to account for the fact that timestamps may not be well ordered within a patient's events. Even if we assume the timestamps will always be well ordered, they still need to do a more complex reduction over the timestamps to group the variable-length sets of observations with the same timestamp together. In contrast, going from double to single nesting is trivial by comparison. I know that right now, the use cases with a hierarchical format are fewer than those without one, but that doesn't mean they will stay that way. A lot of other processes we may want to do over data also more naturally happen at the level of a list of unique timestamps, such as identifying task cohorts among patients in a manner relying on temporal constraints.
  3. I am not 100% convinced about the performance implications yet. Obviously, the right way to measure this is empirically, and @EthanSteinberg's results are a great start, but I think it warrants a bit more thought. In particular, just from the stupid simple perspective of data replication, the single nesting format should have a big impact on total storage size. In some older profiling I did for ESGPT, the number of measurements per event in MIMIC-IV was on average ~37. Each measurement is one categorical code (a uint 32) and to be most conservative we'll assume a numerical value too (a float32). If we assume a 32 bit datetime, then going from double to single nesting means that each measurement goes from requiring (32 + 32 + 32/37) bits for the code, value, and the share of each measurement's event timestamp to (32 * 3), which is approximately a 50% increase in the footprint of the data. Of course, in many settings data are stored under various kinds of compression, and if you need to do your final processing in a way that is flattened anyways, you'll eventually eat that cost, but from a simple mathematical perspective I'm concerned about adopting as standard practice something that inflates the possible storage size by this amount. I'm certainly not a systems expert, so maybe I'm missing something obvious here, but this is definitely part of my hesitation.
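
To spell out the arithmetic in point 3, here is a rough back-of-the-envelope check (a sketch only; the ~37 measurements-per-event figure comes from the ESGPT profiling mentioned above):

  BITS_CODE = 32        # uint32 code
  BITS_VALUE = 32       # float32 numerical value
  BITS_TIME = 32        # 32-bit datetime
  MEAS_PER_EVENT = 37   # average measurements per event in MIMIC-IV (from the profiling above)

  # double nested: the event timestamp is stored once and amortized over its measurements
  double_nested_bits = BITS_CODE + BITS_VALUE + BITS_TIME / MEAS_PER_EVENT
  # single nested: every measurement repeats its own timestamp
  single_nested_bits = BITS_CODE + BITS_VALUE + BITS_TIME

  print(single_nested_bits / double_nested_bits)  # ~1.48, i.e. roughly a 50% larger uncompressed footprint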

In terms of what I'd advocate we assess before we come to a final decision here, there are a few things:

  1. I think we should expand the benchmarks (and have some established benchmarks/profiling functions, really) here so we can be confident in any performance implications across relevant use cases. This will help us not only on this issue but in the future too. These should cover both usage in models as well as the implications on pre-processing pipelines of various kinds (in particular in translating between single or doubly nested formats).
  2. We should outline in a concrete roadmap the kinds of use cases we are initially planning for and any existing efforts that fall into those categories. This will help us communicate better with the community in general the intended usage here and help us understand the relative import of this change for what use cases are planned.

My thinking is that both of these changes would be good for MEDS overall and would help make this issue clearer, so I would advocate that we do those before we finalize this decision, if others think that sounds reasonable.

EthanSteinberg commented 2 months ago

Thanks for the response. I have some comments about some of those points.

In my opinion, the double nesting is actually conceptually simpler. I think this is more of a personal thing

I think there are actually really strong arguments that single nesting is conceptually simpler. It's one less field and one less conceptual entity in general. Even more importantly, it's one less invariant in the schema that people need to track. Right now our documentation tells people that there should only really be one event per timestamp, but we can't easily enforce that invariant and it can get easily confused / lost. You can see this confusion in practice with https://github.com/Medical-Event-Data-Standard/meds/issues/23.

I think the double nesting is easier to convert to single-nesting in practice than vice versa

The conversion functions in both directions are O(n) functions that can be done in a single utility function call. The "easiness" of conversion is not important because conversion is just a cheap function call away.
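
For illustration, both directions are single linear passes (a rough sketch; the field names are assumed from this thread, and the single-to-double direction assumes measurements are already sorted by time):

  import itertools

  def single_to_double(patient):
      # group the flat measurement list into events by timestamp (one O(n) pass)
      events = []
      for time, group in itertools.groupby(patient['measurements'], key=lambda m: m['time']):
          events.append({'time': time, 'measurements': list(group)})
      return {'patient_id': patient['patient_id'], 'events': events}

  def double_to_single(patient):
      # flatten events back out, copying each event's timestamp onto its measurements
      measurements = []
      for event in patient['events']:
          for m in event['measurements']:
              measurements.append({**m, 'time': event['time']})
      return {'patient_id': patient['patient_id'], 'measurements': measurements}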

In particular, just from the stupid simple perspective of data replication, the single nesting format should have a big impact on total storage size.

In practice, it seems to have minimal impact because of the additional storage considerations of other metadata. Single digit increases in efficiency are not worth huge complexity costs. Not to mention that double nesting decreases storage efficiency of compressed data (as it's a bit harder to compress a double nested format well compared to a single nested format).

If we care about efficiency here, we should focus on at least double digit performance increases like better code handling.


I think we should expand the benchmarks (and have some established benchmarks/profiling functions, really) here so we can be confident in any performance implications across relevant use cases. This will help us not only on this issue but in the future too. These should cover both usage in models as well as the implications on pre-processing pipelines of various kinds (in particular in translating between single or doubly nested formats).

I don't see why that would be necessary. Operations that have to touch every code in the timeline are the most expensive (and most common) operations, and counting the codes is a best-case scenario for double nesting vs single nesting: it is a cheap operation, so the runtime is driven entirely by the overhead of the representations. If double nesting's improvement there is so tiny, I don't see how you are going to see much benefit in other settings.

We should outline in a concrete roadmap the kinds of use cases we are initially planning for and any existing efforts that fall into those categories. This will help us communicate better with the community in general the intended usage here and help us understand the relative import of this change for what use cases are planned.

I think we have already done this? The concrete use cases we care about are running labeling functions, generating tabular features, and generating features for neural networks.

mmcdermott commented 2 months ago

I think we should expand the benchmarks (and have some established benchmarks/profiling functions, really) here so we can be confident in any performance implications across relevant use cases. This will help us not only on this issue but in the future too. These should cover both usage in models as well as the implications on pre-processing pipelines of various kinds (in particular in translating between single or doubly nested formats).

I don't see why that would be necessary. Operations that have to touch every code in the timeline are the most expensive (and most common) operations, and counting the codes is a best-case scenario for double nesting vs single nesting: it is a cheap operation, so the runtime is driven entirely by the overhead of the representations. If double nesting's improvement there is so tiny, I don't see how you are going to see much benefit in other settings.

@EthanSteinberg, are you really opposed to having a set of benchmarks we can use persistently, independently of this issue? Or are you just saying that you don't want to block the closing of this issue on the establishment of such benchmarks? I think (a) benchmarks would be valuable in general, and in addition I think (b) that, assuming we will have benchmarks, it makes sense to wait to close this issue until they are complete so we can evaluate the strategies under them. But (a) and (b) are separate points, and I'm not sure whether you're objecting to only one or to both.

EthanSteinberg commented 2 months ago

@mmcdermott I'm saying that we already have the benchmark we need for this issue.

And that we have a strong incentive to move quicker rather than slower on a major schema thing like this.

EthanSteinberg commented 2 months ago

If we need additional benchmarking here to make a conclusion, let's write down a precise list of what benchmarks we need and I'll run them.

mmcdermott commented 2 months ago

I created #24 to track the question of if and what benchmarking would be needed. Ultimately I feel more advanced benchmarking would be helpful here, and of those things discussed in #24, I think critical would be (a) assessing what you have assessed already under different dataset properties (e.g., # of measurements per event, as I think that is the parameter that controls the possible efficiency penalty here) and dataset sizes and (b) assessing the impact on more realistic usage patterns, in particular in characterizing the cost of transforming data between single and double nested formats and in characterizing the cost of other operations on single or double nested formats (e.g., I would expect operations like adding time-derived measures like age and time-of-day to favor the double nested format and operations like ontology expansion to maybe favor the single-nested format, but I don't have a good sense of the expected magnitudes). I'd also be very interested in (c) understanding how this impacts eventual tensorization and GPU iteration costs, but I understand that is less relevant ultimately to this decision because that is always going to be model-dictated, not standard dictated, whereas the implication on other pre-processing steps is something where the raw format will have a bigger impact.

EthanSteinberg commented 2 months ago

I agree and disagree a bit with your proposed experiment list.

Let's simplify things a bit by numbering them.

a)

  1. Efficiency as a function of # measurements per event

I agree with this one, I think it is valuable.

  2. Efficiency as a function of the dataset size

This one doesn't make any sense to me. All of our costs should be proportional to the size of the dataset. There is no way this could have an impact here.

b)

  3. Cost of transforming back and forth

I don't think we need to measure the cost of going from double to single nested; we only need to measure single to double, as we already know double is slightly more efficient.

  4. Other operations

We only need to measure this if the transformation cost (b3) is significant. And I highly doubt that cost will be significant, as you can always transform back to double and run the double algorithm if that one is faster.

c)

  5. Other operations on GPU

Same answer as 4: irrelevant if b3 is a minor cost.


From this, I would propose the following additional experiments to show that single nesting causes only a marginal decrease in efficiency.

Measure two things:

Under two experimental conditions:

EthanSteinberg commented 2 months ago

@mmcdermott

Ok. I now have all the results of the experiments you requested, and they appear to back my claim that the overhead from single nested is very minimal.

https://github.com/EthanSteinberg/meds_benchmark is my benchmark code repository.

I have taken the opportunity to make a couple of improvements from before. In particular, I removed the python startup time from my measurements and increased the run count to 30 runs per program to reduce the influence of noise.

https://docs.google.com/spreadsheets/d/1sYn7RGkk8O1ra_NKgNcOUN2oeX2wPKwtTMBmqJLH3_E/edit#gid=0 contains my result table.

We are measuring on two datasets, the real MIMIC demo, and a semi-artificial version of MIMIC demo where every event has the same timestamp.

Summary:

Single nested has a RAM overhead of about 5%, with no (or negative) disk overhead compared to double nested for both the real data and the data with only one timestamp value. (The disk values are different from before because I used a different library to generate the compressed files, but the single nested vs double nested comparison should be valid).

The results are a bit more complex for the timing experiments.

Real data: Single nested has a 5% CPU overhead compared to double nested for simple iteration. Converting from single nested to double nested has a very minor overhead (2% CPU).

Semi-artificial data with only one timestamp: The single nested now has an overhead of 12% CPU. This is expected as the double nested is more efficient when there are fewer timestamps. Converting from single nested to double nested has roughly zero overhead (less than 1% CPU).

I believe this addresses all your concerns from a), b) and c) that you listed in your desired experiments. In particular for b) and c) we now know that the overhead is bounded between 5% and 7% for real data. Even in a very extreme case, where everything has the same timestamp, the overhead is still only 12%.

It's also important to note that these overheads are overestimates. In real code there will be more compute in the inner for loops so the influence of this overhead will be much smaller in practice.

Are there any other experiments you want to see?

mmcdermott commented 2 months ago

Thanks for expanding the benchmarks and humoring my desire for greater investigation here @EthanSteinberg. It is definitely looking like moving to single nesting would make more sense, but there are two additional things I'd want to see first @EthanSteinberg.

  1. The more important one is that I am concerned about one additional confounder related to data; namely, how much additional metadata is stored within the measurements. In particular, from my naive perspective outlined above, adding timestamps to all measurements should be expensive; but, if measurements are storing a bunch of additional properties simultaneously, then this is no longer true -- then storing timestamps with all measurements becomes a pittance relative to the rest of the measurement. This is important for us to understand -- it may not change the outcome of this decision, because we might decide that we expect the predominant usage to feature large, if varied, measurement properties, but even so we should understand it, because a lot of processing will be able to safely ignore measurement properties, and if those operations would feature a big performance hit when in a single nested format, that is something we should know. To that end, it would be worth either (1) stating that the measurements have no or minimal additional properties in the benchmarks you've run or (2) figuring out what the #s are in the case that we don't include additional properties.
  2. The style of benchmarks you have included so far covers only a small subset of the kinds of transforms I expect people to use. The only relevant thing to add for this issue in particular is characterizing how per-timepoint operations differ from per-code operations. In particular, things like calculating a patient's age, recording the time-of-day of an event, and computing statistics over the inter-event time distribution can all happen at the level of unique timepoints, not at the level of codes (see the sketch below).
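
As a rough illustration of the per-timepoint vs per-code distinction (field names assumed from this thread, with times assumed to be datetime objects):

  def ages_at_events_double(patient, birth_date):
      # double nested: one age computation per unique event time
      return [(event['time'] - birth_date).days for event in patient['events']]

  def ages_at_events_single(patient, birth_date):
      # single nested: the unique times have to be recovered from the flat measurement list first
      unique_times = sorted({m['time'] for m in patient['measurements']})
      return [(t - birth_date).days for t in unique_times]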

As a different note -- given the minimal impact on performance you are seeing here, is it worth actually moving to no nesting? E.g., having a totally flat structure of events which contain patient IDs directly alongside timestamps and code/value? To be clear, I still see the double nested format as the simplest, conceptually, but I recognize that of those on the issue so far at least (and I suspect more broadly) I am in the minority, so perhaps it is worth considering moving to an even more flat structure?

Finally, independent of this issue, for the benchmark system in general I think there is some merit in controlling for dataset size and such in a more granular way than you have done here. This need not gate our decision here, to be clear, but I wanted to highlight that it is fully possible that performance will differ both qualitatively and quantitatively at different data scales, and we should be wary of that.

mmcdermott commented 2 months ago

Also as we approach finalizing a decision here, tagging @tompollard @rvandewater @Jwoo5 so you all can weigh in on this too. @Miking98 and @jason-fries obviously more comments from you would be welcome but I took your prior indications on this thread to indicate support for this more generally.

EthanSteinberg commented 2 months ago

I also want to make an important point to emphasize why adding double nesting for a 5-7% improvement isn't worth it.

There are so many other and better opportunities for improving performance than paying the complexity price for double nesting.

Let's give an example:

All the experiments above were run using huggingface datasets.

If we use femr's optimized data loader instead, the runtime drops by a factor of about 40.

import collections
import femr_cpp

code_counts = collections.defaultdict(int)

def process_patient(patient):
    for measurement in patient['measurements']:
        code_counts[measurement['code']] += 1

femr_cpp.process_patients("mimic_demo", process_patient)
print(sorted(list(code_counts.items()))[:10])

The time of that operation goes down to 0.77 seconds (instead of 30.63 for double nested with huggingface datasets).

mmcdermott commented 2 months ago

Well, that is a much more significant improvement, and I see and agree with your point that there are higher priority things to focus on in terms of performance. But, as MEDS doesn't include femr_cpp (so we can't advocate for its use as the supported use path) and as this is the core of the data standard, I think it is worth understanding the impact in a variety of expected usage patterns before making a final decision. I'll say again that the complexity argument is not going to be a universal one -- i.e., not everyone will think that single nested is less complex conceptually than double nested, as that is a question of opinion and not fact -- so regardless of which way we go some users will be inconvenienced or confused. I recognize that my take on complexity (favoring double nesting) is likely the minority view, but as it is not universally one-sided, I think it is worth profiling the costs thoroughly, especially because building such profiles is something we want anyways a la #24

jason-fries commented 2 months ago

I vote for single nested.

mmcdermott commented 2 months ago

@jason-fries do we have numbers for the ETL complications? If those are still present, that would both (1) be something that should definitely be included in benchmarks and (2) be conclusive evidence in favor of single nesting here. To my knowledge based on a hazy recollection from one of our calls, I thought that the polars concerns that @Miking98 mentioned in the issue description were not fully accurate, so I was assuming that those slowdowns were not in play here -- if they are, that would definitely impact my recommendation.

Also, should we go even farther, if the impact on ETLs is that large -- to a fully unnested format, where we collapse out over patients as well?

rvandewater commented 2 months ago

Right now, I also favor single nesting. I have to add that I have not yet worked extensively with FEMR or MEDS in my pipeline. This might change soon, and then I can weigh in more.

scottfleming commented 2 months ago

I've been heavily using both MEDS and FEMR on STARR-OMOP (3M patients) and private data (~50k patients at a time, on the order of MIMIC-scale).

Also, should we go even farther, if the impact on ETLs is that large -- to a fully unnested format, where we collapse out over patients as well?

I would strongly advocate for the fully un-nested format, where we effectively have a single (potentially sharded) table with each row as a measurement. Ignoring theoretical O(*) arguments here, the fact is that many tools (duckdb, polars, SQL, snowflake, PySpark) are all built and optimized for this kind of data representation.

Take, for example, duckdb's larger-than-memory out-of-core processing capabilities. They explicitly state that aggregate functions such as list (which is required for single- or double-nesting) are not supported.

Additionally, if you're using a data warehouse backend like Snowflake or BigQuery, then wrangling fully un-nested data and just trading money for speed is easy. If, instead, conversion to MEDS leads to a list type in a column (at the level of patients for single-nesting, and additionally at the level of events for double-nesting), you essentially throttle the speed of any post-processing steps by potentially orders of magnitude.

Strongly in favor of completely un-nested.

mmcdermott commented 2 months ago

I would be equally fine with any of the three formats in the case that major performance discrepancies exist favoring that format, and it sounds like completely unnested fits that bill, given ETLs (which are likely the most expensive processes) will favor that representation.

It sounds like the decision steps here are to get some ETL examples in the shared benchmark referenced in #24 so that those #s can contribute here?

I'll note that I think going to completely unnested would complicate our planned model for some other processing steps, but given that we can easily provide converters between any of the formats, which folks can use where desired, that does not seem like it should be higher priority than the implications the default format has on high-impact performance scenarios, so it doesn't dissuade me from thinking that completely unnested sounds like the way to go here.

@EthanSteinberg do you have any objections to completely unnested?

tompollard commented 2 months ago

Just to add some quick thoughts:

mmcdermott commented 2 months ago

@scottfleming if we go fully unnested, how would you recommend storing the static_measurements? Should those just be as normal measurements but with timestamp set to null? That seems simplest to me, though (as with other things in the fully de-nested state) it makes some aspects later more complex and less conceptually aligned.

EthanSteinberg commented 2 months ago

@EthanSteinberg do you have any objections to completely unnested?

The main issue with a fully unnested format is that it's difficult to write labeling functions and neural network featurization on a completely unnested schema.

And, unlike double vs single nesting, the transformation between fully unnested and single nested is very expensive.

We also already have a completely unnested schema, MEDS Flat, and it is important for certain use cases where the overhead due to nesting is unacceptable (or nesting makes certain technology impossible to use). We might want to move MEDS Flat into the main standard (into this repository as another pyarrow schema), but I think regardless we would want a nested schema of some sort to make it possible to write labeling functions and neural network featurization in a straightforward manner.

I hate having multiple schemas (MEDS Flat vs MEDS), but I think this is unfortunately a case where both are necessary.

mmcdermott commented 2 months ago

I'm tentatively thinking I agree, @EthanSteinberg, but I think at this point we really need to nail down the kinds of usage we intend to support (in writing, as @tompollard suggested above), as the scope of what we are proposing to support is rapidly becoming unclear here, and which standard would be most useful in which setting is likewise getting hard to pin down. We'll also need conversion functions between both views, and in that case we should also then discuss whether the expectation is that ETLs will write out both schemas or just one (and if so, which), with conversions used as needed.

I've made #25 to track the usage document.

I'll reiterate that I think #25 is especially important as we consider having two primary schemas, because it is clear even in this conversation (and from #23) that we are all envisioning different kinds of usage patterns which require different data arrangements, and it isn't clear which of those patterns we want to principally support and, if so, how we intend to do that. Some concrete guidance would be helpful for us in designing the tools for this schema and its core data format, as well as for users in understanding how best to use MEDS.

scottfleming commented 2 months ago

I guess one key clarification would be: what is the conceptual purpose and form of MEDS Flat?

Specifically, by "MEDS Flat" do we mean an un-nested data format where all measurements across all sources (labs, medications, flows) are all integrated/joined for each patient?

Or, by "MEDS Flat" do we mean an un-nested data format for source-specific tables? So e.g., we have a labs table, a flows table, a medications table, all of which are in "MEDS Flat" format, but the key conceptual difference is that these tables aren't joined.

The un-joined "MEDS Flat" could be useful in its own right, but inherits a lot of the challenges that PyHealth has, where they treat each channel independently. This un-joined MEDS Flat is currently the intermediary for a lot of the ETLs we have in meds_etl.

The joined "MEDS Flat" is very useful and allows for a lot of important transformations to be performed much more quickly than if there were nesting involved. Counting the number of measurements per patient, finding the timestamp of the last measurement for each patient (important for calculating censoring times), filtering in/out all codes of a particular type -- all of these are painfully slow with nesting and lightning fast without.

how would you recommend storing the static_measurements? Should those just be as normal measurements but with timestamp set to null? That seems simplest to me

Yeah probably I would store these just as normal measurements but with timestamp set to null. That also seems conceptually clear to me, but violates the assumption that the timestamp column is non-null.

To @EthanSteinberg's point, there may still be value in having the single-nested representation for the purpose of, e.g., defining labels based on a particular horizon (though you can arguably still do this in polars using map_elements). It'd be nice to pin down some specific labeling functions that are trivial with single-nesting but extraordinarily difficult in the completely unnested schema.

In general, I'd be interested in having two concepts supported by the schema, with ETL tools to generate them and "recipes" for common ETL steps on top of them. The first would be the joined "MEDS Flat" discussed above. The second would be single-nested MEDS.

For purposes of e.g., cohort selection and code filtering, those should just be done on MEDS Flat and potentially there's a layer of tools to support that (or we just expect people to do the ETL into MEDS Flat on their end and define high-level cohort-selection operations & code-filtering operations using whatever backend they want such as duckdb, polars, Snowflake, BigQuery etc). Then for defining labels and modeling, I imagine that could happen for single-nested MEDS. But given how much faster operating on flat data vs. nested is, I would imagine almost everything would be done on MEDS Flat. For simple labeling tasks (i.e., find time of death based on if/when a death code appears in the patient's timeline) users can construct those on top of MEDS Flat. For more complicated labels, they can use our provided tools for nesting and then define those labels using the single-nested data as a very last step. But at least the user then has control over the tradeoff between engineering complexity and time, especially at large scales.

mmcdermott commented 2 months ago

@scottfleming -- just as a quick clarification, I would only be in support of a "joined" flattened view in which we would basically take the structure of the nested schema and just flatten it out entirely -- so a single (potentially sharded) table with four columns: patient ID, timestamp [can be null], code, and value. Obviously I'm skipping measurement properties and different value types here, but in essence that's what I'm picturing. Separate tables would in my opinion defeat the simplicity that MEDS brings to bear by unifying everything together into the "measurements" idea of code/value structs (though these would now be completely flat)
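
Something like the following, roughly (a sketch only; the field names and types are placeholders, not a finalized schema):

  import pyarrow as pa

  # one flat (potentially sharded) table; the timestamp is nullable to cover static measurements
  flat_schema = pa.schema([
      ("patient_id", pa.int64()),
      ("time", pa.timestamp("us")),
      ("code", pa.string()),
      ("numeric_value", pa.float32()),
  ])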

scottfleming commented 2 months ago

I would only be in support of a "joined" flattened view in which we would basically take the structure of the nested schema and just flatten it out entirely -- so a single (potentially sharded) table with four columns: patient ID, timestamp [can be null], code, and value.

Perfect, I think we're in agreement here. Though @EthanSteinberg brought up a good point that I overlooked which is that if you're lazy-loading tables anyway and let's say labs are stored separately from vitals, then "join" and "concatenation" are the same thing as long as the metadata columns are aligned. And concatenation is "free". Additionally, because with parquet's columnar storage you pay minimal extra cost for nulls (basically just the bit-map for remembering where the nulls are), then getting to a single table that can be read into polars from a collection of "MEDS Flat"-compliant files (one for labs, one for medications, one for vitals, etc) that all have the core patient_id, time, code, text_value, numeric_value, and datetime_value columns but different metadata columns is as simple as taking the union of metadata columns and adding any missing metadata columns to each of the individual tables. This reduces the "ETL" to just column renaming and typecasting in many instances, which dramatically reduces the barrier to using "MEDS" for people working with millions of patients.
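
In polars, for example, that union-and-fill step is essentially a diagonal concatenation (a rough sketch; the file names are illustrative):

  import polars as pl

  labs = pl.read_parquet("labs_meds_flat.parquet")      # core columns plus lab-specific metadata
  vitals = pl.read_parquet("vitals_meds_flat.parquet")  # core columns plus vitals-specific metadata

  # "diagonal" concatenation takes the union of the columns and fills missing metadata with nulls
  combined = pl.concat([labs, vitals], how="diagonal")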

Speaking from the industry side where we store all our data in Snowflake, for example, it also means that we can easily create a "MEDS Flat" view of our data and keep it in Snowflake for storage and analytics.

It also means that doing things like filtering in/out measurements based on metadata columns (e.g., visit type) is trivial, fast, and can easily be done using basically any backend the user prefers.