cqframework / clinical-reasoning

CQF Clinical Reasoning on FHIR for Java
https://www.cqframework.org/clinical-reasoning/
Apache License 2.0

FHIR Bundle Engine Performance improvements #140

Open csenn opened 2 years ago

csenn commented 2 years ago

I've been wondering about the maximum performance possible with cql_engine for use over populations. The case I'm looking at is evaluating a single CQL library (along with its included libraries and ValueSets) against a population of patient FHIR Bundles.

I was having issues with the evaluator Dagger API because I wanted to ensure the CQL is compiled to engine.Library only once, rather than on every iteration of the per-patient loop, which required a new DataProviderFactory each time. But I was able to use cql_engine more directly and get it working with the same underlying classes.

In the test I'm using 1,000 Synthea FHIR bundles. First, they are all loaded from disk into memory and parsed as HAPI FHIR bundles. Then all the CQL is translated into ELM and loaded into engine.Library classes.

The test uses a small to mid sized library with 4 total included libraries (including FHIRHelpers). It also includes two ValueSets with 116 codes and 15 codes respectively.

The current performance is quite good with 2,745 FHIR Bundles per second over 15 runs. But just to see if there was any room for improvement, I created a FlameGraph to check out the execution and some low hanging fruit immediately popped out:

FlameGraph Perf

81% of the execution is taking place in the anyCodeInValueSet function.

This algorithm is effectively an O(mn) calculation, where:

m = the number of resource codes being checked
n = the number of codes in the ValueSet

Instead, if the Codings were stored in a hashable lookup of some sort, you should be able to use set operations and get this down to O(min(m, n)). See Python set operations time complexity for reference.

Instead of:

for resource in resources:
    for valueSetCode in valueSet.codes:
        if resource.code == valueSetCode:
            return true

You could do:

for resource in resources:
    if resource.code in valueSet.hashedCodeLookup:
        return true

There are a few things to work out, such as a resource having multiple codes, or the case where a ValueSet has fewer codes than the number of resources, where you'd want to flip the loop.
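The loop flip above can be sketched in Java. This is a minimal, hypothetical sketch, not the engine's actual types: the class name and the "system|code" string key are assumptions; the point is only that the ValueSet codes are hashed once so each membership check is O(1).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a hashed ValueSet lookup (not an engine class).
// Each code is represented here as a String[] of {system, code} for brevity.
public class HashedValueSetLookup {

    // Codes are keyed on "system|code" so lookups are constant time.
    private final Set<String> hashedCodeLookup = new HashSet<>();

    public HashedValueSetLookup(List<String[]> valueSetCodes) {
        for (String[] sc : valueSetCodes) {
            hashedCodeLookup.add(sc[0] + "|" + sc[1]);
        }
    }

    // Returns true if any of the resource's codings is in the ValueSet.
    public boolean anyCodeInValueSet(List<String[]> resourceCodings) {
        for (String[] sc : resourceCodings) {
            if (hashedCodeLookup.contains(sc[0] + "|" + sc[1])) {
                return true;
            }
        }
        return false;
    }
}
```

Building the set costs O(n) once per ValueSet; after that each resource check is O(m) regardless of ValueSet size.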

I'll spend a little time prototyping when I get a chance, but interested in your thoughts on this and if it makes sense.

csenn commented 2 years ago

Turns out it was pretty easy to try this: I just needed to implement a TerminologyProvider and swap it in for BundleTerminologyProvider.

Here is a gist for OptimizedBundleTerminologyProvider.

The test was run with ValueSets of 10, 100, 1,000, 5,000, and 10,000 codes. Each was run over 10,000 Synthea records.

Original 10 codes - 3.992s
Original 100 codes - 4.073s
Original 1,000 codes - 5.712s
Original 5,000 codes - 12.521s
Original 10,000 codes - 32.86s

HashSet 10 codes - 3.942s
HashSet 100 codes - 3.833s
HashSet 1,000 codes - 4.284s
HashSet 5,000 codes - 4.135s
HashSet 10,000 codes - 4.437s

So as the ValueSet gets bigger, the O(mn) calculation has more of an effect, although even at 1,000 codes the difference is only about 1 second. The HashSet lookup maintains effectively constant time.

For small ValueSets the effect is minor. But some ValueSets get quite large, as many as 20,000 - 30,000 codes, and the optimized class is useful there. Whether ValueSets should be that large is another question, though.

One more thing to note: it takes ~22 seconds to load and parse 1,180 Synthea records from disk into HAPI objects, so all of these runs include significant I/O overhead.

import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.context.FhirVersionEnum;
import ca.uhn.fhir.parser.IParser;
import org.hl7.fhir.r4.model.Bundle;

FhirVersionEnum fhirVersionEnum = FhirVersionEnum.valueOf("R4");
FhirContext fhirContext = fhirVersionEnum.newContext();
IParser parser = fhirContext.newJsonParser();

List<Bundle> dataBundles = new ArrayList<>();
for (File file : syntheaFolder.listFiles()) {
    if (file.isFile()) {
        // try-with-resources closes the reader when the parse is done
        try (FileReader fileReader = new FileReader(file)) {
            Bundle syntheaBundle = parser.parseResource(Bundle.class, fileReader);
            dataBundles.add(syntheaBundle);
        }
    }
}
JPercival commented 2 years ago

This is great! It needs a few tweaks to work as a drop-in replacement. The CQL specification makes a distinction between equals and equivalent semantics.

https://cql.hl7.org/09-b-cqlreference.html#equal https://cql.hl7.org/09-b-cqlreference.html#equivalent

For Code values, equivalence is defined based on the code and system elements only. The version and display elements are ignored for the purposes of determining Code equivalence.

So, we need to work through that for the HashableCode.
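A minimal sketch of what that could look like, assuming a hypothetical key class (not the engine's actual Code type): per the CQL equivalence rules quoted above, only code and system participate in equals() and hashCode(), while version and display are carried along but ignored for identity.

```java
import java.util.Objects;

// Hypothetical hashable key following CQL *equivalence* for Code values:
// identity is determined by code and system only.
public final class EquivalentCodeKey {
    private final String code;
    private final String system;
    private final String version; // ignored for equivalence
    private final String display; // ignored for equivalence

    public EquivalentCodeKey(String code, String system, String version, String display) {
        this.code = code;
        this.system = system;
        this.version = version;
        this.display = display;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EquivalentCodeKey)) return false;
        EquivalentCodeKey other = (EquivalentCodeKey) o;
        // Compare only code and system, per CQL equivalence.
        return Objects.equals(code, other.code)
            && Objects.equals(system, other.system);
    }

    @Override
    public int hashCode() {
        // Must hash on the same fields equals() compares.
        return Objects.hash(code, system);
    }
}
```

Two keys that differ only in version or display would then collide to the same HashSet bucket and compare equal, which is the equivalence behavior a terminology membership test wants.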

As far as parsing all the files: the filename is the ValueSet Id, but the ValueSets are referenced by Url in the CQL, and there's not necessarily any relationship between the Id and the Url (the Url is actually defined as canonical base + name). So we need a solution that maintains an index of url to filename. Then we only have to load the appropriate ValueSets for any given execution.
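One possible shape for that index, as a sketch under assumptions: the class is hypothetical, and the url extractor is injected to keep the example self-contained; in practice it would parse each file with HAPI and return ValueSet.getUrl().

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical url -> file index, built once up front so a given execution
// only loads the ValueSets it actually references.
public class ValueSetIndex {
    private final Map<String, File> urlToFile = new HashMap<>();

    // urlExtractor reads a file and returns its canonical url
    // (e.g. by parsing it with HAPI and calling ValueSet.getUrl()).
    public ValueSetIndex(File folder, Function<File, String> urlExtractor) {
        for (File file : folder.listFiles()) {
            if (file.isFile()) {
                urlToFile.put(urlExtractor.apply(file), file);
            }
        }
    }

    // Resolve a ValueSet url to the file that holds it, or null if unknown.
    public File resolve(String url) {
        return urlToFile.get(url);
    }
}
```

The one-time indexing pass still has to open every file, but subsequent executions resolve a url directly instead of parsing the whole folder.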

JPercival commented 2 years ago

It's also probably worthwhile to file an issue on the cql-engine project to provide hashCode and equals implementations on the CQL dataTypes.

csenn commented 2 years ago

Hey @JPercival, glad you are interested in incorporating this. I do know of realistic use cases within HEDIS measures where there are large ValueSets and this speed-up could definitely be useful.

I am not at all attached to my implementation; I just wanted to collect some numbers first. In fact, I was a little surprised there was not more benefit in the 100-1,000 codes-per-ValueSet range, as I would have assumed that 100 extra operations per lookup could as much as double the full execution time of a small-ish CQL file. But who knows what the Java compiler is doing.

In terms of equals vs. equivalent, I understand that distinction in CQL, but the reason it was done this way is because of this: https://stackoverflow.com/questions/5396939/hashcode-and-equals-for-hashset. equals() is required along with hashCode() for the HashSet to work. So HashableCode was meant to be local and used only in this file, exactly like that line; if you have a better idea for something higher in the inheritance tree, I'm happy to hear it. I'm newish to Java, so I might not have the best architectural ideas yet.

This implementation also holds essentially three copies of all the ValueSet codes in memory, which is not especially efficient either.

In terms of ValueSet file names, that could definitely be helpful in some cases. We pull ValueSets out of Postgres now: all ValueSets for all libraries in the execution tree. In theory there may be some extra ValueSets, though (ones used in an included library but not accessible by the calling library).

EvanMachusak commented 2 years ago

Hashing the value set expansions is also what we do at NCQA. HEDIS measures use some value sets that have well over 50,000 codes (acute conditions comes to mind).

We don't support an equals vs. equivalent distinction on these codes because in practice it's never what measure authors want. I'd go so far as to say the equals vs. equivalent semantics should not exist for the Code type in the CQL language. If you really wanted to compare version and display, you could do so explicitly, but you won't ever want to do that.