ExpandValueSet is called aggressively

EvanMachusak commented 1 year ago

Consider this CQL:

define function "code in value set"(code Code, codes List<Code>):
    code in codes

When this function is called with a value set, e.g.:

valueset "Value set": 'url'
define "call it":
    "code in value set"(Errors."Unit mismatch", "Value set")

The operand in the ELM to the "code in value set" call is an ExpandValueSet.

The value set does not in fact need to be expanded, because this is valid:

define function "example"(code Code):
    code in "Value set"

The expression of "example" is an InValueSet.

Eagerly expanding the value set ahead of time introduces a performance penalty for this pattern.

It's possible to declare the parameter as a System.ValueSet which will avoid the ExpandValueSet operand, but because of the automatic expansion of value sets with this usage, these two signatures become ambiguous:

define "call it":
    "overload"(Errors."Unit mismatch", "Value set")

define function "overload"(code Code, codes List<Code>):
    code in codes

define function "overload"(code Code, valueSet System.ValueSet):
    code in valueSet

CQL-to-ELM picks the second overload in this case. That's probably correct, but unintuitive.

We use the pattern of passing value sets to functions taking a List of Code often. It's a more flexible signature since it lets us pass either a value set or a manually constructed list of codes.

This is a significant performance penalty when run on the CQF ruler as prematurely expanding value sets which can be large (50k+ rows) results in linear checking of a list of codes instead of an optimized hash lookup of valueset membership.

brynrhodes commented 7 months ago

I think it's important to call out here that code in List<Code> is a different operator than code in ValueSet, and that there's a really important distinction there. I wonder if it would make more sense to always use the ValueSet overload, and to support a way to convert a List<Code> to a ValueSet, so something like:

define fluent function toValueSet(List<Code>): ValueSet

Then you can 1) always use the optimization of lazy expansion, and 2) ensure that you're always using terminology semantics for the in operator.

brynrhodes commented 7 months ago

Having said that, there is, by design, no way to construct a ValueSet with the codes that it contains. So let me ask the question, why not have first-class value set definitions for all the lists of codes you want to support?

EvanMachusak commented 6 months ago

Because we can implicitly convert a ValueSetRef to a List<Code>, it makes List<Code> the more attractive choice for all APIs because we can either do:

define function Foo(codes List<Code>): ...`

define Bar: Foo("My ValueSet")
define Baz: Foo({ "DRC1", "DRC2" })

The issue I raised here is related to the fact that ExpandValueSet is an expensive operation. Because there's an implicit conversion, the programmer isn't aware it's happening.

In our SDK, we implemented an implicit conversion between ValueSet to List<Code> in such a way that we preserve the fact that the source object is still a ValueSet, e.g. you can think of it like our ValueSet representation implements IList<Code> so we can use it anywhere that a CQL List<Code> (represented in .NET as IList<Code>) is expected. We actually do this with composition and a facade, but it amounts to the same result.

This lets us implement the In operator for CQL's List<Code> more efficiently if that List<Code> is actually a ValueSet.

We currently don't implement ExpandValueSet the way I think it's intended to be (e.g., you reach out to the terminology server and ask for an expansion). We don't support terminology servers in general at the moment, because HEDIS measures would never allow you to expand your own value sets. We provide the explicit expansion that you must use, and you are not allowed to use any other codes.

cqframework / clinical_quality_language

ExpandValueSet is called aggressively #840