OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0
9 stars 13 forks source link

Replace javascript semver with pure-Java implementation #188

Closed chrisknoll closed 1 year ago

chrisknoll commented 1 year ago

Currently, the codebase depends on a JavaScript runtime engine that is embedded in Java in order to invoke semver expressions that are compatible with javascript libraries. Since this was introduced in 2018, there has been progress making npm-based semver expressions supported in a pure Java library.

The issue with the current implementation is the dependency on a JavaScript engine leads to a 50MB dependency that excludes us from being included in certain repository environments (ie: CRAN).

There is a library that I would like to investigate: semver4j.

This issue is to track the change to remove the JavaScript implementation of semver checks and replace it with a pure Java impl.

RowanErasmus commented 1 year ago

@chrisknoll I'll pick this up if you have not already begun

chrisknoll commented 1 year ago

Sounds good, thanks. I saw you removed guava, but I'm curious about that PR about where the implementation from the JavaScript-based semver work (that required guava, I think?) is replaced with the semver4j. I may have missed it, I'll look closer.

RowanErasmus commented 1 year ago

I'm working on it now, could not find something right out of the box, but am piecing together some functions from sever4j to make it work...

Guava was just required for the single method I replaced with one line...

chrisknoll commented 1 year ago

Ok, for context (if this helps) the purpose of this code is to yield a CDM version for a given asset (for example, cohort expression) and during serialization the function is called and it assigns the appropriate CDM version to the object.

The CdmVersion is a custom annotation used to specify the CDM version that is allowed (you can the use at the CohortExpression and criteria attribute level).

This is all handled in the StandardUtils repo in the CdmVersionUtils class.

I think the central class to needing this guava (or is it gral??) runtime is in the SemverUtils function getRangeIntersection. If this can be changed from using a jsContext to invoke a javascript function and instead jsut use the pure-java impl, then I think that would get rid of it.

So, I may be confusing guava with GralVM so, sorry about that, but this issue is about removing gralvm and the implementation in SemverUtils with a pure java impl.

For reference, this is the javascript that performs the semver-intersect work that is used to determine the semver intersection logic. So there's more to the work than just replacing a semver javascript implementation with a prue java one, but also taking the 'semver intersection logic' from the javascript implementation and changing it into a pure-java one.

-Chris

RowanErasmus commented 1 year ago

Thanks for the details, its the Graal stuf indeed required for the js.

The only question I have is to confirm wether or not the ohdsi cdmvalue only uses these >=, = kind of annotation and not this ~ ^ stuff from npm?

Rowan

chrisknoll commented 1 year ago

I believe we only use the >, >= to indicate that the supported version is at least a value (like >= 5.0 or >= 5.4. I think we will need to support a range that is <= because eventually we get to things that are no longer supported in 6.0 so we will say something like range= >=5.4 && < 6.0. Something like that. I do not believe we would say something like ^5.2....

I think the idea was that you have your root object like cohort expression that says >=5.0 From this starting point we get a range (maybe the starting point is actually * and from there we set up a lower and upper bound for the range. Then we scan through the object elemetns looking for CdmVersion annotatins, and we 'shrink' the valid cdm range.

Example: '* + >=5.0 becomes >= 5.0, then if you find >=5.3 later, it becomes >= 5.3...and later if we see < 6.0, the final range will be >= 5.3 && < 6.0 (in other words valid range is between 5.3 and up to but not including 6.0). Maybe if we found things like ^5.2, those things get put into the range expression like >= 5.3 && < 6.0 && ^5.2.

The point of yielding this cdm version range is that once we have this cdm version range expression (ie: >= 5.3 && < 6.0 && ^5.2), we can then check to see if a given CDM version is compatable with the semver expression:

isCompatable("5.7", ">= 5.3 && < 6.0 && ^5.2") => true isCompatable("6.1", ">= 5.3 && < 6.0 && ^5.2") => false isCompatable("5.2", ">= 5.3 && < 6.0 && ^5.2") => false

We don't have a use of isCompatable yet but the idea is that when we support multiple CDM versions in circe, we can check if a target Source (which has a specific version) is compatible with the expression semver spec In the above example, the first param is the CDM version of the source, and the second param is the CdmVersion range from the java object (from determineCdmCompatibility.

Note that the determineCdmCompatibility() function is that you call the function and has a side effect: the input param (the CdmCompatibilitySpec ) has its setCdmVersionRange() on it after the version range is determined inside the function. It feels a little confusing to me but I think they wanted to have this function as a void() signature.

But I think that covers the extent of the functionality: determineCdmCompatability establishes the semver range expression based on the elements (found by traversing the object graph). If there is a conflict for some reason (like the object resolves to requiring <6.0 && >= 6.1) it will throw an exception.

RowanErasmus commented 1 year ago

I've had a first crack at it https://github.com/OHDSI/StandardizedAnalysisUtils/pull/7

The && symbol is not part of the node semver syntax (which was the original implementation) it has ranges like this 3.4 - 5 in stead, but the idea stays the same :-)

chrisknoll commented 1 year ago

The && symbol is not part of the node semver syntax (which was the original implementation) it has ranges like this 3.4 - 5 in stead, but the idea stays the same :-)

Ok, that's cool. I think we can live with the functionality where it start with 'any range' (*) and then based on what it finds in the attributes of >, >=, <, <= it will narrow the range to a finite bound (ie: I think the original impl would be able to produce multiple ranges with different boundaries, but if we don't have && anymore and it's just a 'between' operator then It should like we can only have 1 range). Is my assumption about the single range true? or can you have something like 5.0 - 5.2 5.3-6.0? I'm not sure how you use the exclusion of a value (like 5.0-6.0 but do not include 6.0).

RowanErasmus commented 1 year ago

Yes you are right, I did not bother with taking care of the 'or' which is a || I think, and it would be 5.0 - 5.2 || 5.4 - 6.0 so you get everything between 5.0 and 6.0 except 5.3. If you pass that now it throws an exception I can give it a go I guess, adding a couple of lines of extra insanity never hurt anyone.

chrisknoll commented 1 year ago

Fixed in fa4391bc4bb5899d7ba4a2a2bdc133d2fa856404.