OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/
Add features from post-coordinated concepts (value_as_concept_id) #262

Closed schuemie closed 3 months ago

schuemie commented 4 months ago

This PR adds the construction of features from post-coordinated concepts in the measurement and observation tables, as discussed in issue #67.

As discussed, documented, and evaluated in this file, the covariate builder uses a simple hashing function implemented in SQL to compute covariate IDs from the two concept IDs and analysis ID. Although collisions in covariate IDs are unlikely, I programmed defensively so that if a collision occurs, only one covariate is selected. Note that, for the SQL hashing to function on Oracle and Snowflake, the latest version of SqlRender (1.18.0) is required, as specified in the DESCRIPTION file.
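To make the idea concrete, here is a minimal sketch of hashing two concept IDs plus an analysis ID into one covariate ID, with defensive handling of collisions. It is not the SQL from this PR: the multiplier, the modulus, the rule that the analysis ID occupies the last three digits, and the function names `covariate_id` and `build_covariate_ref` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical constants; the real hashing function lives in the PR's SQL
# and is not reproduced here.
MULTIPLIER = 31
HASH_MODULUS = 1_000_003

Pair = Tuple[int, int]  # (concept_id, value_as_concept_id)


def covariate_id(concept_id: int, value_as_concept_id: int, analysis_id: int) -> int:
    """Combine two concept IDs and an analysis ID into a single covariate ID.

    Assumption: the analysis ID is stored in the last three digits.
    """
    if not 0 <= analysis_id < 1000:
        raise ValueError("analysis_id must fit in three digits")
    # Plain integer arithmetic only, so the same expression could be written in SQL.
    hashed = (concept_id * MULTIPLIER + value_as_concept_id) % HASH_MODULUS
    return hashed * 1000 + analysis_id


def build_covariate_ref(pairs: Iterable[Pair], analysis_id: int) -> Dict[int, Pair]:
    """Map each covariate ID to exactly one (concept, value) pair.

    If two pairs collide on the same covariate ID, only the smallest pair
    (in sorted order) is kept, so the result is deterministic.
    """
    ref: Dict[int, Pair] = {}
    for pair in sorted(set(pairs)):
        cid = covariate_id(pair[0], pair[1], analysis_id)
        ref.setdefault(cid, pair)  # keep the first pair seen for this ID
    return ref
```

In this sketch the pairs `(1, 0)` and `(0, 31)` hash to the same value, so `build_covariate_ref` keeps only `(0, 31)`. Sorting before insertion is one simple way to make the surviving covariate independent of input order.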

The current implementation adds a column to the covariateRef table in the CovariateData object documenting the value_as_concept_id. This has value NA for all other covariates constructed by FeatureExtraction. Note that this required changes in the Java code, and we therefore have a new JAR file.

I added several unit tests, including these features in test-query-no-fail.R so we know they don't cause errors on the various database platforms, and a simple test on Eunomia to see if the computations are correct. More unit tests could be added, but I would ask others to implement these.

I have taken the liberty of adding these new features to the default set of covariates. I know this will therefore basically affect every HADES study moving forward, but it seems that if other features based on measurements (measurements exist (yes/no) and measurement in normal range (below/within/above)) are part of the default set, then these new ones should be as well.

Related to this last comment: I observed that the measurement feature for normal range in the short term was not part of the default set, even though the long-term one was, as well as other short-term measurement features. This must be an error (probably by me), so I also changed it to be included in the default set.

gowthamrao commented 3 months ago

Wow! This is good stuff. I am looking forward to testing this.

ginberg commented 3 months ago

@schuemie thanks for your contribution and the explanation in the PR! It looks good to me. However, I can't test it locally on my Mac because it doesn't download the latest SqlRender from CRAN.

> install.packages("SqlRender")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.3/SqlRender_1.17.0.tgz'
Content type 'application/x-gzip' length 460315 bytes (449 KB)
==================================================
downloaded 449 KB

That's probably also why the R-check on MacOS fails. I noticed that SqlRender has been released recently; do you think it will be available for all platforms soon?

schuemie commented 3 months ago

Yes, I'm surprised the MacOS binary isn't yet available on all CRAN servers. You can already install 1.18.0 on your Mac from GitHub, but it should be available from CRAN any day now.

ginberg commented 3 months ago

Yes, that does work locally indeed. But I think it's better to wait to merge the PR until the binary is available, since otherwise we will probably get an error when making a new release and pushing it to CRAN.

anthonysena commented 3 months ago

I merged in the recent changes to ensure that all OHDSI test DB servers are used for unit tests and this branch passed all tests. I'll merge in these changes now.