OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/

Support for observation / measurement values #67

Closed pavgra closed 3 months ago

pavgra commented 5 years ago

I am wondering if there is a way today to create distinct covariates for Observations / Measurements based on the values in the records, e.g. value_as_concept_id. In other words, not a covariate per observation_concept_id, but a covariate per combination of observation_concept_id and value_as_concept_id (e.g. observation_concept_id = 3022698 'Stage group. Clinical cancer', value_as_concept_id = Stage 0, Stage I, or Stage II).

schuemie commented 5 years ago

Hi @pavgra ! There is currently only an analysis for measurement values, so it is based on value_as_number, not value_as_concept_id.

I agree it would be good to also have covariates based on value_as_concept_id. One slight challenge is how to construct covariate IDs. We only have 53 bits in R to represent an integer. 32 bits are needed for the main concept ID, and 3 digits (roughly 10 bits) are used to encode the covariate analysis ID. That leaves 11 bits (about 3 more digits) to encode the value_as_concept_id. Another constraint is that we want covariate IDs to be the same across analyses and databases, so I can't just list all concept IDs observed in a database and label them 1 through n, because that would be different on every database.

The solution I use to encode the unit_concept_id for measurement values is to just use the 3 least significant digits of the unit_concept_id. The downside is that we can have ID collisions, but it does meet all the other requirements. We could use a similar approach for value_as_concept_id?
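
As a rough sketch (not the exact package code), the composition described above could look like this in R, with the function name and all IDs purely illustrative:

# Compose a covariate ID from a measurement concept ID, the 3 least
# significant digits of the unit concept ID, and a 3-digit analysis ID.
composeCovariateId <- function(conceptId, unitConceptId, analysisId) {
  # Plain numeric (double) arithmetic: the result exceeds 32-bit integers
  # but stays below 2^53, so no precision is lost.
  conceptId * 1e6 + (unitConceptId %% 1000) * 1000 + analysisId
}
composeCovariateId(3025315, 9529, 708)  # illustrative IDs only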

pavgra commented 5 years ago

@schuemie , why can't we use hashing to create the covariate IDs? You could concatenate all the original IDs and run a hashing function over the string. This is a fairly standard approach, and I believe fewer collisions would occur.

schuemie commented 5 years ago

That is an interesting idea! Because we know the universe of IDs (standard concept IDs in the vocab) we can even compute how often collisions occur beforehand.

I would like to have the same covariate IDs generated across platforms. Also, the hash should not exceed the 53 bits we have for integers in R. Any suggestions for candidate hashing algorithms? (MD5 produces 128-bit hashes)

vojtechhuser commented 4 years ago

This is a very good discussion. I am wondering how we solved this issue for value_as_number. Did we pre-coordinate the ranges so that they fit within 53 bits?

> One slight challenge is how to construct covariate IDs. We only have 53 bits in R to represent an integer. 32 bits are needed for the main concept ID. 3 digits (roughly 10 bits) are used to encode the covariate analysis ID. That leaves 11 bits (3 more digits) to encode the value_as_concept_id

I think we are concatenating the numbers and treating them as "strings" (the analysis ID slapped next to the concept ID to create a new number). If we adopt a different paradigm, the 53-bit limit will go away.

schuemie commented 4 years ago

For value_as_number, the value is simply the (continuous) value of the covariate.

However, for value_as_concept_id we need a binary covariate per concept_id, and each covariate needs a unique covariate ID.

Changing the covariate IDs to strings would basically break every tool we have.

schuemie commented 3 years ago

The upcoming version of DatabaseConnector will introduce 64-bit integers, so we should have sufficient precision to generate unique covariate IDs for the observation_concept_id, value_as_concept_id, and analysis_id combinations.

schuemie commented 3 years ago

Note that the new version of DatabaseConnector does not output 64-bit integers, because I found there were too many issues with integer64 in R that would break our apps down the line. Integers should still be < 2^53 to avoid precision loss, but at least now DatabaseConnector will throw an error if there is precision loss.

schuemie commented 4 months ago

@ginberg, @anthonysena : The vocab team is making some major changes to the vocab, and likely will introduce many more of these 'postcoordinated' concepts. The urgency for supporting this has therefore increased. Could we prioritize this?

To recap, the issue is this: in the observation and measurement tables we have observation_concept_id and measurement_concept_id, respectively, and both tables have value_as_concept_id. When value_as_concept_id is not NULL, the two together define a binary feature. For example, the observation_concept_id might refer to 'History of', and the value_as_concept_id could refer to 'Breast cancer', and together they would indicate the patient has a history of breast cancer.

The challenge with creating features from this is the covariate ID. Normally, covariate ID = 1000 * concept ID + analysis ID, but here we have two concept IDs. In R, integers are 32-bit (the bit64 package offers the integer64 type, but it is unusable for various reasons), so we use numeric to represent covariate IDs. This means we have a precision of 53 bits. A concept ID requires 32 bits of precision, so two concept IDs together require more than we can store in 53 bits.
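
To illustrate the precision limit (doubles have a 53-bit mantissa, so integers above 2^53 are no longer represented exactly):

# Adjacent integers above 2^53 collide when stored as doubles.
2^53 == 2^53 + 1  # TRUE: indistinguishable as doubles
2^52 == 2^52 + 1  # FALSE: still exact below 2^53
# Two 32-bit concept IDs side by side would need 64 bits,
# more than a double can hold exactly.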

We could use hashing to get around this, but a challenge here is that there is no hashing function consistently available across all supported database platforms (hashing is not an ANSI SQL function). Perhaps we can devise a simple hashing function ourselves that can be implemented in SQL and will work consistently across platforms?

Note that the reason we derive the covariate ID from the concept IDs is that we get the same covariate ID everywhere. If, for example, I build a predictive model on one database, the covariate IDs in the model will match those created in some other database. The reason we use a numeric representation instead of a string is that it is more efficient: we pull and store a lot of covariate data (billions of records), and also do lots of processing afterwards (e.g. grouping by covariate ID, filtering, etc.).

schuemie commented 4 months ago

ChatGPT suggests, for example, that we can hash a value by performing an XOR between the value and its right-shifted version. We can convert a 32-bit integer (a concept ID) to a 16-bit hash using SQL Server syntax:

(@value ^ (@value / 65536)) & 0xFFFF

where ^ is the bitwise XOR operator, and & the bitwise AND operator.
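
The same fold can be sketched in base R to check values (assuming inputs below 2^31, which holds for concept IDs):

# Fold a 32-bit value down to 16 bits: XOR it with its 16-bit right shift
# (integer division by 65536), then mask to the lowest 16 bits.
hash16 <- function(value) {
  bitwAnd(bitwXor(value, value %/% 65536L), 0xFFFF)
}
hash16(3022698)  # example concept ID from the original question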

schuemie commented 4 months ago

@dimshitc indicates we can only have the following combinations of domains:

Measurement & Meas Value
Observation & Any domain

All are restricted to Standard Concepts. This adds some nice constraints. We could sample from this space, and evaluate how often the hash introduces collisions.

schuemie commented 4 months ago

I wrote some code to test this idea, and it seems to work: https://github.com/OHDSI/FeatureExtraction/blob/issue67/extras/TestHashForPostcoordinatedConcepts.R

Since we have 53 bits in total, and 10 go to the analysis ID, I started with 21 bits per concept ID. Across all possible standard concept IDs, the collision rate with 21 bits is massive: 46.31%. But if I restrict to the standard concepts observed in the OHDSI network, that drops to 0.45%.

Based on this, I propose using 18 bits for the 1st concept ID (observation_concept_id or measurement_concept_id), leaving 24 bits for the 2nd (value_as_concept_id). That way, I don't see any collisions when restricting to the concepts observed in OHDSI.
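
One way the pieces might be packed into a single covariate ID under that split (the exact packing, and how the analysis ID is appended, are my assumptions rather than something settled in this thread):

# Generalize the shift-and-XOR fold to k bits, then combine:
# 18-bit hash of concept 1, 24-bit hash of concept 2, 3-digit analysis ID.
# 2^42 * 1000 < 2^53, so the result fits in a double without precision loss.
hashBits <- function(value, bits) {
  bitwAnd(bitwXor(value, value %/% 2^bits), 2^bits - 1)
}
composePostcoordinatedId <- function(conceptId1, conceptId2, analysisId) {
  (hashBits(conceptId1, 18) * 2^24 + hashBits(conceptId2, 24)) * 1000 + analysisId
}
composePostcoordinatedId(4214956, 4112853, 999)  # illustrative IDs only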

At the bottom of that file you'll see I implemented this hashing idea in SQLite (I did not yet add the analysis ID). SQLite does not have a bitwise XOR operator, so I had to reproduce XOR with OR and AND. This is actually nice, because the resulting SQL is now very platform agnostic (as far as I can see, every platform implements | and &).
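
The identity being used is x XOR y = (x OR y) - (x AND y): OR sets the bits present in either operand, AND the bits present in both, and subtracting removes the doubly counted bits. A quick check in R (I haven't verified that the linked file uses exactly this form):

xorViaOrAnd <- function(x, y) {
  bitwOr(x, y) - bitwAnd(x, y)
}
x <- 3022698L
y <- 46L
xorViaOrAnd(x, y) == bitwXor(x, y)  # TRUE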

@ginberg , @anthonysena : Let me know if you agree with this approach, and I can draft a PR.

ginberg commented 4 months ago

Hi @schuemie, your approach looks good to me. Great that you found a way to execute the hash on all database platforms, and that there are no collisions with 18 and 24 as bit sizes. But I am wondering how you created the universe table: how do you know which concept IDs are used in OHDSI? And what if the total number of used concept IDs increases in the future and we get a collision?

schuemie commented 4 months ago

In 2019 we performed the OHDSI Concept Prevalence study, where a large number of databases in the OHDSI network provided their concept counts. I'm using the universe table constructed in that study.

The numbers I computed are very extreme upper bounds on the probabilities of collisions, since I took concept IDs that occurred anywhere across the CDM. Currently, the number of unique concepts that appear as pairs in the observation and measurement tables is probably less than 100. These may become thousands in the future, which would still make collisions very unlikely.

A collision may at some point occur, but would in most cases not be a major issue. For example, for patient-level prediction and cohort method it would simply mean 2 (of the >100,000) covariates are in fact merged and therefore maybe not as predictive. The absolute worst case scenario would be if there's a collision for a feature we desperately want to use for characterization and show separately to the user. That is not impossible, but less likely than being hit by lightning.

ginberg commented 4 months ago

@schuemie I see, thanks for explaining! I agree with this approach then. It would be good to document this hashing functionality and the fact that there might be a collision in the future.