OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0
9 stars 13 forks source link

Vocabulary schema is not supported #29

Closed wivern closed 6 years ago

wivern commented 6 years ago

Now it`s limited to utilize the same schema for cdm and vocabularies.

chrisknoll commented 6 years ago

That's correct. The circe expression runs off of a CDM schema (which includes vocabulary tables) not a 'vocabulary schema'. The notion of a 'vocabulary schema' doesn't actually exist, I think you are thinking about the "vocabulary daimon' in WebAPI where you can have a subset of the CDM tables (ie: the vocabulary tables) and do vocabulary queries against that daimon.

If you wanted to share vocabulary tables across different CDM schemas, the approach for that is to create views onto your 'root' vocabulary tables so that you don't need to duplicate the data. I understand your use case for wanting this feature, but it's very important to the circe expression that the patient-level data that is queried is referencing the same vocabulary tables within the same CDM.

pavgra commented 6 years ago

@chrisknoll, although we can resolve the limitation by views, what is the problem with making circe-be more flexible in this case and allowing to setup vocab schema differing from the main one? Moreover this would remove the misleading with source daimons in WebAPI.

chrisknoll commented 6 years ago

Describing this change as being 'more flexible' is a strawman. Of course I'm for improving the code by added flexibility, but this feature does 2 things:

  1. forces you to pass in a vocabulary schema
  2. opens the possibility that you can now pass an incorrect vocabulary schema that isn't the same vocabulary as the vocabulary used in the CDM you are querying. Without this feature, you can't make this mistake.

So, if we want to go forward with this to have the 'user beware' attitude where they could mismatch the vocabulary with the CDM and export a set of vocabulary tables where no documentation exists on how to set up this mode, at the very least we should not require the vocabulary schema setting. The code:

From https://github.com/OHDSI/circe-be/pull/30/files#diff-e6a86866fbd0d23e02d938576ce96f9aR441

     if (options.vocabularySchema != null)
       resultSql = StringUtils.replace(resultSql, "@vocabulary_database_schema", options.vocabularySchema);

Should do something like this:

     if (options.vocabularySchema != null)
       resultSql = StringUtils.replace(resultSql, "@vocabulary_database_schema", options.vocabularySchema);
    else if (options.cdmSchema != null)
       resultSql = StringUtils.replace(resultSql, "@vocabulary_database_schema", options.cdmSchema);

That way, existing code that doesn't set the vocabularySchema option will continue to work.

pavgra commented 6 years ago

@chrisknoll, such approach seems very reasonable. Then will you be ok with the PR, if we make the changes?

anthonysena commented 6 years ago

Hi - just chiming in here since @chrisknoll tagged me on the PR. I'm not in agreement with this premise:

Now it`s limited to utilize the same schema for cdm and vocabularies.

The standardized vocabularies are a required part of a CDM. Why do we need to be more flexible in terms of the location of where these tables reside? They are a mandatory part of the CDM schema. Plenty of other code in the various repositories assume the vocabulary tables co-exist with the other CDM tables and do not offer the flexibility to change the schema.

Can you help us understand the use-case a bit more for cohort definitions?

chrisknoll commented 6 years ago

@anthonysena I'm in complete agreement with you. With a paramaterized vocabulary option, you can load a new vocabulary without refreshing your patient-level data, and expect to get the right results, but the patient-level data depends on the vocabulary used to generate it, so the usecase of changing a vocabulary without refreshing the cdm data is an invalid one...

@pavgra: yes, if you make the vocabulary schema optional, then I don't have any objectives from a technical perspective. However, the notion that you can just point to an arbitrary vocabulary that wasn't used to generate the CDM's data is an invalid use case (ie: you'd get invalid results). In addition, we'd have to then go through all the other parts of WebAPI (estimation, incidence rates, prediction, cohort method) and make all those processes 'vocabulary aware' which is an additional burden. And by doing all of this, we're advocating the notion of separating vocabulary from CDM data which may be contrary to the CDM structure.

So, yes, at the very least, changing the PR to make the vocabulary schema optional is a requirement. However, we need to hear from the people who are technical managers of the CDM and vocabulary (@clairblacketer , @cgreich) about encouraging the separation of vocabulary tables from the CDM schema.

@clairblacketer and @cgreich : can you please add your perspective on separating vocabulary from CDM and if you've had these types of talks before, and if it is something you'd want supported in the tools (to host a 'shared' vocabulary in a separate schema from the CDM). I'm much more comfortable moving in this direction if the custodians of the CDM/Vocabulary can offer some guisance.

Thank you.

gowthamrao commented 6 years ago

Let me try to weigh in. Imagine a scenario of an organization that has 10+ OMOP CDM instances, each with it's own schema in the database.

All OMOP CDM's where built using the same vocabulary version. In current implemention, we are forced to deploy vocabulary DDL in each one of the 10+ CDMs. That duplicates data, redundant, and makes maintenance difficult. So, why not share the vocabulary between all 10+ CDMs? That's more efficient to maintain. This PR supports the use case. Vocabularies are backward compatible. So, we can easily update the vocabulary once, and all 10 CDMs are updated.

In the rare instances if we need to maintain multiple different versions of vocabulary, then the PR supports that too.

Mismatch of vocabulary version with OMOP is not a major problem IMHO if the vocabulary versions are compatible with each other.

I am in support of this change. Makes multi CDM's easier to manage.

cgreich commented 6 years ago

Friends:

We actually have 10 different CDMs. All with their own vocabulary. Many of them the same, but some lagging behind. I am with @chrisknoll, here. Even though the deviation between releases is low, the delta will make queries come back different (=wrong). Usually not a big deal, but still.

But I have no opinion what should be mandatory or optional, and whether you have tables or views. But the vocabulary that was used to ETL the data should be visible in the same way as the data tables.

@gowthamrao: Duplication: Not a big issue. The vocabulary tables are tiny in comparison to the data tables. Ease of maintenance: What we do need is an automatic update mechanism (@pavgra has that somewhere on his backlog) and, very important, an upgrader, which can take the difference between vocabularies and adjust existing cohort definitions that might use stale vocab information. I wouldn't create an auto-upgrader for the data. Just rerun the whole ETL.

clairblacketer commented 6 years ago

I agree with @chrisknoll and @anthonysena that the vocabulary is an essential part of the cdm schema and should not be separated. The way it is set up right now helps to reduce error by keeping them linked so you don't have to remember which version of the vocabulary was used to build the cdm. Often we have vocabulary versions available that have not yet been fully adopted into our process so the most recent cdm build may have been made using a vocabulary version that is not the latest. The column vocabulary_version in the CDM_SOURCE table is supposed to have a record of the version used though I don't know how often it is reliably populated.

In short, I don't think it would be a good idea to separate the cdm from the vocabulary.