OHDSI / OMOP-Standardized-Vocabularies

This repository is no longer active. Its only purpose was to create releases of the Standardized Vocabularies, i.e. the content, not those of the Pallas Vocabulary Build System itself. As of 17-July-2018, vocabulary releases are also processed by Pallas. Please visit https://github.com/OHDSI/Vocabulary-v5.0/releases.

Could we introduce vocab version numbers? #5

Closed schuemie closed 6 years ago

schuemie commented 8 years ago

Instead of relying on the release dates as identifiers, could we introduce vocab version numbers?

And maybe the version number can be stored somewhere in the release itself? (Maybe add a vocab_source table?)

ericaVoss commented 8 years ago

To address your second question, the version is in the VOCABULARY table:

SELECT *
FROM VOCABULARY
WHERE VOCABULARY_ID = 'none'

cgreich commented 8 years ago

How? We already have a version number: 4.5 and 5.0. That version is for merging the Vocabularies with the right CDM.

schuemie commented 8 years ago

But those are really CDM version numbers, not vocab versions. One could even argue v4.5 or v5.0 merely indicates the format of the vocab.

We could stick to the current de facto version identifiers which are the release dates (e.g. '11-March-2016'), but it is odd that we only do that for the vocab, and all other OHDSI artifacts have regular version numbers. It is certainly not intuitive to most users.

One idea is to combine the CDM version with the vocab version into one identifier, e.g. 'V5.0.AA', 'v5.0.AB', etc.

gracebrownecodes commented 8 years ago

👍 for this idea. Specifically in the space of automating operations on and between OHDSI datasets, it would be very useful to be able to programmatically inspect (discover from the data) and compare (which is greater, are they incompatible) specific (semantic) vocabulary versions.

cgreich commented 8 years ago

Ok. We'll cook something up.

Compatibility: That's an interesting question, because generally the versions are 98% identical and compatible. But if you need a piece of information that happens to be in the other 2%, you are screwed. Not sure how to get that right.

gracebrownecodes commented 8 years ago

From my perspective, the most important question is whether all of the concept ids from a previous vocabulary version exist in a new version.

Based on that, the interpretation of the semantic versioning specification could be:

  1. MAJOR version changes when concept ids are removed, because this is a "breaking change".
  2. MINOR version changes when new concepts are added, because this is a "feature addition".
  3. PATCH version changes when existing concept records are changed, because this is a "bug fix".
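
A minimal sketch of that interpretation, assuming each release can be reduced to a mapping from concept id to some representation of the concept record. The function name `next_version` and the sample data are hypothetical and only illustrate the three rules in the list above.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]  # (MAJOR, MINOR, PATCH)


def next_version(current: Version, old: Dict[int, str], new: Dict[int, str]) -> Version:
    """Derive the next version from two releases.

    `old` and `new` map concept_id -> a string standing in for the
    concept record (an assumption made for this sketch).
    """
    major, minor, patch = current
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    changed = {cid for cid in old.keys() & new.keys() if old[cid] != new[cid]}

    if removed:   # concept ids disappeared: "breaking change"
        return (major + 1, 0, 0)
    if added:     # new concepts: "feature addition"
        return (major, minor + 1, 0)
    if changed:   # existing records edited: "bug fix"
        return (major, minor, patch + 1)
    return current


# Made-up releases: concept 2 was edited and concept 3 was added.
old_release = {1: "Aspirin", 2: "Ibuprofn"}
new_release = {1: "Aspirin", 2: "Ibuprofen", 3: "Naproxen"}
print(next_version((1, 0, 0), old_release, new_release))
```

Because the highest-ranking kind of change wins, a release that both adds and edits concepts is counted as a MINOR bump in this sketch.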

I agree with @schuemie that the CDM version numbers are more like format identifiers.

cgreich commented 8 years ago

@aaron0browne:

The concept_ids are all preserved. Except in very rare cases (egregious errors or duplications) they never die. Such a case hasn't happened yet.

But to your three suggestions: In each release there are concepts added, removed (set to invalid, not really removed) and changed. So, unless you have a good idea, your scheme would not work. It's not software.

gracebrownecodes commented 8 years ago

So then what is the 2% you referred to above? Concepts that are set to invalid?

cgreich commented 8 years ago

Yes, those, and added ones, and changed ones. The individual concepts kind of already have a version: the valid_start_date, valid_end_date and invalid_reason. But we are talking about the Vocabulary System as a whole.
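
As a rough illustration of how those three fields act as a per-concept version, the sketch below checks whether a concept was valid on a given day. The `Concept` class, the `is_valid_on` helper and the two sample records are hypothetical; only the field names come from the comment above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Concept:
    concept_id: int
    valid_start_date: date
    valid_end_date: date
    invalid_reason: Optional[str]  # None while the concept is still valid


def is_valid_on(concept: Concept, day: date) -> bool:
    """True if `day` falls inside the concept's validity window."""
    return concept.valid_start_date <= day <= concept.valid_end_date


# Made-up records: one still valid, one set to invalid at the end of 2015.
active = Concept(1, date(1970, 1, 1), date(2099, 12, 31), None)
retired = Concept(2, date(1970, 1, 1), date(2015, 12, 31), "D")

release_day = date(2016, 3, 11)
print(is_valid_on(active, release_day), is_valid_on(retired, release_day))
```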

schuemie commented 8 years ago

For me the most important thing is that there is an explicit version ID. For example, I don't know which version my friends at ErasmusMC or TMU are using (even though they're both on CDM v5), and until Erica taught me the trick of looking up the vocabulary_version using vocabulary_id = 'none' in the vocabulary table, I had no way of finding out.
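
A small sketch of that lookup, using an in-memory SQLite database as a stand-in for a real CDM instance. The helper name `get_vocabulary_version` and the reduced table layout are assumptions for illustration; the version string is the example format quoted elsewhere in this thread.

```python
import sqlite3
from typing import Optional


def get_vocabulary_version(conn: sqlite3.Connection) -> Optional[str]:
    """Return the release identifier stored in the vocabulary table, if any."""
    row = conn.execute(
        # lower() makes the match independent of how 'none' is capitalized.
        "SELECT vocabulary_version FROM vocabulary "
        "WHERE lower(vocabulary_id) = 'none'"
    ).fetchone()
    return row[0] if row else None


# Stand-in database with a reduced, assumed table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocabulary (vocabulary_id TEXT, vocabulary_version TEXT)")
conn.execute("INSERT INTO vocabulary VALUES ('None', 'V5.0 - 20160311')")
conn.execute("INSERT INTO vocabulary VALUES ('SNOMED', 'placeholder')")

print(get_vocabulary_version(conn))
```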

That being said, some updates are more profound than others. For example, from v4.4 to v4.5 (aka v5.0) the entire ICD10-to-standard (SNOMED) mapping was replaced. But that doesn't mean people shouldn't update their ETL even after a minor update. So not sure semantic versioning is needed here.

cgreich commented 8 years ago

Ok. So, I hear several complaints:

  1. There needs to be a distinct version ID. Currently it is the date. That sounds unique to me. Let me know if it is not good.
  2. There needs to be a more obvious way to find out the version. That is a good point. We should add the version to the Athena website and to the name of the zip file folks are downloading.
  3. There needs to be a way to know what changed, whether things are just updated routinely (<10% change) or whether major construction happened to a certain vocabulary. That information is in the release notes you can find in the release repo. Let me know whether you'd want a better version.

Not sure what V4.4 means. We have been on 4.5/5.0 for more than 2 years.

schuemie commented 8 years ago

  1. Fine with the date as the version ID
  2. Yes, we should make people more aware of the version they're running. Using the date as ID is actually good, because that will automatically make it clear when people are using a really old version ;-) I'm also suggesting showing the vocab version in Achilles
  3. Yes, I'm happy with the release notes

v4.4: this confusion proves my point ;-)

cgreich commented 8 years ago

You got it. Will be done.

ericaVoss commented 8 years ago

I have felt the build version with distinct version ID of a date (e.g. V5.0 - 20160311) has worked well.

But +1 on making the version more apparent in different places. Right now when I pull from ATHENA I learn the version when I open up the VOCABULARY text file - additionally it is often hidden in tools like ATLAS and ACHILLES.
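
To make such date-based identifiers comparable in code, one option is to split them into the CDM format part and the release date. The sketch below assumes the exact 'V5.0 - 20160311' layout quoted above; `VocabVersion`, `parse_vocab_version` and `is_newer` are hypothetical names, and the second date in the usage line is made up.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class VocabVersion(NamedTuple):
    cdm_format: str   # e.g. "5.0"
    released: date    # release date taken from the identifier


_PATTERN = re.compile(r"^[Vv](\d+\.\d+)\s*-\s*(\d{8})$")


def parse_vocab_version(text: str) -> VocabVersion:
    """Parse an identifier such as 'V5.0 - 20160311'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized vocabulary version: {text!r}")
    released = datetime.strptime(match.group(2), "%Y%m%d").date()
    return VocabVersion(match.group(1), released)


def is_newer(a: str, b: str) -> bool:
    """True if release `a` is more recent than release `b`."""
    va, vb = parse_vocab_version(a), parse_vocab_version(b)
    if va.cdm_format != vb.cdm_format:
        # Assumption of this sketch: only compare within one CDM format.
        raise ValueError("releases for different CDM formats are not compared")
    return va.released > vb.released


print(parse_vocab_version("V5.0 - 20160311"))
print(is_newer("V5.0 - 20160525", "V5.0 - 20160311"))
```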

pbr6cornell commented 8 years ago

When we build a CDM, we could put the vocab version in the cdm_source table and then expose all contents from this table on an opening page of Achilles...

ericaVoss commented 8 years ago

@pbr6cornell - but a CDM should have the Vocabulary used to build embedded inside it - so the information is already in the VOCABULARY table. But the addition of exposing CDM_SOURCE information is a good one.

schuemie commented 8 years ago

Hmmm, the cdm_source table indeed has a vocabulary_version field, so this is redundant with the record in the vocabulary table. But I wouldn't trust people to fill in the cdm_source table ;-)

ericaVoss commented 8 years ago

@schuemie - ha, actually you are right, when we do the build we were querying the VOCAB table to populate the column.

cgreich commented 8 years ago

Let's enforce that. Achilles could check and whine if the cdm_source doesn't contain that information.
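
A sketch of what such a check could look like. This is not how Achilles implements anything; it is an illustration against an in-memory SQLite database, with a reduced table layout and hypothetical helper names (`vocabulary_version_warnings`, `make_demo_db`).

```python
import sqlite3
from typing import List, Optional


def vocabulary_version_warnings(conn: sqlite3.Connection) -> List[str]:
    """Compare cdm_source.vocabulary_version with the vocabulary table."""
    vocab_row = conn.execute(
        "SELECT vocabulary_version FROM vocabulary "
        "WHERE lower(vocabulary_id) = 'none'"
    ).fetchone()
    source_row = conn.execute(
        "SELECT vocabulary_version FROM cdm_source"
    ).fetchone()

    vocab_version = vocab_row[0] if vocab_row else None
    source_version = source_row[0] if source_row else None

    warnings = []
    if not source_version:
        warnings.append("cdm_source.vocabulary_version is not filled in")
    elif vocab_version and source_version != vocab_version:
        warnings.append(
            f"cdm_source says {source_version!r} "
            f"but the vocabulary table says {vocab_version!r}"
        )
    return warnings


def make_demo_db(source_version: Optional[str]) -> sqlite3.Connection:
    """Build a tiny stand-in database (assumed, reduced table layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vocabulary (vocabulary_id TEXT, vocabulary_version TEXT)")
    conn.execute("CREATE TABLE cdm_source (cdm_source_name TEXT, vocabulary_version TEXT)")
    conn.execute("INSERT INTO vocabulary VALUES ('None', 'V5.0 - 20160311')")
    conn.execute("INSERT INTO cdm_source VALUES ('demo', ?)", (source_version,))
    return conn


for warning in vocabulary_version_warnings(make_demo_db(None)):
    print("WARNING:", warning)
```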

vojtechhuser commented 7 years ago

so this issue can be closed now. Who are the admins?