OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel

Possible to Enforce Common Casing across SQL DDLs? #509

Closed TheCedarPrince closed 5 months ago

TheCedarPrince commented 2 years ago

Problem background: I recently discovered that not all SQL flavors treat column names in their tables as case insensitive. According to the OMOP CDM documentation for v5.4, it would appear that implementations across databases have uppercase table names and lowercase column names within the given tables. However, it seems that database DDLs create variable case names across database tables and column names across SQL implementation versions, therefore making this implementation aspect inconsistent. I spoke with @clairblacketer about this and she suggested I open an issue. She saw this as a valid observation because it does not enforce a specific technology but rather a specification on database creation, so it did not need to go through the formal proposal process to alter the CDM.

Use Case: I am developing a package that expects column names to be lowercase and table names to be uppercase. While building tests across different OMOP CDM databases running on different SQL flavors, I saw some tests fail because case sensitivity enforcement appeared to differ across SQL flavors for table and column names.

Proposed Solution: Across DDLs, could consistent casing be enforced when tables and their column names are created? I do not know whether any SQL flavors have special restrictions on the casing of table and column names, but if we could commit to ensuring table names are always uppercase and column names are always lowercase, that would be excellent.

Alternatives Considered: In my work, I could create workarounds for each SQL flavor, but if the DDLs change, my system would prove to be very brittle. Having an enforced casing scheme would reduce headaches and prevent casing-related breaking changes across the OHDSI ecosystem as well as in my work.

Thanks! :smile:

ablack3 commented 2 years ago

Thanks for bringing this up, @TheCedarPrince. One thing to consider is whether case sensitivity should be part of the CDM specification. BigQuery must be case sensitive, while Redshift is not case sensitive by default and might require quoting of column names to support case sensitivity. Would it be possible to support both case sensitive and case insensitive CDM implementations in a consistent way? For example, if you're using a case insensitive Redshift database, table names could be lowercase (the default), but if you're using BigQuery (case sensitive), table names need to be uppercase.

TheCedarPrince commented 2 years ago

Hey @ablack3 - this is fascinating. The only reason I did not think to propose it as a formal CDM requirement was the fact that I did not want to seem like the CDM was giving particular preference to one technology over another. Now, with what you are saying, it may require an elevation. Thinking through this a bit further, here is a draft table of SQL flavors with casings:

| SQL/Database Flavor | Default Table Casing | Default Column Casing | Available Table Casing | Available Column Casing |
|---|---|---|---|---|
| PostgreSQL | Lowercase | Lowercase | Uppercase (quoted), Lowercase | Uppercase (quoted), Lowercase |
| SQLite | Uppercase, Lowercase | Uppercase, Lowercase | Uppercase, Lowercase | Uppercase, Lowercase |
| Redshift | Lowercase | Lowercase | Uppercase (DB options), Lowercase | Uppercase (DB options), Lowercase |
| BigQuery | Lowercase | Lowercase | Uppercase (UPPER, DB options), Lowercase | Uppercase (UPPER, DB options), Lowercase |

Key: SQL/Database Flavor lists the flavors supported in OHDSI so far. Default Table Casing and Default Column Casing are how names are cased by default when these databases create them. Available Table Casing and Available Column Casing are whether and how other casings are available in each database.

If this is useful, we could keep building this table out, but at the moment it seems like lowercase is the common default across all SQL/database flavors. Talking through this more, I do agree that this is something we should establish a guideline for. Thoughts (both Adam and the CDM team)?
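To make the "Uppercase (quoted)" entry concrete, here is a minimal Python sketch of PostgreSQL-style identifier folding: unquoted identifiers are folded to lowercase, while double-quoted identifiers keep their case. The function name `fold_identifier` is hypothetical, and the sketch covers only this one folding rule; other engines in the table may behave differently.

```python
def fold_identifier(identifier: str) -> str:
    """Return the name a PostgreSQL-style engine would store for an identifier.

    Hypothetical helper for illustration only. Unquoted identifiers are
    folded to lowercase; double-quoted identifiers keep their exact case
    (with the surrounding quotes removed).
    """
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        # Inside quotes, a doubled quote ("") stands for one literal quote.
        return identifier[1:-1].replace('""', '"')
    return identifier.lower()


if __name__ == "__main__":
    # Compare unquoted and quoted spellings of the same names.
    for raw in ["PERSON", "person_id", '"PERSON"', '"Person_Id"']:
        print(f"{raw!r:>14} -> {fold_identifier(raw)!r}")
```

Under this rule, a DDL that writes `PERSON` unquoted ends up with a table stored as `person`, which is one reason an all-lowercase convention needs no quoting at all.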

mikepsinn commented 2 years ago

There's a clear downside of inconsistency with permitting uppercase. Is there any upside to uppercase that exceeds it?

TheCedarPrince commented 2 years ago

I will defer to @ablack3 here, @mikepsinn, but from what I can see, there are no negatives to enforcing lowercase naming conventions: all the flavors of SQL and other databases I have investigated support lowercase names, and uppercasing often requires more configuration than lowercase anyhow.

ablack3 commented 2 years ago

It looks to me like both the specification and the DDL use upper case for table names and lower case for field names.

> However, it seems that database DDLs create variable case names across database tables and column names across SQL implementation versions, therefore making this implementation aspect inconsistent.

Where is the inconsistency being introduced?

I'm thinking the inconsistency is introduced by OHDSI supporting both case sensitive and case insensitive databases. SQL code that works on case insensitive databases will not necessarily work on case sensitive databases. But SQL code that works on case sensitive CDMs should also work on case insensitive CDMs, right?

@TheCedarPrince - I think a concrete example of an error you're getting would be helpful.

# devtools::install_github("OHDSI/CommonDataModel")

ddl <- CommonDataModel::createDdl("5.4")

cat(ddl)
#> --@targetDialect CDM DDL Specification for OMOP Common Data Model 5.4
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.PERSON (
#>          person_id integer NOT NULL,
#>          gender_concept_id integer NOT NULL,
#>          year_of_birth integer NOT NULL,
#>          month_of_birth integer NULL,
#>          day_of_birth integer NULL,
#>          birth_datetime datetime NULL,
#>          race_concept_id integer NOT NULL,
#>          ethnicity_concept_id integer NOT NULL,
#>          location_id integer NULL,
#>          provider_id integer NULL,
#>          care_site_id integer NULL,
#>          person_source_value varchar(50) NULL,
#>          gender_source_value varchar(50) NULL,
#>          gender_source_concept_id integer NULL,
#>          race_source_value varchar(50) NULL,
#>          race_source_concept_id integer NULL,
#>          ethnicity_source_value varchar(50) NULL,
#>          ethnicity_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.OBSERVATION_PERIOD (
#>          observation_period_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          observation_period_start_date date NOT NULL,
#>          observation_period_end_date date NOT NULL,
#>          period_type_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.VISIT_OCCURRENCE (
#>          visit_occurrence_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          visit_concept_id integer NOT NULL,
#>          visit_start_date date NOT NULL,
#>          visit_start_datetime datetime NULL,
#>          visit_end_date date NOT NULL,
#>          visit_end_datetime datetime NULL,
#>          visit_type_concept_id Integer NOT NULL,
#>          provider_id integer NULL,
#>          care_site_id integer NULL,
#>          visit_source_value varchar(50) NULL,
#>          visit_source_concept_id integer NULL,
#>          admitted_from_concept_id integer NULL,
#>          admitted_from_source_value varchar(50) NULL,
#>          discharged_to_concept_id integer NULL,
#>          discharged_to_source_value varchar(50) NULL,
#>          preceding_visit_occurrence_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.VISIT_DETAIL (
#>          visit_detail_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          visit_detail_concept_id integer NOT NULL,
#>          visit_detail_start_date date NOT NULL,
#>          visit_detail_start_datetime datetime NULL,
#>          visit_detail_end_date date NOT NULL,
#>          visit_detail_end_datetime datetime NULL,
#>          visit_detail_type_concept_id integer NOT NULL,
#>          provider_id integer NULL,
#>          care_site_id integer NULL,
#>          visit_detail_source_value varchar(50) NULL,
#>          visit_detail_source_concept_id Integer NULL,
#>          admitted_from_concept_id Integer NULL,
#>          admitted_from_source_value varchar(50) NULL,
#>          discharged_to_source_value varchar(50) NULL,
#>          discharged_to_concept_id integer NULL,
#>          preceding_visit_detail_id integer NULL,
#>          parent_visit_detail_id integer NULL,
#>          visit_occurrence_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.CONDITION_OCCURRENCE (
#>          condition_occurrence_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          condition_concept_id integer NOT NULL,
#>          condition_start_date date NOT NULL,
#>          condition_start_datetime datetime NULL,
#>          condition_end_date date NULL,
#>          condition_end_datetime datetime NULL,
#>          condition_type_concept_id integer NOT NULL,
#>          condition_status_concept_id integer NULL,
#>          stop_reason varchar(20) NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          condition_source_value varchar(50) NULL,
#>          condition_source_concept_id integer NULL,
#>          condition_status_source_value varchar(50) NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.DRUG_EXPOSURE (
#>          drug_exposure_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          drug_concept_id integer NOT NULL,
#>          drug_exposure_start_date date NOT NULL,
#>          drug_exposure_start_datetime datetime NULL,
#>          drug_exposure_end_date date NOT NULL,
#>          drug_exposure_end_datetime datetime NULL,
#>          verbatim_end_date date NULL,
#>          drug_type_concept_id integer NOT NULL,
#>          stop_reason varchar(20) NULL,
#>          refills integer NULL,
#>          quantity float NULL,
#>          days_supply integer NULL,
#>          sig varchar(MAX) NULL,
#>          route_concept_id integer NULL,
#>          lot_number varchar(50) NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          drug_source_value varchar(50) NULL,
#>          drug_source_concept_id integer NULL,
#>          route_source_value varchar(50) NULL,
#>          dose_unit_source_value varchar(50) NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.PROCEDURE_OCCURRENCE (
#>          procedure_occurrence_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          procedure_concept_id integer NOT NULL,
#>          procedure_date date NOT NULL,
#>          procedure_datetime datetime NULL,
#>          procedure_end_date date NULL,
#>          procedure_end_datetime datetime NULL,
#>          procedure_type_concept_id integer NOT NULL,
#>          modifier_concept_id integer NULL,
#>          quantity integer NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          procedure_source_value varchar(50) NULL,
#>          procedure_source_concept_id integer NULL,
#>          modifier_source_value varchar(50) NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.DEVICE_EXPOSURE (
#>          device_exposure_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          device_concept_id integer NOT NULL,
#>          device_exposure_start_date date NOT NULL,
#>          device_exposure_start_datetime datetime NULL,
#>          device_exposure_end_date date NULL,
#>          device_exposure_end_datetime datetime NULL,
#>          device_type_concept_id integer NOT NULL,
#>          unique_device_id varchar(255) NULL,
#>          production_id varchar(255) NULL,
#>          quantity integer NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          device_source_value varchar(50) NULL,
#>          device_source_concept_id integer NULL,
#>          unit_concept_id integer NULL,
#>          unit_source_value varchar(50) NULL,
#>          unit_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.MEASUREMENT (
#>          measurement_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          measurement_concept_id integer NOT NULL,
#>          measurement_date date NOT NULL,
#>          measurement_datetime datetime NULL,
#>          measurement_time varchar(10) NULL,
#>          measurement_type_concept_id integer NOT NULL,
#>          operator_concept_id integer NULL,
#>          value_as_number float NULL,
#>          value_as_concept_id integer NULL,
#>          unit_concept_id integer NULL,
#>          range_low float NULL,
#>          range_high float NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          measurement_source_value varchar(50) NULL,
#>          measurement_source_concept_id integer NULL,
#>          unit_source_value varchar(50) NULL,
#>          unit_source_concept_id integer NULL,
#>          value_source_value varchar(50) NULL,
#>          measurement_event_id integer NULL,
#>          meas_event_field_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.OBSERVATION (
#>          observation_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          observation_concept_id integer NOT NULL,
#>          observation_date date NOT NULL,
#>          observation_datetime datetime NULL,
#>          observation_type_concept_id integer NOT NULL,
#>          value_as_number float NULL,
#>          value_as_string varchar(60) NULL,
#>          value_as_concept_id Integer NULL,
#>          qualifier_concept_id integer NULL,
#>          unit_concept_id integer NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          observation_source_value varchar(50) NULL,
#>          observation_source_concept_id integer NULL,
#>          unit_source_value varchar(50) NULL,
#>          qualifier_source_value varchar(50) NULL,
#>          value_source_value varchar(50) NULL,
#>          observation_event_id integer NULL,
#>          obs_event_field_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.DEATH (
#>          person_id integer NOT NULL,
#>          death_date date NOT NULL,
#>          death_datetime datetime NULL,
#>          death_type_concept_id integer NULL,
#>          cause_concept_id integer NULL,
#>          cause_source_value varchar(50) NULL,
#>          cause_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.NOTE (
#>          note_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          note_date date NOT NULL,
#>          note_datetime datetime NULL,
#>          note_type_concept_id integer NOT NULL,
#>          note_class_concept_id integer NOT NULL,
#>          note_title varchar(250) NULL,
#>          note_text varchar(MAX) NOT NULL,
#>          encoding_concept_id integer NOT NULL,
#>          language_concept_id integer NOT NULL,
#>          provider_id integer NULL,
#>          visit_occurrence_id integer NULL,
#>          visit_detail_id integer NULL,
#>          note_source_value varchar(50) NULL,
#>          note_event_id integer NULL,
#>          note_event_field_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.NOTE_NLP (
#>          note_nlp_id integer NOT NULL,
#>          note_id integer NOT NULL,
#>          section_concept_id integer NULL,
#>          snippet varchar(250) NULL,
#>          "offset" varchar(50) NULL,
#>          lexical_variant varchar(250) NOT NULL,
#>          note_nlp_concept_id integer NULL,
#>          note_nlp_source_concept_id integer NULL,
#>          nlp_system varchar(250) NULL,
#>          nlp_date date NOT NULL,
#>          nlp_datetime datetime NULL,
#>          term_exists varchar(1) NULL,
#>          term_temporal varchar(50) NULL,
#>          term_modifiers varchar(2000) NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.SPECIMEN (
#>          specimen_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          specimen_concept_id integer NOT NULL,
#>          specimen_type_concept_id integer NOT NULL,
#>          specimen_date date NOT NULL,
#>          specimen_datetime datetime NULL,
#>          quantity float NULL,
#>          unit_concept_id integer NULL,
#>          anatomic_site_concept_id integer NULL,
#>          disease_status_concept_id integer NULL,
#>          specimen_source_id varchar(50) NULL,
#>          specimen_source_value varchar(50) NULL,
#>          unit_source_value varchar(50) NULL,
#>          anatomic_site_source_value varchar(50) NULL,
#>          disease_status_source_value varchar(50) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.FACT_RELATIONSHIP (
#>          domain_concept_id_1 integer NOT NULL,
#>          fact_id_1 integer NOT NULL,
#>          domain_concept_id_2 integer NOT NULL,
#>          fact_id_2 integer NOT NULL,
#>          relationship_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.LOCATION (
#>          location_id integer NOT NULL,
#>          address_1 varchar(50) NULL,
#>          address_2 varchar(50) NULL,
#>          city varchar(50) NULL,
#>          state varchar(2) NULL,
#>          zip varchar(9) NULL,
#>          county varchar(20) NULL,
#>          location_source_value varchar(50) NULL,
#>          country_concept_id integer NULL,
#>          country_source_value varchar(80) NULL,
#>          latitude float NULL,
#>          longitude float NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CARE_SITE (
#>          care_site_id integer NOT NULL,
#>          care_site_name varchar(255) NULL,
#>          place_of_service_concept_id integer NULL,
#>          location_id integer NULL,
#>          care_site_source_value varchar(50) NULL,
#>          place_of_service_source_value varchar(50) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.PROVIDER (
#>          provider_id integer NOT NULL,
#>          provider_name varchar(255) NULL,
#>          npi varchar(20) NULL,
#>          dea varchar(20) NULL,
#>          specialty_concept_id integer NULL,
#>          care_site_id integer NULL,
#>          year_of_birth integer NULL,
#>          gender_concept_id integer NULL,
#>          provider_source_value varchar(50) NULL,
#>          specialty_source_value varchar(50) NULL,
#>          specialty_source_concept_id integer NULL,
#>          gender_source_value varchar(50) NULL,
#>          gender_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.PAYER_PLAN_PERIOD (
#>          payer_plan_period_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          payer_plan_period_start_date date NOT NULL,
#>          payer_plan_period_end_date date NOT NULL,
#>          payer_concept_id integer NULL,
#>          payer_source_value varchar(50) NULL,
#>          payer_source_concept_id integer NULL,
#>          plan_concept_id integer NULL,
#>          plan_source_value varchar(50) NULL,
#>          plan_source_concept_id integer NULL,
#>          sponsor_concept_id integer NULL,
#>          sponsor_source_value varchar(50) NULL,
#>          sponsor_source_concept_id integer NULL,
#>          family_source_value varchar(50) NULL,
#>          stop_reason_concept_id integer NULL,
#>          stop_reason_source_value varchar(50) NULL,
#>          stop_reason_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.COST (
#>          cost_id integer NOT NULL,
#>          cost_event_id integer NOT NULL,
#>          cost_domain_id varchar(20) NOT NULL,
#>          cost_type_concept_id integer NOT NULL,
#>          currency_concept_id integer NULL,
#>          total_charge float NULL,
#>          total_cost float NULL,
#>          total_paid float NULL,
#>          paid_by_payer float NULL,
#>          paid_by_patient float NULL,
#>          paid_patient_copay float NULL,
#>          paid_patient_coinsurance float NULL,
#>          paid_patient_deductible float NULL,
#>          paid_by_primary float NULL,
#>          paid_ingredient_cost float NULL,
#>          paid_dispensing_fee float NULL,
#>          payer_plan_period_id integer NULL,
#>          amount_allowed float NULL,
#>          revenue_code_concept_id integer NULL,
#>          revenue_code_source_value varchar(50) NULL,
#>          drg_concept_id integer NULL,
#>          drg_source_value varchar(3) NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.DRUG_ERA (
#>          drug_era_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          drug_concept_id integer NOT NULL,
#>          drug_era_start_date date NOT NULL,
#>          drug_era_end_date date NOT NULL,
#>          drug_exposure_count integer NULL,
#>          gap_days integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.DOSE_ERA (
#>          dose_era_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          drug_concept_id integer NOT NULL,
#>          unit_concept_id integer NOT NULL,
#>          dose_value float NOT NULL,
#>          dose_era_start_date date NOT NULL,
#>          dose_era_end_date date NOT NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.CONDITION_ERA (
#>          condition_era_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          condition_concept_id integer NOT NULL,
#>          condition_era_start_date date NOT NULL,
#>          condition_era_end_date date NOT NULL,
#>          condition_occurrence_count integer NULL );
#> 
#> --HINT DISTRIBUTE ON KEY (person_id)
#> CREATE TABLE @cdmDatabaseSchema.EPISODE (
#>          episode_id integer NOT NULL,
#>          person_id integer NOT NULL,
#>          episode_concept_id integer NOT NULL,
#>          episode_start_date date NOT NULL,
#>          episode_start_datetime datetime NULL,
#>          episode_end_date date NULL,
#>          episode_end_datetime datetime NULL,
#>          episode_parent_id integer NULL,
#>          episode_number integer NULL,
#>          episode_object_concept_id integer NOT NULL,
#>          episode_type_concept_id integer NOT NULL,
#>          episode_source_value varchar(50) NULL,
#>          episode_source_concept_id integer NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.EPISODE_EVENT (
#>          episode_id integer NOT NULL,
#>          event_id integer NOT NULL,
#>          episode_event_field_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.METADATA (
#>          metadata_id integer NOT NULL,
#>          metadata_concept_id integer NOT NULL,
#>          metadata_type_concept_id integer NOT NULL,
#>          name varchar(250) NOT NULL,
#>          value_as_string varchar(250) NULL,
#>          value_as_concept_id integer NULL,
#>          value_as_number float NULL,
#>          metadata_date date NULL,
#>          metadata_datetime datetime NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CDM_SOURCE (
#>          cdm_source_name varchar(255) NOT NULL,
#>          cdm_source_abbreviation varchar(25) NOT NULL,
#>          cdm_holder varchar(255) NOT NULL,
#>          source_description varchar(MAX) NULL,
#>          source_documentation_reference varchar(255) NULL,
#>          cdm_etl_reference varchar(255) NULL,
#>          source_release_date date NOT NULL,
#>          cdm_release_date date NOT NULL,
#>          cdm_version varchar(10) NULL,
#>          cdm_version_concept_id integer NOT NULL,
#>          vocabulary_version varchar(20) NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CONCEPT (
#>          concept_id integer NOT NULL,
#>          concept_name varchar(255) NOT NULL,
#>          domain_id varchar(20) NOT NULL,
#>          vocabulary_id varchar(20) NOT NULL,
#>          concept_class_id varchar(20) NOT NULL,
#>          standard_concept varchar(1) NULL,
#>          concept_code varchar(50) NOT NULL,
#>          valid_start_date date NOT NULL,
#>          valid_end_date date NOT NULL,
#>          invalid_reason varchar(1) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.VOCABULARY (
#>          vocabulary_id varchar(20) NOT NULL,
#>          vocabulary_name varchar(255) NOT NULL,
#>          vocabulary_reference varchar(255) NULL,
#>          vocabulary_version varchar(255) NULL,
#>          vocabulary_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.DOMAIN (
#>          domain_id varchar(20) NOT NULL,
#>          domain_name varchar(255) NOT NULL,
#>          domain_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CONCEPT_CLASS (
#>          concept_class_id varchar(20) NOT NULL,
#>          concept_class_name varchar(255) NOT NULL,
#>          concept_class_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CONCEPT_RELATIONSHIP (
#>          concept_id_1 integer NOT NULL,
#>          concept_id_2 integer NOT NULL,
#>          relationship_id varchar(20) NOT NULL,
#>          valid_start_date date NOT NULL,
#>          valid_end_date date NOT NULL,
#>          invalid_reason varchar(1) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.RELATIONSHIP (
#>          relationship_id varchar(20) NOT NULL,
#>          relationship_name varchar(255) NOT NULL,
#>          is_hierarchical varchar(1) NOT NULL,
#>          defines_ancestry varchar(1) NOT NULL,
#>          reverse_relationship_id varchar(20) NOT NULL,
#>          relationship_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CONCEPT_SYNONYM (
#>          concept_id integer NOT NULL,
#>          concept_synonym_name varchar(1000) NOT NULL,
#>          language_concept_id integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.CONCEPT_ANCESTOR (
#>          ancestor_concept_id integer NOT NULL,
#>          descendant_concept_id integer NOT NULL,
#>          min_levels_of_separation integer NOT NULL,
#>          max_levels_of_separation integer NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.SOURCE_TO_CONCEPT_MAP (
#>          source_code varchar(50) NOT NULL,
#>          source_concept_id integer NOT NULL,
#>          source_vocabulary_id varchar(20) NOT NULL,
#>          source_code_description varchar(255) NULL,
#>          target_concept_id integer NOT NULL,
#>          target_vocabulary_id varchar(20) NOT NULL,
#>          valid_start_date date NOT NULL,
#>          valid_end_date date NOT NULL,
#>          invalid_reason varchar(1) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.DRUG_STRENGTH (
#>          drug_concept_id integer NOT NULL,
#>          ingredient_concept_id integer NOT NULL,
#>          amount_value float NULL,
#>          amount_unit_concept_id integer NULL,
#>          numerator_value float NULL,
#>          numerator_unit_concept_id integer NULL,
#>          denominator_value float NULL,
#>          denominator_unit_concept_id integer NULL,
#>          box_size integer NULL,
#>          valid_start_date date NOT NULL,
#>          valid_end_date date NOT NULL,
#>          invalid_reason varchar(1) NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.COHORT (
#>          cohort_definition_id integer NOT NULL,
#>          subject_id integer NOT NULL,
#>          cohort_start_date date NOT NULL,
#>          cohort_end_date date NOT NULL );
#> 
#> --HINT DISTRIBUTE ON RANDOM
#> CREATE TABLE @cdmDatabaseSchema.COHORT_DEFINITION (
#>          cohort_definition_id integer NOT NULL,
#>          cohort_definition_name varchar(255) NOT NULL,
#>          cohort_definition_description varchar(MAX) NULL,
#>          definition_type_concept_id integer NOT NULL,
#>          cohort_definition_syntax varchar(MAX) NULL,
#>          subject_concept_id integer NOT NULL,
#>          cohort_initiation_date date NULL );

Created on 2022-08-04 by the reprex package (v2.0.1)
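One way to audit casing in a DDL like the one above is to scan it for identifiers that are not all lowercase. The following is a rough Python sketch (not R, and not part of the CommonDataModel package); `casing_report` and both regular expressions are hypothetical and assume the one-column-per-line layout shown in the output above.

```python
import re

# Matches "CREATE TABLE @cdmDatabaseSchema.PERSON (" and captures the table name.
TABLE_RE = re.compile(r"CREATE\s+TABLE\s+(?:[@\w]+\.)?(\w+)\s*\(", re.IGNORECASE)
# Matches a column line such as "person_id integer NOT NULL," (name, then type).
COLUMN_RE = re.compile(r'^\s*"?(\w+)"?\s+(\w+)')


def casing_report(ddl: str) -> dict:
    """Collect table names, column names, and type names that are not all lowercase."""
    report = {"tables": [], "columns": [], "types": []}
    in_table = False
    for line in ddl.splitlines():
        # Strip the "#> " prefix that reprex adds to printed output.
        line = re.sub(r"^#>\s?", "", line)
        if line.lstrip().startswith("--"):
            continue  # skip SQL comments such as the --HINT lines
        table = TABLE_RE.search(line)
        if table:
            in_table = True
            if table.group(1) != table.group(1).lower():
                report["tables"].append(table.group(1))
            continue
        if not in_table:
            continue
        column = COLUMN_RE.match(line)
        if column:
            name, type_name = column.groups()
            if name != name.lower():
                report["columns"].append(name)
            if type_name != type_name.lower():
                report["types"].append(f"{name} {type_name}")
        if ");" in line:
            in_table = False  # end of this CREATE TABLE statement
    return report


sample = """\
#> CREATE TABLE @cdmDatabaseSchema.VISIT_OCCURRENCE (
#>          visit_occurrence_id integer NOT NULL,
#>          visit_type_concept_id Integer NOT NULL );
"""
print(casing_report(sample))
```

On this sample the report flags the uppercase table name and the `Integer` type spelling, and no column names, which matches the observation that the DDL uses uppercase table names and lowercase field names.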

gklebanov commented 2 years ago

Personally, I think we should just explicitly state that the OMOP host database must not be enforcing case sensitivity. Once you start dictating that SQL should be case sensitive - it is a slippery slope of never ending mistakes to be made.

ablack3 commented 2 years ago

> the OMOP host database must not be enforcing case sensitivity.

This assumes that all supported database systems can be made case insensitive. That might be true but I'm not sure. BigQuery table names are always case sensitive, right?

TheCedarPrince commented 2 years ago

Hey @ablack3,

Sure! In my case, I first came across this error when working with the Eunomia test dataset. I was expecting the table names to be uppercase and the column names to be lowercase. This was not so: it was actually inverted, with tables being lowercase and columns being uppercase. I traced this back to ETL-Synthea and, sure enough, the code there was the culprit causing this behavior, even though it was written after CDM v5.3 and v5.4 came out.

Going back to reproduce an error on my side, I realized that this was more or less an error specific to a SQL tool I was using to interface with a DB and not to a specific SQL flavor, as I had thought (see photo below).

image

However, I think this inconsistency in the ecosystem is still a problem. To your point:

> This assumes that all supported database systems can be made case insensitive. That might be true but I'm not sure.

Based on the research in my previous comment, this is not so.

> Personally, I think we should just explicitly state that the OMOP host database must not be enforcing case sensitivity. Once you start dictating that SQL should be case sensitive - it is a slippery slope of never ending mistakes to be made.

I do see where your concern is coming from, @gklebanov, but in this case we are not really discussing SQL being case sensitive but rather the schema naming convention used in the OMOP CDM being case sensitive. I agree that it could be a bad path if we chose to enforce writing SQL query statements a certain way, but this is not what the original proposal was about: it's about consistency in the schema.

And after chatting with Adam and tracing back the issue, it does seem like this casing is already loosely recommended in the specification (though never explicitly required) but is not followed as consistently as I would hope, which led me to raise this issue.

ablack3 commented 2 years ago

Here's some experimentation in R with Eunomia. I think the uppercase Eunomia column names should be a simple fix somewhere in the ETL. I'm not sure where this is happening, since it looks to me like ETLSynthea is using the DDL created by the CommonDataModel package, which uses lowercase column names.

I think the trickier issue is OHDSI support for both case sensitive and case insensitive databases.

library(Eunomia)
#> Loading required package: DatabaseConnector
cd <- getEunomiaConnectionDetails()

con <- connect(cd)
#> Connecting using SQLite driver

# SQLite seems to be case insensitive
df <- dbGetQuery(con, "SELECT concept_id FROM main.CONCEPT LIMIT 5")
df2 <- dbGetQuery(con, "SELECT CONCEPT_ID FROM main.concept LIMIT 5")

all.equal(df, df2)
#> [1] TRUE

# but the names are stored as uppercase
names(df)
#> [1] "CONCEPT_ID"

df3 <- dbGetQuery(con, "SELECT CONCEPT_ID as concept_id FROM main.concept LIMIT 5")
names(df3)
#> [1] "concept_id"

Created on 2022-08-04 by the reprex package (v2.0.1)
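The same idea can be sketched in Python with the standard-library `sqlite3` module: because SQLite resolves identifiers without regard to case, a client can normalize result column names to lowercase instead of aliasing every column in SQL. The in-memory table and the `fetch_lowercase` helper are hypothetical stand-ins for Eunomia, not part of any OHDSI package.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table mimicking the uppercase column names seen in Eunomia.
con.execute("CREATE TABLE concept (CONCEPT_ID INTEGER, CONCEPT_NAME TEXT)")
con.execute("INSERT INTO concept VALUES (1, 'example')")


def fetch_lowercase(connection, sql):
    """Run a query and return rows as dicts keyed by lowercased column names."""
    cursor = connection.execute(sql)
    # cursor.description holds one tuple per result column; item 0 is its name.
    columns = [description[0].lower() for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# SQLite resolves identifiers without regard to case, so both spellings work.
rows_lower = fetch_lowercase(con, "SELECT concept_id FROM CONCEPT")
rows_upper = fetch_lowercase(con, "SELECT CONCEPT_ID FROM concept")
print(rows_lower == rows_upper)
print(rows_lower)
```

Normalizing on the client side hides the stored casing from downstream code, but it is only a workaround: on a case sensitive database the query itself would still need to use the stored spelling.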

clairblacketer commented 2 years ago

@TheCedarPrince @ablack3 I am less familiar with the implications of supporting both case sensitive and case insensitive, especially in the downstream tools, but we did discuss today in the CDM working group moving to all lowercase for both tables and columns if that would help the issue at all.

ablack3 commented 2 years ago

Thanks for bringing it up in the CDM workgroup, @clairblacketer. Would you say that there is an explicit assumption that CDM-related table and schema names on BigQuery are lowercase?

ablack3 commented 2 years ago

If a CDM database is case sensitive it would be helpful to have a convention about the case so that we all know what it should be. Can we avoid supporting both upper and lower case table/field names on case sensitive databases?

ablack3 commented 1 year ago

> we did discuss today in the CDM working group moving to all lowercase for both tables and columns if that would help the issue at all.

Hi @clairblacketer, is the plan to move to explicitly using lowercase table and field names in the CDM specification?

clairblacketer commented 1 year ago

Hi @ablack3, yes, that is the plan, though we haven't moved to implementation yet. Is there a timeline by which you all are looking to have this completed?

ablack3 commented 1 year ago

No timeline for me. I'm just trying to understand the conventions around casing on the database side and how DatabaseConnector handles casing of table and column names. I think an all lowercase convention would be helpful.

clairblacketer commented 5 months ago

This has been implemented, and now all tables and fields are lowercase.
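For anyone who wants to check an existing CDM instance against the all-lowercase convention, here is a small Python sketch for SQLite databases only. The helper `non_lowercase_names` is hypothetical; it relies on `sqlite_master` and `PRAGMA table_info`, which report names as they were declared.

```python
import sqlite3


def non_lowercase_names(connection):
    """List table and column names in a SQLite database that are not all lowercase."""
    offenders = []
    # Skip SQLite's internal tables, whose names start with "sqlite".
    tables = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        if table != table.lower():
            offenders.append(table)
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        quoted = table.replace('"', '""')
        for column in connection.execute(f'PRAGMA table_info("{quoted}")'):
            if column[1] != column[1].lower():
                offenders.append(f"{table}.{column[1]}")
    return offenders


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE person (person_id integer, year_of_birth integer)")
    con.execute("CREATE TABLE CONCEPT (CONCEPT_ID integer, concept_name text)")
    print(non_lowercase_names(con))
```

An empty list means every table and column name in the database already follows the lowercase convention.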

TheCedarPrince commented 5 months ago

Yay! Thanks @clairblacketer !