OHDSI / DataQualityDashboard

A tool to help improve data quality standards in observational data science.
https://ohdsi.github.io/DataQualityDashboard
Apache License 2.0
145 stars 96 forks source link

Incorrect quotes in CAST usage in PostgreSQL #81

Closed SulevR closed 4 years ago

SulevR commented 4 years ago

Tried to run it on PostgreSQL database, but most of the SQL scripts end up with error like 'ERROR: column "CAST(birth_datetime AS DATE)" does not exist'.

It seems that the produced SQL is not correct - uses quotes where it should not: ...select case when count("CAST(birth_datetime AS DATE)") = 0 ...

I tested that the same script without quotes works well: ...select case when count(CAST(birth_datetime AS DATE)) = 0 ...

Full stacktrace here:

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: column "CAST(birth_datetime AS DATE)" does not exist
  Position: 219

SQL:

/*********
FIELD LEVEL check:
CDM_FIELD - verify the field exists

Parameters used in this template:
cdmDatabaseSchema = ohdsi
cdmTableName = PERSON
cdmFieldName = CAST(birth_datetime AS DATE)

**********/

SELECT num_violated_rows, CASE WHEN denominator.num_rows = 0 THEN 0 ELSE 1.0*num_violated_rows/denominator.num_rows END AS pct_violated_rows
FROM
(
  select num_violated_rows from
  (
    select 
      case when count("CAST(birth_datetime AS DATE)") = 0 then 0
      else 0
    end as num_violated_rows
    from ohdsi.PERSON
  ) violated_rows
) violated_row_count,
( 
    SELECT 1 as num_rows
) denominator
;

R version:
R version 3.6.2 (2019-12-12)

Platform:
x86_64-apple-darwin15.6.0

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- magrittr (1.5)
- snow (0.4-3)
deebowden commented 4 years ago

This is occurring in Sql Server as well. SQL Server : Microsoft SQL Server 2016 (SP2-CU10) (KB4524334) - 13.0.5492.2 (X64) Oct 4 2019 19:14:08 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2016 Datacenter 10.0 (Build 14393: ) (Hypervisor)

R packages: see bottom

with the following fields: CDM_TABLE_NAME CDM_FIELD_NAME CONDITION_ERA CAST(condition_era_start_date AS DATE) CONDITION_ERA CAST(condition_era_end_date AS DATE) CONDITION_OCCURRENCE CAST(condition_start_datetime AS DATE) CONDITION_OCCURRENCE CAST(condition_end_datetime AS DATE) DEVICE_EXPOSURE CAST(device_exposure_start_datetime AS DATE) DEVICE_EXPOSURE CAST(device_exposure_end_datetime AS DATE) DOSE_ERA CAST(dose_era_start_date AS DATE) DOSE_ERA CAST(dose_era_end_date AS DATE) DRUG_ERA CAST(drug_era_start_date AS DATE) DRUG_ERA CAST(drug_era_end_date AS DATE) DRUG_EXPOSURE CAST(drug_exposure_start_datetime AS DATE) DRUG_EXPOSURE CAST(drug_exposure_end_datetime AS DATE) NOTE CAST(note_datetime AS DATE) NOTE_NLP CAST(nlp_datetime AS DATE) OBSERVATION CAST(observation_datetime AS DATE) PERSON CAST(birth_datetime AS DATE) PROCEDURE_OCCURRENCE CAST(procedure_datetime AS DATE) SPECIMEN CAST(specimen_datetime AS DATE) VISIT_DETAIL CAST(visit_detail_start_datetime AS DATE) VISIT_DETAIL CAST(visit_detail_end_datetime AS DATE) VISIT_OCCURRENCE CAST(visit_start_datetime AS DATE) VISIT_OCCURRENCE CAST(visit_end_datetime AS DATE)

Example SELECT num_violated_rows, CASE WHEN denominator.num_rows = 0 THEN 0 ELSE 1.0*num_violated_rows/denominator.num_rows END AS pct_violated_rows FROM ( select num_violated_rows from ( select case when count("CAST(observation_datetime AS DATE)") = 0 then 0 else 0 end as num_violated_rows from omop.dbo.OBSERVATION ) violated_rows ) violated_row_count, ( SELECT 1 as num_rows ) denominator ;

R Packages:

base 3.2.2 NULL
boot 1.3-17 R (>= 3.0.0), graphics, stats
class 7.3-13 R (>= 3.0.0), stats, utils
cluster 2.0.3 R (>= 2.15.0), utils
codetools 0.2-14 R (>= 2.1)
compiler 3.2.2 NULL
datasets 3.2.2 NULL
doParallel 1.0.10 R (>= 2.14.0), foreach(>= 1.2.0), iterators(>= 1.0.0),
parallel, utils NULL compiler, RUnit
doRSR 8.0.3 R (>= 2.5.0), foreach(>= 1.2.0), iterators(>= 1.0.0), RevoScaleR(>= 2.0-0), utils,
RevoUtils NULL NULL
foreach 1.4.3 R (>= 2.5.0)
foreign 0.8-65 R (>= 3.0.0)
graphics 3.2.2 NULL
grDevices 3.2.2 NULL
grid 3.2.2 NULL
iterators 1.0.8 R (>= 2.5.0), utils
KernSmooth 2.23-15 R (>= 2.5.0), stats
lattice 0.20-33 R (>= 3.0.0)
MASS 7.3-43 R (>= 3.1.0), grDevices, graphics, stats, utils
Matrix 1.2-2 R (>= 3.0.1)
methods 3.2.2 NULL
mgcv 1.8-7 R (>= 2.14.0), nlme (>= 3.1-64)
nlme 3.1-121 graphics, stats, R (>= 3.0.0)
nnet 7.3-10 R (>= 2.14.0), stats, utils
parallel 3.2.2 NULL
pkgXMLBuilder 1 NULL
RevoIOQ 8.0.3 R (>= 2.6.2), RUnit (>= 0.4.26), tools
revoIpe 1.1 NULL
RevoMods 8.0.3 R (>= 2.6.2)
RevoPemaR 8.0.3 R (>= 3.1.1), methods
RevoRpeConnector 8.0.3 NULL
RevoRsrConnector 8.0.3 NULL
RevoScaleR 8.0.3 R (>= 3.2.2)
RevoTreeView 8.0.3 NULL
RevoUtils 8.0.3 NULL
RevoUtilsMath 8.0.3 NULL
rpart 4.1-10 R (>= 2.15.0), graphics, stats, grDevices
RUnit 0.4.26 R (>= 2.5.0), utils (>= 2.5.0), methods (>= 2.5.0)
spatial 7.3-10 R (>= 3.0.0), graphics, stats, utils
splines 3.2.2 NULL
stats 3.2.2 NULL
stats4 3.2.2 NULL
survival 2.38-3 R (>= 2.13.0), graphics, stats
tcltk 3.2.2 NULL
tools 3.2.2 NULL
translations 3.2.2 NULL
utils 3.2.2 NULL
XML 3.98-1.1 R (>= 1.2.0), methods, utils
dmitryilyn commented 4 years ago

Same thing for Amazon Redshift - every check with "CAST([datetime_field] AS DATE)" fails with error column "cast([datetime_field] as date)" does not exist in [cdmTableName].

alondhe commented 4 years ago

Yeah, not sure why those quotes are there. Good catch! Fixing in my develop branch with some other fixes, aiming for next week to push to master

alondhe commented 4 years ago

Ahh, now I remember, we put the quotes because Postgres was having issues parsing count(offset) -- offset is a field name in the CDM table note_nlp

alondhe commented 4 years ago

Hacky, but I've added a Postgres only route that uses quotes. All other dialects will not use quotes.

deebowden commented 4 years ago

Thank you, we need to update this on our Broadsea docker container and remove our workarounds for SQL. Much appreciated Ajit!

clairblacketer commented 4 years ago

So I think what we may do instead is remove the cast statements. That way we should be able to keep the quotes for all dialects as it is really the cast statements that are causing the errors.

Andreyyiv commented 4 years ago

@alondhe The quotes are back. Please remove them.

alondhe commented 4 years ago

@Andreyyiv -- the quotes are fine in Postgres, it was the casting that was causing the issue. I checked this on my Postgres instance

Andreyyiv commented 4 years ago

@alondhe the quotes are causing issues in Oracle, e.g. ORA-00904: "person_id": invalid identifier. The column names in quotes should be in the upper case, e.g. "PERSON_ID". Without quotes it will match regardless of the case.

alondhe commented 4 years ago

@alondhe the quotes are causing issues in Oracle, e.g. ORA-00904: "person_id": invalid identifier. The column names in quotes should be in the upper case, e.g. "PERSON_ID". Without quotes it will match regardless of the case.

@Andreyyiv - so if we change it to use uppercase for all cdm field names in the queries, but with quotes, it'll work?

Andreyyiv commented 4 years ago

@alondhe Oracle creates objects in the upper case by default, and without quotes any case (e.g. mixed) can be used in the query. With quotes, the case must match to the table definition. In our instance the case was not explicitly specified, so either removing quotes or using upper case in quotes should work. However, if during the table creation in some other Oracle instance the specific case pattern was used like "Person_ID", the only method to specify this column in the query is "Person_ID".

alondhe commented 4 years ago

@Andreyyiv -- got it! I think for now, I will change the cdmFieldName values in the queries to be upper case to ensure it works. Can you try the develop branch?

remove.packages("DataQualityDashboard") restart R session devtools::install_github("OHDSI/DataQualityDashboard", ref = "develop")

Andreyyiv commented 4 years ago

@alondhe I tested the develop branch. The results are mixed. The Oracle errors associated with the field names are resolved, so this is an improvement. Also the log file contains several errors due to data quality (this is expected, I'll investigate and fix these outside this thread). The last record in the log file is

[Main thread] FATAL DataQualityDashboard $<- Error in ``$<-.data.frame``(``*tmp*``, "DQD_VERSION", value = "1.0.0") : replacement has 1 row, data has 0 Calls: <Anonymous> -> .summarizeResults -> $<- -> $<-.data.frame

(double `` is actually a single backtick in the original log), and the result JSON file was not generated.

alondhe commented 4 years ago

@Andreyyiv -- do you have a record in CDM_SOURCE? Because this error seems to indicate the data frame with that record came back as empty, and then tries to append to it.

Andreyyiv commented 4 years ago

@alondhe The CDM_SOURCE table was empty indeed. I inserted a row there, and now all the cdmField checks are complete. The JSON file was generated. So the issue is resolved.

Also I've noticed that this helped with the shiny page: previously both Metadata and Results sections were empty even there was the results data in the JSON file, now both sections have the expected contents.

Thank you!