OHDSI / DataQualityDashboard

A tool to help improve data quality standards in observational data science.
https://ohdsi.github.io/DataQualityDashboard
Apache License 2.0
145 stars 96 forks source link

Non reproducible results: 11 DQ checks belonging to the Plausible value high group produce different results when DQD run on different day #277

Open jimfrankfort opened 2 years ago

jimfrankfort commented 2 years ago

As part of QA on modifications to the DQD we compared results from the same data using the old DQD and new DQD. We were surprised to find that 11 DQ checks belonging to the Plausible value high group produced different results. The reason for the difference was that the testing was done on different days and the DQ checks reference the current date using GetDate(). See example below. We think it breaks a fundamental rule for those who build and maintain systems for others. The rule is: analysis X run on data Y should produce the same results regardless of when you run it. Clients get very upset when a given analysis run on the same data gives different results. It makes it very difficult to re-produce things not to mention wasting time. Consider replacing GetDate with a date that is inherent in the data (e.g. data period end) or a parameter.

The number and percent of records with a value in the DEVICE_EXPOSURE_END_DATE field of the DEVICE_EXPOSURE table greater than DATEADD(DD,1,GETDATE()).

SELECT num_violated_rows, CASE WHEN denominator.num_rows = 0 THEN 0 ELSE 1.0num_violated_rows/denominator.num_rows END AS pct_violated_rows, denominator.num_rows as num_denominator_rows FROM ( SELECT CAST(COUNT(violated_rows.violating_field) AS BIGINT) AS num_violated_rows FROM ( /violatedRowsBegin/ SELECT 'DEVICE_EXPOSURE.DEVICE_EXPOSURE_END_DATE' AS violating_field, cdmTable. from HCPRD_OMOP_BASE.DEVICE_EXPOSURE cdmTable where cast(cdmTable.DEVICE_EXPOSURE_END_DATE as date) > cast((CURRENT_DATE + INTERVAL '1' DAY) as date) /violatedRowsEnd/ ) violated_rows ) violated_row_count, ( SELECT CAST(COUNT(*) AS BIGINT) AS num_rows FROM HCPRD_OMOP_BASE.DEVICE_EXPOSURE cdmTable

where DEVICE_EXPOSURE_END_DATE is not null ) denominator ;

clairblacketer commented 2 years ago

I agree we should think about replacing getdate with a date that is in the data itself, but the point was to make sure that no dates are listed as happening in the future. I am not sure how that check result would change if it is run on the same data but on different days. If no records occur in the future on one run then logically no records will occur in the future when run on a later date.

clairblacketer commented 2 years ago

What if we changed this and add a parameter called "future_date" so you can either set it or go with observation_period_end_date? This is important because many vendors tell you what date your data should go through but it is often incorrect

MaximMoinat commented 2 years ago

I was running into this as well and want to add my 2 cents.

I am not sure how that check result would change if it is run on the same data but on different days. If no records occur in the future on one run then logically no records will occur in the future when run on a later date.

The reverse might actually happen; checks failing in an old run will pass in a later run because we are further into the future.

My proposal would be to use thecdm_source.source_release_date as this 'future_date' (or cdm_source.cdm_release_date). Happy to discuss in an upcoming DQD WG meet.

clairblacketer commented 2 years ago

@MaximMoinat I like that idea, I was trying to figure out the right future date to use that remains static.

clairblacketer commented 2 years ago

This should be a relatively easy check to implement for a new comer as it should really only involve changing the plausibleValueHigh in the field level control file and then testing