OHDSI / DataQualityDashboard

A tool to help improve data quality standards in observational data science.
https://ohdsi.github.io/DataQualityDashboard
Apache License 2.0

Can the cdmDataType handle both Integer and BigInt #413

Closed krevett-iqvia closed 1 year ago

krevett-iqvia commented 1 year ago

We ran into a problem starting in December, when we began using the new 2.0.0 DQD package. For our ETL, we had to change the cdmDatatype to BigInt in order to handle overflows that were occurring. The CDM 5.3.1 spec indicates that the cdmDatatype should be Int. Is there any way to have the cdmDatatype check handle both Int and BigInt?

clairblacketer commented 1 year ago

Hi @krevett-iqvia, for the CDM spec both INT and BIGINT are appropriate, as they are both integer datatypes. It is difficult to handle every case when it comes to the spec, so as long as the integers remain integers you are good.

krevett-iqvia commented 1 year ago

Hi Clair, thanks for the quick reply. What I am being told is that the ETL team is running into a numeric overflow condition if they use Int - that is, the range for Int is being exceeded in some cases. That is why they switched to using BigInt for their ETL work. This generates some cdmDatatype failed checks in our particular ETL. Can you recommend a way around this?

katy-sadowski commented 1 year ago

Hi @krevett-iqvia! Are you sure the check failures you're observing are coming from these bigint columns? I think the cdmDataType check should work on both int and bigint values the same way (it basically just checks that the value is a number - we're actually refining this logic in an upcoming release - PR here). It will also only run on fields marked as integer type in the threshold file.

krevett-iqvia commented 1 year ago

I believe the issue is that some tables are very large - large enough that computing the row count overflows a 32-bit integer - so the only way to do a row count on such tables is to use a BigInt. How does v2.0.0 of the DQD handle cases where the number of rows in a table exceeds an Int? I think this is the source of the problem.

krevett-iqvia commented 1 year ago

@clairblacketer @katy-sadowski @cgreich Has anyone been assigned to this ticket? It is an important ticket for us, as it creates many failed checks that ideally should not exist. Thank you very much for your prompt attention!

katy-sadowski commented 1 year ago

@krevett-iqvia I think I still need more information to help you out. To start, could you please share the names of the checks that are failing and the SQL for the failing checks (copied from the DQD UI)? If there are many similar ones, feel free to just share the SQL for a few of them. Thanks!

krevett-iqvia commented 1 year ago

@katy-sadowski Hi, I am attaching some data that demonstrates the issue we are having. The issue is that the cdmDatatype check appears to take into account the magnitude of the argument passed to it. The ETL utilizes a BigInt in the SQL to accommodate values that exceed a 32-bit integer. The cdmDatatype check fails when a value greater than 2147483647 - the max value for a signed 32-bit Int - is passed. I understood that the check type only parses its argument with a regex, checking that each element of the argument (a string of digits - I am not sure what the arg type is) is a digit. But it also appears that there is a magnitude constraint placed on the argument. I am not sure why this would be, but the results from our runs indicate that there is a magnitude constraint. I attach a document with data that demonstrates this.

[Attachment: cdmDatatype_check_error]

krevett-iqvia commented 1 year ago

[Attachment: cdmDatatype_check_error_SQL] This is the initial SQL that generated the field value causing the overflow error - see the previous image for the response.

clairblacketer commented 1 year ago

Hi @krevett-iqvia I am not sure why it would be failing on overflow. We have many databases that use bigints for identifiers and DQD runs fine. What is your DBMS?

krevett-iqvia commented 1 year ago

I believe it is a Teradata DB. This is a new issue, beginning with v2.0.0 of the DQD.

krevett-iqvia commented 1 year ago

Hi @clairblacketer, a quick question - the cdmDatatype check type has been around for some time, I believe? I initially thought it was newly introduced in v2.0.0 of the DQD package, but my mistake - it has been around a while, at least since March 2022. Has the code for this check type changed in v2.0.0 of the DQD, or has it remained the same?

clairblacketer commented 1 year ago

Hi @krevett-iqvia the code for that check has remained the same. All it is doing is a simple multiplication to make sure integer columns are truly numbers, since we cannot go into the system files for each dbms. If your db is Teradata, that is not supported by SqlRender, which could be causing this problem.

krevett-iqvia commented 1 year ago

@clairblacketer - hmm, that is strange. The problem started with the 2.0.0 version, so I assumed there was a change in the code base. Teradata is not supported! We are trying to migrate to Snowflake, which I believe is supported. So, a simple multiplication - are you multiplying the value by 1 and storing the result in a volatile variable, or in place? Is it possible to send me the code for the check type? Is it stored as an R function? I am really curious. Thank you for this info!

krevett-iqvia commented 1 year ago

Hi @katy-sadowski - I am still not clear about how we check CDM fields as being integer values using the cdmDatatype check type. Do you generically use ISNUMERIC() across all versions of SQL? The code snippet I saw where you addressed a related issue for HADES was using ISNUMERIC() - is this the current definition (field_cdm_datatype.sql) for all flavors of SQL?

katy-sadowski commented 1 year ago

@krevett-iqvia here is the SQL query for this check: https://github.com/OHDSI/DataQualityDashboard/blob/main/inst/sql/sql_server/field_cdm_datatype.sql. The ISNUMERIC will get translated according to your database's SQL dialect by SqlRender, whose translation rules can be found here: https://github.com/OHDSI/SqlRender/blob/main/inst/csv/replacementPatterns.csv
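
For illustration, here's roughly what that translation step looks like (a minimal sketch, not the exact DQD call; the schema name and the "snowflake" target dialect are placeholders, and Teradata is not a supported target):

```r
# Render the parameterized SQL, then translate the ISNUMERIC predicate used in
# field_cdm_datatype.sql into a target dialect via SqlRender.
library(SqlRender)

sql <- "SELECT person_id FROM @cdmDatabaseSchema.person WHERE ISNUMERIC(person_id) = 0;"
rendered <- render(sql, cdmDatabaseSchema = "HCUAT_OMOP")
translate(rendered, targetDialect = "snowflake")
```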

It's quite odd that your SQL is doing a cast to integer in the screenshot above (and this explains the source of your issue). A couple more questions for you:

clairblacketer commented 1 year ago

@krevett-iqvia turns out it is an ISNUMERIC function (it was a multiplication a while back, I think). This could be causing a problem if this function is not supported on bigint columns in Teradata. Likely it is a SqlRender issue, since SqlRender technically does not translate to Teradata.

krevett-iqvia commented 1 year ago

Hi @katy-sadowski - has the SQL for field_cdm_datatype.sql changed in v2.0.0? Is this the same definition that was used for the previous version (v1.4.0) of the DQD?

clairblacketer commented 1 year ago

Hi @krevett-iqvia I went back in the git history and it has been ISNUMERIC() since 2020 so it was not changed between 1.4.0 and 2.0.0.

krevett-iqvia commented 1 year ago

Hi @clairblacketer - I believe the issue is the following aspect of the field-level queries. Here is a Jan '23 ETL query:

```sql
/*********
FIELD_CDM_DATATYPE
At a minimum, for each field that is supposed to be an integer, verify it is an integer

Parameters used in this template:
cdmDatabaseSchema = HCUAT_OMOP
cdmTableName = PERSON
cdmFieldName = PERSON_ID
*********/

SELECT num_violated_rows,
  CASE WHEN denominator.num_rows = 0 THEN 0
    ELSE 1.0*num_violated_rows/denominator.num_rows
  END AS pct_violated_rows,
  denominator.num_rows AS num_denominator_rows
FROM
(
  SELECT CAST(COUNT(violated_rows.violating_field) AS BIGINT) AS num_violated_rows
  FROM
  (
    /*violatedRowsBegin*/
    SELECT 'PERSON.PERSON_ID' AS violating_field, cdmTable.*
    FROM HCUAT_OMOP.PERSON cdmTable
    WHERE CASE
        WHEN tryCast(CAST(cdmTable.PERSON_ID AS CHAR(1000)) AS INT) IS NOT NULL THEN 1
        ELSE 0
      END = 0
      AND cdmTable.PERSON_ID IS NOT NULL
    /*violatedRowsEnd*/
  ) violated_rows
) violated_row_count,
(
  SELECT CAST(COUNT(*) AS BIGINT) AS num_rows
  FROM HCUAT_OMOP.PERSON
) denominator
;
```

Here is a randomly selected pre-v2.0.0 query:

```sql
/*********
FIELD_CDM_DATATYPE
At a minimum, for each field that is supposed to be an integer, verify it is an integer

Parameters used in this template:
cdmDatabaseSchema = HCUAT_OMOP
cdmTableName = PERSON
cdmFieldName = PERSON_ID
*********/

SELECT num_violated_rows,
  CASE WHEN denominator.num_rows = 0 THEN 0
    ELSE 1.0*num_violated_rows/denominator.num_rows
  END AS pct_violated_rows,
  denominator.num_rows AS num_denominator_rows
FROM
(
  SELECT CAST(COUNT(violated_rows.violating_field) AS BIGINT) AS num_violated_rows
  FROM
  (
    /*violatedRowsBegin*/
    SELECT 'PERSON.PERSON_ID' AS violating_field, cdmTable.*
    FROM HCUAT_OMOP.PERSON cdmTable
    WHERE CASE
        WHEN tryCast(CAST(cdmTable.PERSON_ID AS CHAR(1000)) AS BIGINT) IS NOT NULL THEN 1
        ELSE 0
      END = 0
      AND cdmTable.PERSON_ID IS NOT NULL
    /*violatedRowsEnd*/
  ) violated_rows
) violated_row_count,
(
  SELECT CAST(COUNT(*) AS BIGINT) AS num_rows
  FROM HCUAT_OMOP.PERSON
) denominator
;
```

The difference is the INT versus BIGINT cast: see the tryCast(CAST(... AS CHAR(1000)) AS INT/BIGINT) expression in the two queries above. The INT cast will fail if the field has a value greater than a 32-bit signed value. The error is being assigned to the cdmDatatype check_name, which is confusing to me, as that check type would not fail - it uses ISNUMERIC() and COUNT_BIG, so it would not fail on a field value exceeding 32 bits nor on a row count exceeding 32 bits.

I hope I am describing the problem more clearly now?

katy-sadowski commented 1 year ago

Thanks @krevett-iqvia; I agree with @clairblacketer that this seems to be a SqlRender issue - the ISNUMERIC call in the base query from DQD will get translated by SqlRender depending on the database specified in the connectionDetails passed into DQD.

That said, I'm pretty baffled here as I can't spot anywhere in SqlRender code where ISNUMERIC would be translated to what's in your queries above. I'm thinking either you have an older version of SqlRender installed, or perhaps some custom implementation of SqlRender or DQD that'd be doing this translation? Are you able to see what code has been executed to run DQD - specifically wondering what the dbms param is in the connectionDetails passed into executeDqChecks. Also if you're able to figure out what version of SqlRender you're using that'd be helpful too.
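
For reference, here is roughly where those two settings live (a minimal sketch with placeholder values throughout, not your team's actual script):

```r
# Check the installed SqlRender version and the dbms passed to DQD.
library(DatabaseConnector)
library(DataQualityDashboard)

packageVersion("SqlRender")   # version used for dialect translation

connectionDetails <- createConnectionDetails(
  dbms = "snowflake",               # must be a dialect SqlRender supports
  connectionString = "<jdbc url>",  # placeholder
  user = "<user>",                  # placeholder
  password = "<password>"           # placeholder
)

executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "HCUAT_OMOP",
  resultsDatabaseSchema = "HCUAT_OMOP_RESULTS",  # placeholder
  cdmSourceName = "HCUAT",                       # placeholder
  outputFolder = "output"
)
```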

Thanks!

krevett-iqvia commented 1 year ago

Hi @katy-sadowski - I am so sorry to be a pain in the butt here. I am not the one actually running the DQD package - I only take the results and process them from a QA perspective, and I have no access to the data or the code base that is used to generate the DQD results. They are using Teradata, which I understand is not supported, so there has certainly been some bespoke code modification. Everything was fine until they switched to the 2.0.0 DQD package. In my QA analysis I see failed checks associated with the cdmDataType check, which I believe should not happen, as per my previous statements: ISNUMERIC() should not be failing. I am awaiting the details you requested and will pass them on ASAP. Thank you @clairblacketer - my apologies for being a pest!

krevett-iqvia commented 1 year ago

Hi @katy-sadowski - they are running version 1.6.4 of SqlRender, and setting the dbms = 'Teradata'

katy-sadowski commented 1 year ago

Thanks @krevett-iqvia . I checked that release of SqlRender (https://github.com/OHDSI/SqlRender/releases/tag/v1.6.4) and don't see references to teradata or the trycast function anywhere in there. I'm guessing your team must have made some modifications to the SqlRender (and maybe also DQD) code to enable the teradata translations.

Looking at it... if there's been no change to your codebase other than the update to DQD, maybe what changed was the datatype specified for some of these fields in the thresholds file (bigint --> int), which then feeds into the parameterized SQL. Maybe you all had modified the datatype for some fields to bigint in your local threshold file, and that got overwritten by the DQD default of integer?
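
In case it helps, here's a quick way to inspect the field-level threshold file that ships with DQD and compare it against any locally modified copy (a sketch; the file name and column names are assumptions based on recent releases and may differ in your version):

```r
# Load the packaged field-level threshold file and look at the expected
# datatype for each PERSON field, which drives the cdmDatatype check SQL.
thresholdFile <- system.file("csv", "OMOP_CDMv5.3_Field_Level.csv",
                             package = "DataQualityDashboard")
fieldLevel <- read.csv(thresholdFile, stringsAsFactors = FALSE)

subset(fieldLevel,
       cdmTableName == "PERSON",
       select = c(cdmFieldName, cdmDatatype))
```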

krevett-iqvia commented 1 year ago

Hi @katy-sadowski, thanks for this. So, I believe then that the default datatype for numeric data is Int in the DQD? That is interesting. Will investigate the threshold files. Thank you!!!

katy-sadowski commented 1 year ago

No problem, glad we're honing in on the issue :) The default data type for each column in the DQD threshold files should match the column's type in the CDM spec for that version of the CDM. For condition_era_id, it looks like this type is integer in the CDM spec as well as the threshold file.

revettk commented 1 year ago

Hi @katy-sadowski - you have been so helpful and patient! Thank you very much for your assistance on this ticket. I believe that we can consider this ticket closed. We run our DQD on a Snowflake instance as well as on Teradata. This is a non-issue on the Snowflake instance, so until we can modify the DQD threshold files accordingly, we can use the DQD results from the Snowflake instance. I believe that you should close this ticket!

katy-sadowski commented 1 year ago

OK, thanks for the update @krevett-iqvia ! Happy to help.