episphere / connect

Connect API for DCEG's Cohort Study
10 stars 5 forks source link

Mixed Data Types in Participants Table #938

Open mnataraj92 opened 6 months ago

mnataraj92 commented 6 months ago

While doing data destruction testing, we noticed inconsistent data types in the participants table:

d_187894482_provided d_254109640_provided d_271757434_provided d_480305327_provided d_506826178_provided d_569151507_d_187894482_provided d_569151507_d_271757434_provided d_569151507_d_506826178_provided d_569151507_d_646873644_provided d_569151507_d_983278853_provided d_646873644_provided d_983278853_provided

Note from @jacobmpeters This is the issue that Warren is referring to.. For this variable (d_480305327), Firestore has both string and integer entries, so we end up with 3 variables in the flattened table for that CID. It would be best to deal with them in Firestore in my opinion.

we-ai commented 4 months ago

What's the suggestion/conclusion fixing D_543780863 data in stage module2_v1? @jeannewu

jeannewu commented 4 months ago

Will you think it possible to update the D_543780863 in stg as the same structure as it in prod?

jacobmpeters commented 4 months ago

For Connect ID 9941741740 in stage module2_v1, D_543780863[0] value was changed from string to integer on May 3, and was changed back to string earlier today. The changed data (to string) is in BQ now. You can see neither integer nor string fixes data issue for D_543780863 in module2_v1. How to fix it? @jeannewu @jacobmpeters

@we-ai It looks to me like D_543780863.D_543780863 for Connect ID 9941741740 needs to be changed from a string (i.e., "181769837") to an array containing a string (i.e., ["181769837"]). This should resolve the mixed data type issue for this field.

In production, D_543780863.D_543780863 is an array of strings.

In stage, D_543780863.D_543780863 is an array of strings for 34 participants, but for a single participant (Connect_ID 9941741740) it is a string.

we-ai commented 4 months ago

@we-ai It looks to me like D_543780863.D_543780863 for Connect ID 9941741740 needs to be changed from a string (i.e., "181769837") to an array containing a string (i.e., ["181769837"]). This should resolve the mixed data type issue for this field.

In production, D_543780863.D_543780863 is an array of strings.

In stage, D_543780863.D_543780863 is an array of strings for 34 participants, but for a single participant (Connect_ID 9941741740) it is a string.

Thanks for the detailed checking @jacobmpeters. I have similar conclusions, except that I see Connect ID 9941741740 has D_543780863 as an array while others have the value as objects. This is a case of mixed array type and object type.

For stage Firestore module2_v1 Connect ID 9941741740, data was changed from D_543780863: ["181769837"] to D_543780863: {D_543780863: ["181769837"]}. The change should fix this mixed data types problem. @jacobmpeters @jeannewu

jacobmpeters commented 4 months ago

@we-ai What you said makes sense to me. I just checked and the mixed type issue for D_543780863 appears to be resolved in stg and the schema appears to match that of prod.

BEFORE (May 13, 2024 1:46 PM)

Screenshot 2024-05-13 at 1 46 28 PM

AFTER (May 13, 2024 3:08 PM)

Screenshot 2024-05-13 at 3 08 18 PM
jacobmpeters commented 4 months ago

I checked the whole database in BQ and there are no more mixed-type fields, so I believe we can close this issue now. Thanks for your work on this!

Here is the query that I have been using for checking for mixed-type fields:

## MIXED_TYPE_CHECK: 

-- Check BQ tables for a column containing the substring ".provided".
-- All mixed-type field's have a `D_XXXXXXXXXX.provided` field, 
-- so this is a good test for mixed type data.

## Check PROD ---------------------------------------------------------
SELECT table_schema, table_name, column_name
FROM `nih-nci-dceg-connect-prod-6d04.Connect.INFORMATION_SCHEMA.COLUMNS`
WHERE column_name LIKE '%.provided%';
-- 0 Results as of May 13, 4:26 PM

## Check STG ----------------------------------------------------------
SELECT table_schema, table_name, column_name
FROM `nih-nci-dceg-connect-stg-5519.Connect.INFORMATION_SCHEMA.COLUMNS`
WHERE column_name LIKE '%.provided%';
-- 0 Results as of May 13, 4:26 PM

## Check DEV ----------------------------------------------------------
SELECT table_schema, table_name, column_name
FROM `nih-nci-dceg-connect-dev.Connect.INFORMATION_SCHEMA.COLUMNS`
WHERE column_name LIKE '%.provided%'
-- 0 Results as of May 13, 4:26 PM
sonyekere commented 4 months ago

Closing issue, marked as 'done' on punchlist.

jacobmpeters commented 4 months ago

@sonyekere @we-ai

@jeannewu and I noticed that there are some mixed type variables in BQ this morning. It is not clear whether these reappeared due to an underlying coding issue or if I missed them somehow last week.

Row environment table_schema table_name column_name data_type
1 PROD Connect biospecimen d_646899796 STRUCT<integer INT64, string STRING, provided STRING>
2 PROD Connect participants undefined STRUCT<string STRING, integer INT64, provided STRING>
3 STG Connect promis_v1 D_424548783 STRUCT<integer INT64, float FLOAT64, provided STRING>
4 STG Connect promis_v1 D_697423629 STRUCT<integer INT64, float FLOAT64, provided STRING>
jacobmpeters commented 4 months ago

It looks like the promis_v1 variables are related to the PROMIS return scores so I alerted Tony on that issue(#870).

we-ai commented 4 months ago

For stage promis_v1 data, the mixed types is caused by NaN values of D_424548783 and D_697423629, in doc with Connect ID 5491106047.

For prod biospecimen data, there're 2 docs with Connect ID 6569604598. One of the docs has empty string for d_646899796. This field with empty string was deleted, and mixed data issue on this field should be fixed.

For prod participants data, the "undefined" key is clearly a result of bug(s) of mixed scenarios. There're 140 prod docs with this "undefined" key which could contain mixed concept IDs that failed during data updates. We'll need decisions on how to clean up these.

jacobmpeters commented 4 months ago

Thanks, Warren.

jacobmpeters commented 4 months ago

The issue with d_646899796 is apparently resolved. Thanks for your help, Warren.

Seems like we are all set for now!

jacobmpeters commented 4 months ago

@we-ai I ran my script to check for mixed type data and discovered these. It looks like each variable in this list contains both strings and objects.

environment table_schema table_name column_name data_type
PROD Connect module3_v1 D_633553324 { "string": "STRING", "entity": { "D_602102163": "STRING", "D_164707243": "STRING", "D_818310825": "STRING", "..." }, "provided": "STRING" }
PROD Connect module3_v1 D_470862706 { "string": "STRING", "entity": { "D_690918725": [ "STRING" ], "D_178609656": [ "STRING" ], "D_261957180": [ "STRING" ], "..." }, "provided": "STRING" }
PROD Connect covid19Survey_v1 D_749956170 { "string": "STRING", "entity": { "D_143206081": "STRING", "D_304155106": "STRING", "D_599862694": "STRING", "..." }, "provided": "STRING" }
PROD Connect covid19Survey_v1 D_114280729 { "string": "STRING", "entity": { "D_590361055": "STRING", "D_368669706": "STRING", "D_966214244": "STRING", "..." }, "provided": "STRING" }
PROD Connect module4_v1 D_440093675 { "string": "STRING", "entity": { "D_662584384": "STRING", "D_593010802": "STRING", "D_511991969": "STRING", "..." }, "provided": "STRING" }
PROD Connect module4_v1 D_679430807 { "string": "STRING", "entity": { "D_686494153": "STRING", "D_935326619": "STRING", "D_897920852": "STRING", "..." }, "provided": "STRING" }
PROD Connect module4_v1 D_786253125 { "string": "STRING", "entity": { "D_146307133": "STRING", "D_552410034": "STRING", "D_289302451": "STRING", "..." }, "provided": "STRING" }
PROD Connect module4_v1 D_135529881 { "string": "STRING", "entity": { "D_265423096": "STRING", "D_185533250": "STRING", "D_173979298": "STRING", "..." }, "provided": "STRING" }
PROD Connect module4_v1 D_968388901 { "string": "STRING", "entity": { "D_116882225": "STRING", "D_786244297": "STRING", "D_478513687": "STRING", "..." }, "provided": "STRING" }
PROD Connect participants undefined { "string": "STRING", "integer": "INT64", "provided": "STRING" }
we-ai commented 3 months ago

There should be multiple bugs underneath these mixed data. @jacobmpeters

we-ai commented 3 months ago

Does analytic team have decisions regarding how these mixed data should be fixed? @jacobmpeters @jeannewu @FrogGirl1123

jeannewu commented 3 months ago

@we-ai I will check these variables to see how they are associated with (Variable labels and participants) for the coming decision making later

jeannewu commented 3 months ago

@we-ai , I have checked these nested variables defined as a string, which are in the nested with another nested variables as a record defined as an entity. All those entries under all the nested components are all CIDs no any texts variables or numeric ones if available. So if the nested recorded variable can be changed an array of strings, it might be helpful to get all the data type come to be string.

jeannewu commented 3 months ago

@jacobmpeters Thanks a lot for your reminder, Jake. very important. I will delete them soon.

jacobmpeters commented 3 months ago

Thanks, Warren.

jeannewu commented 3 months ago

@we-ai @jacobmpeters Thank you very much for your helps. I've put all the mixed datatype data with the participants Connect_IDs in the . Please check them and let me know if anything else is needed.

jeannewu commented 3 months ago

In case, the hyperlink is not seen:

we-ai commented 3 months ago

@jeannewu 6 of the 8 shared tables have no content. For the 2 tables with thousands of rows, I don't know what I'm supposed to look at.

I'm doing checking and will let you know what I need. @jeannewu @jacobmpeters

jeannewu commented 3 months ago

@we-ai The data I pulled for these mixed typed variables were to those participants who have entries as string or entities in each mixed datatype variables. if it is empty, it means it hasn't have any solid input yet.

jeannewu commented 3 months ago
@we-ai I have put the datasets of all mixed type data in the folder you assigned to me with the most recent data. Please check them. also there are two more mixed datatype variables found in module2_v2 and module4_v1
table_name
column_name

jeannewu commented 3 months ago

Hi, Warren, Sorry for the empty datasets I put yesterday. I have uploaded the new datasets of mixed datatyped variables extracted from the source tables which might be easier to track down the issues on how those mixed datatyped variables are generated.

we-ai commented 3 months ago

Hi, Warren, Sorry for the empty datasets I put yesterday. I have uploaded the new datasets of mixed datatyped variables extracted from the source tables which might be easier to track down the issues on how those mixed datatyped variables are generated.

As mentioned above, what I need are detailed decisions about "how these mixed data should be fixed", not tables showing mixed data.

Maybe we should have a meeting for more effective communication.

we-ai commented 3 months ago

There's a "Module2" field in participants table in BQ. Is it supposed to be there? @jacobmpeters

jacobmpeters commented 3 months ago

@we-ai No Module2 should not be there. In fact, I filter it out during flattening so that the analysts do not see it. I assume it was a mishap of some kind when the tables were broken out last year.

I support the idea of meeting to discuss how to fix these mixed type data in production. I will be away next week, but we can set up a meeting for the week of June 24. I can send an invite to @we-ai and @FrogGirl1123. @jeannewu will be away that week.

we-ai commented 1 month ago

@jacobmpeters "D_749956170" data bug was fixed in issue#1049, right?

jacobmpeters commented 1 month ago

@we-ai This list of variables are the ones that still have mixed data types in production. The count indicates the number of records with this problematic data structure. The min/max start/completion times for the surveys with the mixed data types are indicated below. It seems like the last person to begin a survey and generate a mixed type record began module 4 on 08/04/2024. Does this timing correspond with the fix in https://github.com/episphere/connect/issues/1049?

Even if the underlying issue generating these mixed-type data is resolved. We still need to correct the data structure for the records that already exist.

table variable count start_min start_max completion_min completion_max
module3_v1 D_633553324 58 05/30/2024 07/30/2024 06/03/2024 07/30/2024
covid19Survey_v1 D_749956170 131 05/03/2024 07/31/2024 05/31/2024 07/31/2024
covid19Survey_v1 D_114280729 243 04/01/2024 07/31/2024 05/31/2024 07/31/2024
module4_v1 D_440093675 2434 04/16/2023 08/01/2024 05/31/2024 08/09/2024
module4_v1 D_219317801 3 06/12/2024 07/16/2024 06/12/2024 07/16/2024
module4_v1 D_679430807 83 01/26/2024 07/31/2024 06/02/2024 07/31/2024
module4_v1 D_786253125 194 08/05/2023 08/01/2024 05/31/2024 08/09/2024
module4_v1 D_135529881 50 05/30/2024 07/29/2024 06/01/2024 07/29/2024
module4_v1 D_968388901 1328 04/16/2023 08/04/2024 05/31/2024 08/06/2024
module2_v2 D_128705365 8 06/12/2024 07/31/2024 06/12/2024 07/31/2024
we-ai commented 1 month ago

@jacobmpeters July prod release was on 7-31-2024. Issues starting occurring after the prod release date could be new ones. Issues stopping occurring after the release date might have been fixed by the updates.