episphere / connect

Connect API for DCEG's Cohort Study
10 stars 5 forks source link

Mixed Data Types in Participants Table #938

Open mnataraj92 opened 3 months ago

mnataraj92 commented 3 months ago

While doing data destruction testing, we noticed inconsistent data types in the participants table:

d_187894482_provided d_254109640_provided d_271757434_provided d_480305327_provided d_506826178_provided d_569151507_d_187894482_provided d_569151507_d_271757434_provided d_569151507_d_506826178_provided d_569151507_d_646873644_provided d_569151507_d_983278853_provided d_646873644_provided d_983278853_provided

Note from @jacobmpeters This is the issue that Warren is referring to.. For this variable (d_480305327), Firestore has both string and integer entries, so we end up with 3 variables in the flattened table for that CID. It would be best to deal with them in Firestore in my opinion.

we-ai commented 3 months ago

After fixing the mixed data types, BQ queries handling mixed data types ( eg d_187894482.number) will break and will need updates.

jacobmpeters commented 3 months ago

Thanks, Madhuri and Warren. The BQ query updates will be trivial on my end, but we'll need to coordinate timing.

jeannewu commented 3 months ago

Thank you very much, Madhuri and Warren

jeannewu commented 3 months ago

@we-ai Here attached are copies of list of variables with mixed data types found in the datasets in dev, prod and stg. Hope these lists would be helpful. Connect_GCP_Stg_mixed_datatype_03252024.csv Connect_GCP_Prod_mixed_datatype_03252024.csv Connect_GCP_Dev_mixed_datatype_03252024.csv

FrogGirl1123 commented 3 months ago

Hi @we-ai, the analytics team would like to move forward the correction of mixed data types, for this prod push if possible. Please coordinate with @jacobmpeters. If the mixed data type issue isn't specific to this prod release, then we could make this issue NTR and just push the correction/s when ready.

we-ai commented 3 months ago

@FrogGirl1123 The mixed data types are old problems not related to this release.

we-ai commented 3 months ago

@we-ai Here attached are copies of list of variables with mixed data types found in the datasets in dev, prod and stg. Hope these lists would be helpful. Connect_GCP_Stg_mixed_datatype_03252024.csv Connect_GCP_Prod_mixed_datatype_03252024.csv Connect_GCP_Dev_mixed_datatype_03252024.csv

Thank you @jeannewu for the extensive checks. I didn't know we have mixed data types in biospecimen and other collections. I'll find time to clean up the mixed data types and also update progress in this thread.

jeannewu commented 3 months ago

Hi, @we-ai I have found that the most of datasets in GCP have just been updated, compared with the ones I checked yesterday: some deprecated variables are removed, right? e.g. d_130371375.d_266600170.d_862774033 in the participants table, other some in the survey modules. Just head up and check. Thanks a lot.

FrogGirl1123 commented 3 months ago

Thanks @we-ai , I support cleaning up the mixed data variables as we can and not tying to the release.

mnataraj92 commented 2 months ago

Hi @we-ai, Jake and I were looking at the QC report and noticed that for d_271757434 (Can we leave voicemail at this number?- Mobile), there are 3 unique responses in the data.

These are the 3 unique responses: "{integer:104430631,string:null,provided:integer}", "{integer:null,string:,provided:string}", "{integer:353358909,string:null,provided:integer}"

and over 25,000 participants have one of those responses for this variable. Could you please look into this? Thank you!

we-ai commented 2 months ago

Hi @mnataraj92 @jeannewu, data cleanups were done today. Please check the change details in this folder, and let me know when you see errors. Hi @mnataraj92, the mixed data types for d_271757434 in prod, as mentioned above, should be fixed. But apparently there's a bug causing assigning an empty string to d_271757434. You may open an GH issue to tracking the bug fix.

mnataraj92 commented 2 months ago

Thanks @we-ai, sounds good! I'll open up a new issue for d_271757434

we-ai commented 2 months ago

Mixed data types have been fixed. Fields in stage/prod "participants" collection:

Fields in stage/prod "biospecimen" collection:

@jacobmpeters You may need to adjust data flattening scripts accordingly. @jeannewu @mnataraj92 Please let me know if you see errors. Details of changes are in folders 04-23 and 05-01.

jacobmpeters commented 2 months ago

Thanks, Warren. I can update the flattening tomorrow morning once the source tables have updated.

jeannewu commented 2 months ago

@we-ai Thanks, Warren. I will let you know any errors ( if any) tomorrow.

jeannewu commented 2 months ago

@we-ai Thank you very much for your helps with clean-up these mixed datatype variables. I've gone through the unflattened data in GCP.Connect (prod, dev and stg): there are still a few mixed datatype variables left in stage: (seen `belowimage. Thanks a lot.

we-ai commented 2 months ago

@jeannewu Share the file.

jeannewu commented 2 months ago

@we-ai here it is. Please check it and let me know if you have any problem. Connect_GCP_UnFlatten_Stg_mixed_datatype_05022024.csv

we-ai commented 2 months ago

For "bioSurvey_v1" collection, the mixed types are caused by strings and object (Record) values. For example, there're only 4 docs having "D_715581797_1_1" field as string values, while 9 docs have "D_715581797_1_1" field as object (Record) values. When stretching each of these docs to a row in BQ format, mixed data type issues occur. To clean up "bioSurvey_v1" collection in stage, we need a decision. For docs missing most of data fields (for example Connect IDs 9286750335, 9297845101 etc.), should we delete them or should we add fields with fake data to make data type consistent? @FrogGirl1123 @jacobmpeters @jeannewu

For "D_543780863" in "module2_v1" collection and "d_569151507" in "participants" collection, the data are in array format and docs with issues are not straightforward to me. Please let me know the token or Connect_ID of docs with data problem in each of the 2 collections. @jeannewu

jeannewu commented 2 months ago
@we-ai the mixed datatype variables are a little different from the previous version which comes with all of these components: string, integer and provided. These remnants look like with different structures, e.g. 543780863 (for Connect_ID:9941741740) is composed with more layers, not a simple mixed datatype one.
token
Connect_ID d_569151507. d_506826178 d_506826178. string d_506826178. integer d_506826178. provided

jeannewu commented 2 months ago
Sorry @we-ai I failed to paste the table with Connect_ID for those mixed datatype variable in participants table:
 
Connect_ID d_569151507. d_506826178 d_506826178. string d_506826178. integer d_506826178. provided

token   |   |   |   |   |   6e4f8e10-817d-475d-bd6c-6b209aea7de4 | 9901325701 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer   |   |   |   |   |     a0e3ad5a-a1af-4f58-bda2-a2a468db5b91 | 4876829298 | list(string = "", integer = 0, provided = "string" [...] | NA | string   |   |   |   |   |     41016d2d-899b-4e8a-9905-11aa01b6ca99 | 3430728271 | list(string = "", integer = 0, provided = "string" [...] | NA | string   |   |   |   |   |     74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer   |   |   |   |   |     74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer   |   |   |   |   |    

jeannewu commented 2 months ago
This table is much clearer to check:
token
Connect_ID d_569151507. d_506826178 d_506826178. integer d_506826178. string d_506826178. provided
41016d2d-899b-4e8a-9905-11aa01b6ca99 3430728271 list(string = "", integer = 0, provided = "string" [...] NA   string
6e4f8e10-817d-475d-bd6c-6b209aea7de4 9901325701 list(string = NA, integer = 1.96915591347127e-315, [...] list(string = NA, integer = 1.96915591347127e-315, [...] 398561594 NA integer

list(string = NA, integer = 1.96915591347127e-315, [...] 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | 398561594 | NA | integer 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | list(string = NA, integer = 1.96915591347127e-315, [...] | 398561594 | NA | integer list(string = NA, integer = 1.96915591347127e-315, [...] a0e3ad5a-a1af-4f58-bda2-a2a468db5b91 | 4876829298 | list(string = "", integer = 0, provided = "string" [...] | NA |   | string

we-ai commented 2 months ago

How should we deal with 715581797 variants in "bioSurvey_v1" collection?

And more complicated, there're mixed keys inside some of above keys:

We should have clear specs about concept ID 715581797 and format of data inside it. Current implementation need updates to get cleaner data in future. For the time being, how should we clean up these data? @FrogGirl1123 @brotzmanmj

jacobmpeters commented 2 months ago

@we-ai I will defer to @FrogGirl1123 on this. I don't have context around this. But here are my suggestions:

For the top level keys,

For the nested keys:

FrogGirl1123 commented 2 months ago

Hi @we-ai, I'm a little confused with the exchange. I thought the issue was having having any variable where some responses are string and others are integers. For each of these variables, the all data should be EITHER string or integer. The expected format (data type) for each variable is in the data dictionary (Char = String, Num = Integer). All variables with a ## are responses to loops, e.g. disease for each child, _1_1 is for the first child, _2_2 the second and so forth. These sets of variables should all have the same data type. For example, if D_715581797_1_1 is listed as a Num response in the DD, then D_715581797_2_2, and D_715581797_3_3, etc. would also have a Num data type. If you need any assistance finding the format/data type information in the DD, please reach out to @jacobmpeters, @jeannewu or @KELSEYDOWLING7

As far as updating the variable names, then I agree with Jake's proposed variable names.

we-ai commented 2 months ago

Hi @FrogGirl1123, you're right about mixed string and integers. Those have been corrected. Unfortunately, for D_715581797_1_1, it's never used as a Num type. It has mixed strings and objects as values. Please check the data in details. And let me know whether these need corrections.

we-ai commented 2 months ago

My original plan was to clean up mixed strings and integers. But since @jeannewu brought up concept IDs like D_715581797_1_1 in above comments, I had to check and provide my comments. These data have mixed strings and objects, with objects having variants of keys as mentioned above. Some of the data are confusing to me:

jacobmpeters commented 2 months ago
  • Why do we need a V2 of 715581797 (eg D_715581797_V2_1_1)?
  • Can V2 data be merged to V1 (eg D_715581797_1_1)?

My understanding is that these V2 tags indicate that the question was changed at some point.

  • It's concise to use top level of 715581797 as D_XXXXXXXXX_n. But do we really need nested concept IDs as D_XXXXXXXXX_n? D_XXXXXXXXX should be sufficient, right?

Good point. Yes, if the nested variable has an '_n' tag, then the top level does not need one.

jacobmpeters commented 2 months ago

@FrogGirl1123 @we-ai

I noticed that each of these covid-related variables in the blood/urine survey (aka bioSurvey_v1) are deprecated (as of 7/5/23) in the data dictionary and that similar variables are found in the COVID-19 survey. Does this mean that these data are obsolete?

Screenshot 2024-05-06 at 1 04 39 PM
brotzmanmj commented 2 months ago

Some background here might be helpful: We used to have COVID questions in the biospecimen surveys. The survey was taking too long to complete during the specimen visit. So we moved the COVID questions into a separate survey. Participants who completed the biospecimens survey before the COVID questions were moved out, still have valid data for the covid questions within the biospecimen survey. Participants who did the biospecimen survey after the covid questions were moved out to their own survey answered those questions in the separate survey. No data are obsolete. Eventually for analysis, covid questions will be combined for analysis regardless of which survey they were administered in.

jacobmpeters commented 2 months ago

@brotzmanmj Thanks for the context. That is very helpful.

jacobmpeters commented 1 month ago

For "bioSurvey_v1" collection, the mixed types are caused by strings and object (Record) values. For example, there're only 4 docs having "D_715581797_1_1" field as string values, while 9 docs have "D_715581797_1_1" field as object (Record) values. When stretching each of these docs to a row in BQ format, mixed data type issues occur. To clean up "bioSurvey_v1" collection in stage, we need a decision. For docs missing most of data fields (for example Connect IDs 9286750335, 9297845101 etc.), should we delete them or should we add fields with fake data to make data type consistent?

@we-ai If this issue is only occurring in stage and this part of the survey is no longer in use, then I think either approach is reasonable. Deleting test participant docs to resolve the issue seems simpler to me than adding fake data.

@FrogGirl1123 Do you agree? I'm happy to jump on a call to discuss so that we can give Warren a clear decision.

FrogGirl1123 commented 1 month ago

@we-ai I have higher priority items today and can't jump in here as much as seems needed. Please send a 30 minute invite for you, @jacobmpeters, @brotzmanmj if she would like to join, and me to meet tomorrow after 1pm or Wed. after noon.

we-ai commented 1 month ago

@we-ai If this issue is only occurring in stage and this part of the survey is no longer in use, then I think either approach is reasonable. Deleting test participant docs to resolve the issue seems simpler to me than adding fake data.

Sounds reasonable to me. I don't think I need to join a call. You can discuss and let me know your decisions @FrogGirl1123 @jacobmpeters

KELSEYDOWLING7 commented 1 month ago

My original plan was to clean up mixed strings and integers. But since @jeannewu brought up concept IDs like D_715581797_1_1 in above comments, I had to check and provide my comments. These data have mixed strings and objects, with objects having variants of keys as mentioned above. Some of the data are confusing to me:

  • Why do we need V2 of 715581797 (eg D_715581797_V2_1_1)?
  • Can V2 data be merged to V1 (eg D_715581797_1_1)?
  • It's concise to use top level of 715581797 as D_XXXXXXXXX_n. But do we really need nested concept IDs as D_XXXXXXXXX_n? D_XXXXXXXXX should be sufficient as a nested key, right?

@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed

jeannewu commented 1 month ago

When I checked for the mixed typed variables (string, integer and provided), in the BioSurvey_v1 data, data structures of these nested variables are holding both string, provided, and entity for each variable D_715581797_X_X. That's why I brought up to Warren to double check.

we-ai commented 1 month ago

@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed

I see V2 values are strings in prod (eg "2022-05"), which make more sense to me. And V1 values also have string values in stage (eg "2020-04" ). It's more reasonable for concept ID 715581797 to have string values (eg D_715581797_1_1: "2020-04" for connect ID 3350065725 in stage) than object values (eg D_715581797_1_1: {D_652923023_1_1: 2022, D_701387353_1_1: 5} for connect ID 8663702757 in prod). We have this opportunity clean up data of concept ID 715581797 and use strings consistently (use V2 format). Or should we keep using these 2 versions as we're now?

jeannewu commented 1 month ago

All the string values I have seen for this set of variables are date time (yyyy/mm/dd), which matches the question for this variable root, while D_715581797_X_X_entity is nested, too

KELSEYDOWLING7 commented 1 month ago

@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed

I see V2 values are strings in prod (eg "2022-05"), which make more sense to me. And V1 values also have string values in stage (eg "2020-04" ). It's more reasonable for concept ID 715581797 to have string values (eg D_715581797_1_1: "2020-04" for connect ID 3350065725 in stage) than object values (eg D_715581797_1_1: {D_652923023_1_1: 2022, D_701387353_1_1: 5} for connect ID 8663702757 in prod). We have this opportunity clean up data of concept ID 715581797 and use strings consistently (use V2 format). Or should we keep using these 2 versions as we're now?

I see your point. That variable went from two separate variables (one for month and one for year) to one year-month variable (v2). I'm ok with either scenario as long as the data is maintained. I don't see us doing any comparative analytics with response rate before and after the versioning

jacobmpeters commented 1 month ago

@we-ai Thanks for your work on this!

I spoke with @FrogGirl1123 and we have some decisions:

For the D_715581797_n_n variables in the bioSurvey_v1 table in stg, let's ensure that they all are given as objects, rather than strings. For those records that are currently strings, you can delete those fields so that the 715581797 is either absent or an object across all participants.

This will ensure that the v1 variables all have the same data type.

As far as the concerns about renaming/consolidating the variants (e.g., D_715581797_n_n & D_715581797_n) or consolidating v1 and v2 variables, let's save that for another issue. There are many similar issues with multiple name variants in other modules as well, so we should address those together.

@FrogGirl1123 Please confirm that this is ok with you.

FrogGirl1123 commented 1 month ago

Yes, @jacobmpeters and @we-ai, I confirm all of the above is alright with me.

we-ai commented 1 month ago

Based on above decisions, stage bioSurvey_v1 collection in Firestore was cleaned up, with details below:

Please let me know if you see problems. @jacobmpeters @FrogGirl1123

jacobmpeters commented 1 month ago

@we-ai I checked the bioSurvey_v1 table in BQ-stg and it appears that your updates have resolved the mixed data types for the 715581797 variants.

jacobmpeters commented 1 month ago

It looks like we still have a mixed data type for D_543780863 in module2_v1 in STG.

we-ai commented 1 month ago

On May 3, I replied @jeannewu and you @jacobmpeters in an email with info related to module2_v1 in stage: For module2_v1, I changed D_543780863[0] value from string to integer. But it may not be final, because I feel value of D_543780863 has mixed string and array. Is it true that D_543780863 in module2_v1 has mixed string and array? Or there're mixed data types inside the array? Please help me with more details.

jeannewu commented 1 month ago

@we-ai This variable D_543780863 the module2_v1 has different structures: integer, entity shown in its provided this time. I think its entity (nested components) are similar as the previous version as "character". One participant (Connect ID:9941741740) had an integer entry for this variable different from others with the entry as "181769837". And this concept ID in this entry seems a new one compared with the previous versions based on DD. I think it might be the reason generating this mixed data type entries under this variable. it will be helpful to make D_543780863 in one unique type. All the values (absolute values) of this variable are as correct as expected based on the DD. In this scenario, this mixed data type entries under this variable might happen in prod or dev if this variable has "181769837" entry there.

we-ai commented 1 month ago

In stage Firestore module2_v1, for Connect ID 9941741740, D_543780863[0] was changed back to string ("181769837"). Again, likely the issue is not solved. You can check D_543780863 in module2_v1 of stage and prod. In stage it has values of array (with possible mixed types in objects of the array), while in prod it has values of objects. I feel this should be treated as an separate issue since further checks and decisions are needed before doing cleanups. @jacobmpeters @jeannewu

jeannewu commented 1 month ago

@we-ai D_543780863 has no this case in either environment of "prod" or "dev".

we-ai commented 1 month ago

For Connect ID 9941741740 in stage module2_v1, D_543780863[0] value was changed from string to integer on May 3, and was changed back to string earlier today. The changed data (to string) is in BQ now. You can see neither integer nor string fixes data issue for D_543780863 in module2_v1. How to fix it? @jeannewu @jacobmpeters

jeannewu commented 1 month ago

@we-ai here attached is a copy of the comparison of the data structures of D_543780863 in prod and stg. Maybe this comparison might be helpful. M2v1_D_543780863_mixed_data_type_comparison_prod_stg_05132024.xlsx