Open mnataraj92 opened 8 months ago
After fixing the mixed data types, BQ queries handling mixed data types ( eg d_187894482.number
) will break and will need updates.
Thanks, Madhuri and Warren. The BQ query updates will be trivial on my end, but we'll need to coordinate timing.
Thank you very much, Madhuri and Warren
@we-ai Here attached are copies of list of variables with mixed data types found in the datasets in dev, prod and stg. Hope these lists would be helpful. Connect_GCP_Stg_mixed_datatype_03252024.csv Connect_GCP_Prod_mixed_datatype_03252024.csv Connect_GCP_Dev_mixed_datatype_03252024.csv
Hi @we-ai, the analytics team would like to move forward the correction of mixed data types, for this prod push if possible. Please coordinate with @jacobmpeters. If the mixed data type issue isn't specific to this prod release, then we could make this issue NTR and just push the correction/s when ready.
@FrogGirl1123 The mixed data types are old problems not related to this release.
@we-ai Here attached are copies of list of variables with mixed data types found in the datasets in dev, prod and stg. Hope these lists would be helpful. Connect_GCP_Stg_mixed_datatype_03252024.csv Connect_GCP_Prod_mixed_datatype_03252024.csv Connect_GCP_Dev_mixed_datatype_03252024.csv
Thank you @jeannewu for the extensive checks. I didn't know we have mixed data types in biospecimen
and other collections.
I'll find time to clean up the mixed data types and also update progress in this thread.
Hi, @we-ai I have found that the most of datasets in GCP have just been updated, compared with the ones I checked yesterday: some deprecated variables are removed, right? e.g. d_130371375.d_266600170.d_862774033 in the participants table, other some in the survey modules. Just head up and check. Thanks a lot.
Thanks @we-ai , I support cleaning up the mixed data variables as we can and not tying to the release.
Hi @we-ai, Jake and I were looking at the QC report and noticed that for d_271757434 (Can we leave voicemail at this number?- Mobile), there are 3 unique responses in the data.
These are the 3 unique responses: "{integer:104430631,string:null,provided:integer}", "{integer:null,string:,provided:string}", "{integer:353358909,string:null,provided:integer}"
and over 25,000 participants have one of those responses for this variable. Could you please look into this? Thank you!
Hi @mnataraj92 @jeannewu, data cleanups were done today. Please check the change details in this folder, and let me know when you see errors. Hi @mnataraj92, the mixed data types for d_271757434 in prod, as mentioned above, should be fixed. But apparently there's a bug causing assigning an empty string to d_271757434. You may open an GH issue to tracking the bug fix.
Thanks @we-ai, sounds good! I'll open up a new issue for d_271757434
Mixed data types have been fixed. Fields in stage/prod "participants" collection:
Fields in stage/prod "biospecimen" collection:
@jacobmpeters You may need to adjust data flattening scripts accordingly. @jeannewu @mnataraj92 Please let me know if you see errors. Details of changes are in folders 04-23 and 05-01.
Thanks, Warren. I can update the flattening tomorrow morning once the source tables have updated.
@we-ai Thanks, Warren. I will let you know any errors ( if any) tomorrow.
@we-ai Thank you very much for your helps with clean-up these mixed datatype variables. I've gone through the unflattened data in GCP.Connect (prod, dev and stg): there are still a few mixed datatype variables left in stage: (seen `below. Thanks a lot.
@jeannewu Share the file.
@we-ai here it is. Please check it and let me know if you have any problem. Connect_GCP_UnFlatten_Stg_mixed_datatype_05022024.csv
For "bioSurvey_v1" collection, the mixed types are caused by strings and object (Record) values. For example, there're only 4 docs having "D_715581797_1_1" field as string values, while 9 docs have "D_715581797_1_1" field as object (Record) values. When stretching each of these docs to a row in BQ format, mixed data type issues occur. To clean up "bioSurvey_v1" collection in stage, we need a decision. For docs missing most of data fields (for example Connect IDs 9286750335, 9297845101 etc.), should we delete them or should we add fields with fake data to make data type consistent? @FrogGirl1123 @jacobmpeters @jeannewu
For "D_543780863" in "module2_v1" collection and "d_569151507" in "participants" collection, the data are in array format and docs with issues are not straightforward to me. Please let me know the token or Connect_ID of docs with data problem in each of the 2 collections. @jeannewu
@we-ai the mixed datatype variables are a little different from the previous version which comes with all of these components: string, integer and provided. These remnants look like with different structures, e.g. 543780863 (for Connect_ID:9941741740) is composed with more layers, not a simple mixed datatype one. | Connect_ID | d_569151507. d_506826178 | d_506826178. string | d_506826178. integer | d_506826178. provided |
---|
Sorry @we-ai I failed to paste the table with Connect_ID for those mixed datatype variable in participants table: | Connect_ID | d_569151507. d_506826178 | d_506826178. string | d_506826178. integer | d_506826178. provided |
---|
token | | | | | 6e4f8e10-817d-475d-bd6c-6b209aea7de4 | 9901325701 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer | | | | | a0e3ad5a-a1af-4f58-bda2-a2a468db5b91 | 4876829298 | list(string = "", integer = 0, provided = "string" [...] | NA | string | | | | | 41016d2d-899b-4e8a-9905-11aa01b6ca99 | 3430728271 | list(string = "", integer = 0, provided = "string" [...] | NA | string | | | | | 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer | | | | | 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | NA | 398561594 | integer | | | | |
This table is much clearer to check:
token |
Connect_ID | d_569151507. d_506826178 | d_506826178. integer | d_506826178. string | d_506826178. provided | |
---|---|---|---|---|---|---|
41016d2d-899b-4e8a-9905-11aa01b6ca99 | 3430728271 | list(string = "", integer = 0, provided = "string" [...] | NA | string | ||
6e4f8e10-817d-475d-bd6c-6b209aea7de4 | 9901325701 | list(string = NA, integer = 1.96915591347127e-315, [...] | list(string = NA, integer = 1.96915591347127e-315, [...] | 398561594 | NA | integer |
list(string = NA, integer = 1.96915591347127e-315, [...] 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | 398561594 | NA | integer 74c537f6-a57a-45d6-a3f9-c7f595d5a74e | 5935525408 | list(string = NA, integer = 1.96915591347127e-315, [...] | list(string = NA, integer = 1.96915591347127e-315, [...] | 398561594 | NA | integer list(string = NA, integer = 1.96915591347127e-315, [...] a0e3ad5a-a1af-4f58-bda2-a2a468db5b91 | 4876829298 | list(string = "", integer = 0, provided = "string" [...] | NA | | string
How should we deal with 715581797 variants in "bioSurvey_v1" collection?
And more complicated, there're mixed keys inside some of above keys:
We should have clear specs about concept ID 715581797 and format of data inside it. Current implementation need updates to get cleaner data in future. For the time being, how should we clean up these data? @FrogGirl1123 @brotzmanmj
@we-ai I will defer to @FrogGirl1123 on this. I don't have context around this. But here are my suggestions:
For the top level keys,
D_XXXXXXXXX_n_n
should become D_XXXXXXXXX_n
D_XXXXXXXXX_V2_n_n
should become D_XXXXXXXXX_V2_n
For the nested keys:
D_XXXXXXXXX_n_n
, D_XXXXXXXXX_n
should be consolidated to D_XXXXXXXXX_n
for each instance where both forms are present as long as only one of them is populated for a given participant. Hi @we-ai, I'm a little confused with the exchange. I thought the issue was having having any variable where some responses are string and others are integers. For each of these variables, the all data should be EITHER string or integer. The expected format (data type) for each variable is in the data dictionary (Char = String, Num = Integer). All variables with a ## are responses to loops, e.g. disease for each child, _1_1 is for the first child, _2_2 the second and so forth. These sets of variables should all have the same data type. For example, if D_715581797_1_1 is listed as a Num response in the DD, then D_715581797_2_2, and D_715581797_3_3, etc. would also have a Num data type. If you need any assistance finding the format/data type information in the DD, please reach out to @jacobmpeters, @jeannewu or @KELSEYDOWLING7
As far as updating the variable names, then I agree with Jake's proposed variable names.
Hi @FrogGirl1123, you're right about mixed string and integers. Those have been corrected. Unfortunately, for D_715581797_1_1, it's never used as a Num type. It has mixed strings and objects as values. Please check the data in details. And let me know whether these need corrections.
My original plan was to clean up mixed strings and integers. But since @jeannewu brought up concept IDs like D_715581797_1_1 in above comments, I had to check and provide my comments. These data have mixed strings and objects, with objects having variants of keys as mentioned above. Some of the data are confusing to me:
- Why do we need a V2 of 715581797 (eg D_715581797_V2_1_1)?
- Can V2 data be merged to V1 (eg D_715581797_1_1)?
My understanding is that these V2 tags indicate that the question was changed at some point.
- It's concise to use top level of 715581797 as D_XXXXXXXXX_n. But do we really need nested concept IDs as D_XXXXXXXXX_n? D_XXXXXXXXX should be sufficient, right?
Good point. Yes, if the nested variable has an '_n' tag, then the top level does not need one.
@FrogGirl1123 @we-ai
I noticed that each of these covid-related variables in the blood/urine
survey (aka bioSurvey_v1) are deprecated (as of 7/5/23) in the data dictionary and that similar variables are found in the COVID-19
survey. Does this mean that these data are obsolete?
Some background here might be helpful: We used to have COVID questions in the biospecimen surveys. The survey was taking too long to complete during the specimen visit. So we moved the COVID questions into a separate survey. Participants who completed the biospecimens survey before the COVID questions were moved out, still have valid data for the covid questions within the biospecimen survey. Participants who did the biospecimen survey after the covid questions were moved out to their own survey answered those questions in the separate survey. No data are obsolete. Eventually for analysis, covid questions will be combined for analysis regardless of which survey they were administered in.
@brotzmanmj Thanks for the context. That is very helpful.
For "bioSurvey_v1" collection, the mixed types are caused by strings and object (Record) values. For example, there're only 4 docs having "D_715581797_1_1" field as string values, while 9 docs have "D_715581797_1_1" field as object (Record) values. When stretching each of these docs to a row in BQ format, mixed data type issues occur. To clean up "bioSurvey_v1" collection in stage, we need a decision. For docs missing most of data fields (for example Connect IDs 9286750335, 9297845101 etc.), should we delete them or should we add fields with fake data to make data type consistent?
@we-ai If this issue is only occurring in stage
and this part of the survey is no longer in use, then I think either approach is reasonable. Deleting test participant docs to resolve the issue seems simpler to me than adding fake data.
@FrogGirl1123 Do you agree? I'm happy to jump on a call to discuss so that we can give Warren a clear decision.
@we-ai I have higher priority items today and can't jump in here as much as seems needed. Please send a 30 minute invite for you, @jacobmpeters, @brotzmanmj if she would like to join, and me to meet tomorrow after 1pm or Wed. after noon.
@we-ai If this issue is only occurring in stage and this part of the survey is no longer in use, then I think either approach is reasonable. Deleting test participant docs to resolve the issue seems simpler to me than adding fake data.
Sounds reasonable to me. I don't think I need to join a call. You can discuss and let me know your decisions @FrogGirl1123 @jacobmpeters
My original plan was to clean up mixed strings and integers. But since @jeannewu brought up concept IDs like D_715581797_1_1 in above comments, I had to check and provide my comments. These data have mixed strings and objects, with objects having variants of keys as mentioned above. Some of the data are confusing to me:
- Why do we need V2 of 715581797 (eg D_715581797_V2_1_1)?
- Can V2 data be merged to V1 (eg D_715581797_1_1)?
- It's concise to use top level of 715581797 as D_XXXXXXXXX_n. But do we really need nested concept IDs as D_XXXXXXXXX_n? D_XXXXXXXXX should be sufficient as a nested key, right?
@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed
When I checked for the mixed typed variables (string, integer and provided), in the BioSurvey_v1 data, data structures of these nested variables are holding both string, provided, and entity for each variable D_715581797_X_X. That's why I brought up to Warren to double check.
@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed
I see V2 values are strings in prod (eg "2022-05"), which make more sense to me. And V1 values also have string values in stage (eg "2020-04" ).
It's more reasonable for concept ID 715581797 to have string values (eg D_715581797_1_1: "2020-04"
for connect ID 3350065725 in stage) than object values (eg D_715581797_1_1: {D_652923023_1_1: 2022, D_701387353_1_1: 5}
for connect ID 8663702757 in prod).
We have this opportunity clean up data of concept ID 715581797 and use strings consistently (use V2 format). Or should we keep using these 2 versions as we're now?
All the string values I have seen for this set of variables are date time (yyyy/mm/dd), which matches the question for this variable root, while D_715581797_X_X_entity is nested, too
@we-ai I wouldn't recommend merging the V1 and V2 variables. The variables are often versioned because the response data type has changed
I see V2 values are strings in prod (eg "2022-05"), which make more sense to me. And V1 values also have string values in stage (eg "2020-04" ). It's more reasonable for concept ID 715581797 to have string values (eg
D_715581797_1_1: "2020-04"
for connect ID 3350065725 in stage) than object values (egD_715581797_1_1: {D_652923023_1_1: 2022, D_701387353_1_1: 5}
for connect ID 8663702757 in prod). We have this opportunity clean up data of concept ID 715581797 and use strings consistently (use V2 format). Or should we keep using these 2 versions as we're now?
I see your point. That variable went from two separate variables (one for month and one for year) to one year-month variable (v2). I'm ok with either scenario as long as the data is maintained. I don't see us doing any comparative analytics with response rate before and after the versioning
@we-ai Thanks for your work on this!
I spoke with @FrogGirl1123 and we have some decisions:
For the D_715581797_n_n
variables in the bioSurvey_v1 table in stg
, let's ensure that they all are given as objects, rather than strings. For those records that are currently strings, you can delete those fields so that the 715581797
is either absent or an object across all participants.
This will ensure that the v1 variables all have the same data type.
As far as the concerns about renaming/consolidating the variants (e.g., D_715581797_n_n
& D_715581797_n
) or consolidating v1 and v2 variables, let's save that for another issue. There are many similar issues with multiple name variants in other modules as well, so we should address those together.
@FrogGirl1123 Please confirm that this is ok with you.
Yes, @jacobmpeters and @we-ai, I confirm all of the above is alright with me.
Based on above decisions, stage bioSurvey_v1
collection in Firestore was cleaned up, with details below:
Please let me know if you see problems. @jacobmpeters @FrogGirl1123
@we-ai I checked the bioSurvey_v1
table in BQ-stg and it appears that your updates have resolved the mixed data types for the 715581797 variants.
It looks like we still have a mixed data type for D_543780863
in module2_v1 in STG.
On May 3, I replied @jeannewu and you @jacobmpeters in an email with info related to module2_v1 in stage: For module2_v1, I changed D_543780863[0] value from string to integer. But it may not be final, because I feel value of D_543780863 has mixed string and array.
Is it true that D_543780863 in module2_v1 has mixed string and array? Or there're mixed data types inside the array? Please help me with more details.
@we-ai This variable D_543780863 the module2_v1 has different structures: integer, entity shown in its provided this time. I think its entity (nested components) are similar as the previous version as "character". One participant (Connect ID:9941741740) had an integer entry for this variable different from others with the entry as "181769837". And this concept ID in this entry seems a new one compared with the previous versions based on DD. I think it might be the reason generating this mixed data type entries under this variable. it will be helpful to make D_543780863 in one unique type. All the values (absolute values) of this variable are as correct as expected based on the DD. In this scenario, this mixed data type entries under this variable might happen in prod or dev if this variable has "181769837" entry there.
In stage Firestore module2_v1, for Connect ID 9941741740, D_543780863[0] was changed back to string ("181769837"). Again, likely the issue is not solved. You can check D_543780863 in module2_v1 of stage and prod. In stage it has values of array (with possible mixed types in objects of the array), while in prod it has values of objects. I feel this should be treated as an separate issue since further checks and decisions are needed before doing cleanups. @jacobmpeters @jeannewu
@we-ai D_543780863 has no this case in either environment of "prod" or "dev".
For Connect ID 9941741740 in stage module2_v1, D_543780863[0] value was changed from string to integer on May 3, and was changed back to string earlier today. The changed data (to string) is in BQ now. You can see neither integer nor string fixes data issue for D_543780863 in module2_v1. How to fix it? @jeannewu @jacobmpeters
@we-ai here attached is a copy of the comparison of the data structures of D_543780863 in prod and stg. Maybe this comparison might be helpful. M2v1_D_543780863_mixed_data_type_comparison_prod_stg_05132024.xlsx
While doing data destruction testing, we noticed inconsistent data types in the participants table:
d_187894482_provided d_254109640_provided d_271757434_provided d_480305327_provided d_506826178_provided d_569151507_d_187894482_provided d_569151507_d_271757434_provided d_569151507_d_506826178_provided d_569151507_d_646873644_provided d_569151507_d_983278853_provided d_646873644_provided d_983278853_provided
Note from @jacobmpeters This is the issue that Warren is referring to.. For this variable (d_480305327), Firestore has both string and integer entries, so we end up with 3 variables in the flattened table for that CID. It would be best to deal with them in Firestore in my opinion.