Open jacobmpeters opened 10 months ago
From Jing in MS Teams:
This is a similar issue:
[Thursday 12:53 PM] Wu, Jing (NIH/NCI) [C] HI, Nicole, and Jake, I have one question on the variable d_317093647 (How old was your mother when they were first told by a doctor or other health professional that they have or had esophageal cancer? as the conceptId.2) in module 1. This variable seems changed to be a nested variable, associated with age (206625031) or year (261863326) at diagnosis now based on the current master DD. However, neither of these nested variables has a solid entry in the module 1 version 2 or verison 1 data. All the entries by age or year are still under the variable d_317093647 in bother versions of module 1 data. Just here to make a note on this variable.
@jacobmpeters Thank you very much for assigning me with this investigation. I will check them in the module 1 data and get back to you later.
All these CID related variables have data in the module1_v2_JP (except D_527057404_D_206625031). However, these CID associated 206625031) or year (261863326) are only available in the module1_v1_JP data. Now we might combine these variables into the updated CIDs 623218391 and 802622485 if all the entries are correct.
Thanks for looking into this @jeannewu.
@jeannewu Do you think we will be able to use coalesce to combine these as well?
SELECT
-- Combine AGEs
COALESCE(D_150352141_D_623218391, D_150352141_D_206625031) AS D_150352141_D_623218391,
-- Combine YEARs
COALESCE(D_150352141_D_802622485, D_150352141_D_261863326) AS D_150352141_D_802622485,
-- same for other pairs...
FROM
FlatConnect.merged_module1
HI, Jake, since the variables in the flat data are strings, we can use coalesce (The COALESCE() function returns the first non-null value in a list.) to combine them if one of them is null.
Notes from Nicole:
We need to incorporate this coalesce() function and corrected surgery variables into the bq2 query generation. This is achievable by following the steps:
@rebexxxxxx I noticed that Jing's script has an R function related to populating generating the COALESCE(a, b) AS
a SQL statements. Perhaps you could review what she's written and see if it works for your use case as well. Alternatively, you can work on a function that suits your style and we could discuss which to use.
hi @jeannewu Where did you find the "incorrect" sub-variables for each of the surgery variables? for example, the concept ID: 150352141 has an incorrect variable: 206625031 that must be coalesced with the variable 623218391. which will look like this:
COALESCE(D_150352141_D_623218391,D_150352141_D_206625031) AS D_150352141_D_623218391
I have looked in the DD and do not see the "incorrect" variable listed for the surgery variables of interest, only the correct variables. where can i find the corresponding "incorrect" variables for each surgery variable? thanks!
@rebexxxxxx Please see my sql code here (as an example)
SELECT
D_150352141_D_623218391,
D_150352141_D_206625031,
coalesce(D_150352141_D_623218391, D_150352141_D_206625031) as combo
FROM `nih-nci-dceg-connect-prod-6d04.FlatConnect.module1_v2_JP`
@rebexxxxxx these CID associated 206625031) or year (261863326) are the deprecated ones which are shown in the most recent master DD with the combo for D_150352141. Therefore, we need to update these two sets of variables under the current CIDs (623218391 and 802622485 are correct as Jake mentioned above). Maybe you can double check them in the summary tables I sent to you today.
hi @jeannewu, where did you locate the variable D_150352141_D_206625031 and know that it was the incorrect version of D_150352141_D_623218391?
@rebexxxxxx you can check the light blue chunk in my summary sheet "M1V2", where the D_150352141_D_206625031 is shown as "not available on the combination of 15035241 with 206625031 in the current master DD, while the D_150352141_D_623218391 is well linked with conceptId.2 and conceptId.3 in the master DD.
Hi @rebexxxxxx, maybe I can help clarify this.
Kelsey originally told us about these variables not matching the data dictionary. We can query the list of variables that are associated with each surgery:
SELECT
column_name
FROM
`nih-nci-dceg-connect-prod-6d04.FlatConnect`.INFORMATION_SCHEMA.COLUMNS
WHERE
table_name = 'module1_v2_JP'
AND (
column_name LIKE '%150352141%'
OR column_name LIKE '%122887481%'
OR column_name LIKE '%534007917%'
OR column_name LIKE '%752636038%'
OR column_name LIKE '%518750011%'
OR column_name LIKE '%275770221%'
OR column_name LIKE '%527057404%'
)
We end up with this list of variables:
Row | column_name | |
---|---|---|
1 | D_122887481_D_206625031 | |
2 | D_122887481_D_261863326 | |
3 | D_122887481_D_623218391 | |
4 | D_122887481_D_802622485 | |
5 | D_150352141_D_206625031 | |
6 | D_150352141_D_261863326 | |
7 | D_150352141_D_623218391 | |
8 | D_150352141_D_802622485 | |
9 | D_275770221_D_206625031 | |
10 | D_275770221_D_261863326 | |
11 | D_275770221_D_623218391 | |
12 | D_275770221_D_802622485 | |
13 | D_518750011_D_206625031 | |
14 | D_518750011_D_261863326 | |
15 | D_518750011_D_623218391 | |
16 | D_518750011_D_802622485 | |
17 | D_527057404_D_206625031 | |
18 | D_527057404_D_261863326 | |
19 | D_527057404_D_623218391 | |
20 | D_527057404_D_802622485 | |
21 | D_534007917_D_206625031 | |
22 | D_534007917_D_261863326 | |
23 | D_534007917_D_623218391 | |
24 | D_534007917_D_802622485 | |
25 | D_752636038_D_206625031 | |
26 | D_752636038_D_261863326 | |
27 | D_752636038_D_623218391 | |
28 | D_752636038_D_802622485 |
The variables terminating in 623218391 and 802622485 are correct (i.e., they are listed in the dictionary). The variables terminating in 206625031 and 261863326 are incorrectly labelled.
Here is how they are labelled in the data dictionary: surgery_variables_harmonization_issue.xlsx
@jacobmpeters @jeannewu thank you for the clarification! i thought that each surgery variable had its own unique incorrect variables that had to be remediated. thank you!
@rebexxxxxx has a working solution. Needs to test output.
@jacobmpeters, I have just checked that only these seven pairs of CID2_CID3/CID2_CID2_CID3 need to coalesce as one based on the schema I downloaded yesterday: image
**i need to check and see if altering the variable in the 2nd half of the query affects the original module_v1/2 table or if it only affects the "transition table". after speaking with @jeannewu she thinks its a better idea for the coalesce() statements to be called in the "transition" dataset (i.e. the first half of the module 1 query). this makes a lot more sense to me. if the coalesce() statements are included in the 2nd half of the query where joins are completed, then the original module1_v2 table will be augmented, which we do not want to do. i will change my solution to this problem again and move the coalesce() statements to only affect the transition dataset generated at the beginning of the query. thanks @jeannewu !
we have a question for @jacobmpeters, is it possible for these variables to be re-written in BQ1 instead of us correcting them each time we want to update the BQ2 query/tables?
@rebexxxxxx This is a good question. I have thought a lot about this.. My current feeling is that the Flattened datasets should be a faithful representation of the upstream data in Firestore, but in flattened form. All renaming, curation, derivation, cleaning should be completed based on these flattened tables. We can certainly have a cleaned table in BQ1 on which BQ2 is based. This aligns with the diagram that I drew on the wiki.
We should discuss further during an upcoming meeting, but this is my quick answer.
@jeannewu i have looked into using the COALESCE() statements. in your example, you were concerned about the variable D_150352141_D_623218391
being overwritten in the source module 1 v2 table, however, i found that the COALESCE() function does not touch the source table, only the output table being generated:
e.g. this does not overwrite the variable in module 1 v2 table, and only augments the output we are generating:
COALESCE(v2.D_150352141_D_623218391, v2.D_150352141_D_206625031) AS D_150352141_D_623218391
In fact, i get an error when i try to use the same name of one of the coalesced variables as the alias name. the following code will not run and will produce an error:
COALESCE(v2.D_150352141_D_623218391, v2.D_150352141_D_206625031) AS v2.D_150352141_D_623218391
knowing this i think i can move forward with my fix of this issue. i will not post my solution until me you and jake meet tomorrow afternoon at 2:30 since you and i spoke about how to integrate my code with your module 1 v1+v2 merge.
@rebexxxxxx thank you very much for your updates. This is just what I am concerned with the variables naming with the same name before and after 'coalesce'. I would use a different transitional name in this data process in a transitional data and then update back with the values of this transitional variable with the targeted variable in the original table when the data process is satisfied and correct
This code is done, but i would like to review it within the context of issue #17. Since this code is only being used to clean up variables in module 1, I want to review it when the module 1 query generation file has been created since that is where this fix will be used. Once i have the module 1 sql query generation file set up, I will present my fix to this problem.
Paraphrased from Kelsey on MS Teams:
Description:
The surgery source questions are identified by the following CIDs:
The tertiary CIDs for AND are incorrect in the raw data.
623218391 and 802622485 are correct, while 206625031 and 261863326 are not.
This inconsistency was due to a harmonization issue identified too late, leading to the presence of both sets of variables in the dataset.
Action Items:
Additional Notes:
Please be aware that addressing this issue in BQ2 might help, but given that the incorrect variables are present in the raw tables, further discussions with Tony may be necessary.
Data Dictionary [surgery variables only]:
surgery_variables_harmonization_issue.xlsx
To view example of problem: