Analyticsphere / bq2

SQL query development for Connect for Cancer Preventions' BQ2 database.
https://analyticsphere.github.io/bq2/
0 stars 1 forks source link

Correct CIDs for Module 1 Surgery Variables #4

Open jacobmpeters opened 10 months ago

jacobmpeters commented 10 months ago

Paraphrased from Kelsey on MS Teams:

Description:

The surgery source questions are identified by the following CIDs:

The tertiary CIDs for AND are incorrect in the raw data.

623218391 and 802622485 are correct, while 206625031 and 261863326 are not.

This inconsistency was due to a harmonization issue identified too late, leading to the presence of both sets of variables in the dataset.

Action Items:

Additional Notes:

Please be aware that addressing this issue in BQ2 might help, but given that the incorrect variables are present in the raw tables, further discussions with Tony may be necessary.

Data Dictionary [surgery variables only]:

surgery_variables_harmonization_issue.xlsx

To view example of problem:

SELECT
  D_150352141_D_623218391,
  D_150352141_D_206625031
FROM 
  `FlatConnect.merged_module1`
WHERE
  D_150352141_D_623218391 IS NOT NULL
  OR D_150352141_D_206625031 IS NOT NULL
jacobmpeters commented 10 months ago

From Jing in MS Teams:

This is a similar issue:

[Thursday 12:53 PM] Wu, Jing (NIH/NCI) [C] HI, Nicole, and Jake, I have one question on the variable d_317093647 (How old was your mother when they were first told by a doctor or other health professional that they have or had esophageal cancer? as the conceptId.2) in module 1. This variable seems changed to be a nested variable, associated with age (206625031) or year (261863326) at diagnosis now based on the current master DD. However, neither of these nested variables has a solid entry in the module 1 version 2 or verison 1 data. All the entries by age or year are still under the variable d_317093647 in bother versions of module 1 data. Just here to make a note on this variable.

jeannewu commented 10 months ago

@jacobmpeters Thank you very much for assigning me with this investigation. I will check them in the module 1 data and get back to you later.

jeannewu commented 10 months ago

All these CID related variables have data in the module1_v2_JP (except D_527057404_D_206625031). However, these CID associated 206625031) or year (261863326) are only available in the module1_v1_JP data. Now we might combine these variables into the updated CIDs 623218391 and 802622485 if all the entries are correct.

jacobmpeters commented 10 months ago

Thanks for looking into this @jeannewu.

jacobmpeters commented 10 months ago

@jeannewu Do you think we will be able to use coalesce to combine these as well?

SELECT
     -- Combine AGEs
     COALESCE(D_150352141_D_623218391, D_150352141_D_206625031) AS D_150352141_D_623218391,
     -- Combine YEARs
     COALESCE(D_150352141_D_802622485, D_150352141_D_261863326) AS D_150352141_D_802622485,
     -- same for other pairs...
FROM 
     FlatConnect.merged_module1
jeannewu commented 10 months ago

HI, Jake, since the variables in the flat data are strings, we can use coalesce (The COALESCE() function returns the first non-null value in a list.) to combine them if one of them is null.

jacobmpeters commented 10 months ago

Notes from Nicole:

rebexxxxxx commented 10 months ago

We need to incorporate this coalesce() function and corrected surgery variables into the bq2 query generation. This is achievable by following the steps:

jacobmpeters commented 10 months ago

@rebexxxxxx I noticed that Jing's script has an R function related to populating generating the COALESCE(a, b) AS a SQL statements. Perhaps you could review what she's written and see if it works for your use case as well. Alternatively, you can work on a function that suits your style and we could discuss which to use.

https://github.com/jeannewu/Jeanconnect/blob/ac4d09b26bcd65e49a18c3261b89654174d9126a/M1_variables_duplicates_checks_12142023.Rmd#L575-L602

rebexxxxxx commented 10 months ago

hi @jeannewu Where did you find the "incorrect" sub-variables for each of the surgery variables? for example, the concept ID: 150352141 has an incorrect variable: 206625031 that must be coalesced with the variable 623218391. which will look like this:

COALESCE(D_150352141_D_623218391,D_150352141_D_206625031) AS D_150352141_D_623218391

I have looked in the DD and do not see the "incorrect" variable listed for the surgery variables of interest, only the correct variables. where can i find the corresponding "incorrect" variables for each surgery variable? thanks!

jeannewu commented 10 months ago

@rebexxxxxx Please see my sql code here (as an example)

SELECT 
   D_150352141_D_623218391, 
   D_150352141_D_206625031,  
   coalesce(D_150352141_D_623218391, D_150352141_D_206625031) as combo 
FROM `nih-nci-dceg-connect-prod-6d04.FlatConnect.module1_v2_JP`
jeannewu commented 10 months ago

@rebexxxxxx these CID associated 206625031) or year (261863326) are the deprecated ones which are shown in the most recent master DD with the combo for D_150352141. Therefore, we need to update these two sets of variables under the current CIDs (623218391 and 802622485 are correct as Jake mentioned above). Maybe you can double check them in the summary tables I sent to you today.

rebexxxxxx commented 10 months ago

hi @jeannewu, where did you locate the variable D_150352141_D_206625031 and know that it was the incorrect version of D_150352141_D_623218391?

jeannewu commented 10 months ago

@rebexxxxxx you can check the light blue chunk in my summary sheet "M1V2", where the D_150352141_D_206625031 is shown as "not available on the combination of 15035241 with 206625031 in the current master DD, while the D_150352141_D_623218391 is well linked with conceptId.2 and conceptId.3 in the master DD.

jacobmpeters commented 10 months ago

Hi @rebexxxxxx, maybe I can help clarify this.

Kelsey originally told us about these variables not matching the data dictionary. We can query the list of variables that are associated with each surgery:

SELECT
  column_name
FROM
  `nih-nci-dceg-connect-prod-6d04.FlatConnect`.INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = 'module1_v2_JP'
  AND (
       column_name LIKE '%150352141%'
    OR column_name LIKE '%122887481%'
    OR column_name LIKE '%534007917%'
    OR column_name LIKE '%752636038%'
    OR column_name LIKE '%518750011%'
    OR column_name LIKE '%275770221%'
    OR column_name LIKE '%527057404%'
)

We end up with this list of variables:

Row column_name  
1 D_122887481_D_206625031  
2 D_122887481_D_261863326  
3 D_122887481_D_623218391  
4 D_122887481_D_802622485  
5 D_150352141_D_206625031  
6 D_150352141_D_261863326  
7 D_150352141_D_623218391  
8 D_150352141_D_802622485  
9 D_275770221_D_206625031  
10 D_275770221_D_261863326  
11 D_275770221_D_623218391  
12 D_275770221_D_802622485  
13 D_518750011_D_206625031  
14 D_518750011_D_261863326  
15 D_518750011_D_623218391  
16 D_518750011_D_802622485  
17 D_527057404_D_206625031
18 D_527057404_D_261863326  
19 D_527057404_D_623218391  
20 D_527057404_D_802622485  
21 D_534007917_D_206625031  
22 D_534007917_D_261863326  
23 D_534007917_D_623218391  
24 D_534007917_D_802622485  
25 D_752636038_D_206625031  
26 D_752636038_D_261863326  
27 D_752636038_D_623218391  
28 D_752636038_D_802622485

The variables terminating in 623218391 and 802622485 are correct (i.e., they are listed in the dictionary). The variables terminating in 206625031 and 261863326 are incorrectly labelled.

Here is how they are labelled in the data dictionary: surgery_variables_harmonization_issue.xlsx

rebexxxxxx commented 10 months ago

@jacobmpeters @jeannewu thank you for the clarification! i thought that each surgery variable had its own unique incorrect variables that had to be remediated. thank you!

jacobmpeters commented 10 months ago

@rebexxxxxx has a working solution. Needs to test output.

jeannewu commented 10 months ago

@jacobmpeters, I have just checked that only these seven pairs of CID2_CID3/CID2_CID2_CID3 need to coalesce as one based on the schema I downloaded yesterday: image

rebexxxxxx commented 10 months ago

**i need to check and see if altering the variable in the 2nd half of the query affects the original module_v1/2 table or if it only affects the "transition table". after speaking with @jeannewu she thinks its a better idea for the coalesce() statements to be called in the "transition" dataset (i.e. the first half of the module 1 query). this makes a lot more sense to me. if the coalesce() statements are included in the 2nd half of the query where joins are completed, then the original module1_v2 table will be augmented, which we do not want to do. i will change my solution to this problem again and move the coalesce() statements to only affect the transition dataset generated at the beginning of the query. thanks @jeannewu !

rebexxxxxx commented 10 months ago

we have a question for @jacobmpeters, is it possible for these variables to be re-written in BQ1 instead of us correcting them each time we want to update the BQ2 query/tables?

jacobmpeters commented 10 months ago

@rebexxxxxx This is a good question. I have thought a lot about this.. My current feeling is that the Flattened datasets should be a faithful representation of the upstream data in Firestore, but in flattened form. All renaming, curation, derivation, cleaning should be completed based on these flattened tables. We can certainly have a cleaned table in BQ1 on which BQ2 is based. This aligns with the diagram that I drew on the wiki.

We should discuss further during an upcoming meeting, but this is my quick answer.

rebexxxxxx commented 10 months ago

@jeannewu i have looked into using the COALESCE() statements. in your example, you were concerned about the variable D_150352141_D_623218391 being overwritten in the source module 1 v2 table, however, i found that the COALESCE() function does not touch the source table, only the output table being generated:

e.g. this does not overwrite the variable in module 1 v2 table, and only augments the output we are generating: COALESCE(v2.D_150352141_D_623218391, v2.D_150352141_D_206625031) AS D_150352141_D_623218391

In fact, i get an error when i try to use the same name of one of the coalesced variables as the alias name. the following code will not run and will produce an error: COALESCE(v2.D_150352141_D_623218391, v2.D_150352141_D_206625031) AS v2.D_150352141_D_623218391

knowing this i think i can move forward with my fix of this issue. i will not post my solution until me you and jake meet tomorrow afternoon at 2:30 since you and i spoke about how to integrate my code with your module 1 v1+v2 merge.

jeannewu commented 10 months ago

@rebexxxxxx thank you very much for your updates. This is just what I am concerned with the variables naming with the same name before and after 'coalesce'. I would use a different transitional name in this data process in a transitional data and then update back with the values of this transitional variable with the targeted variable in the original table when the data process is satisfied and correct

rebexxxxxx commented 9 months ago

This code is done, but i would like to review it within the context of issue #17. Since this code is only being used to clean up variables in module 1, I want to review it when the module 1 query generation file has been created since that is where this fix will be used. Once i have the module 1 sql query generation file set up, I will present my fix to this problem.