Analyticsphere / bq2

SQL query development for Connect for Cancer Preventions' BQ2 database.
https://analyticsphere.github.io/bq2/
0 stars 1 forks source link

Both YEAR and AGE in single column [D_317093647] #8

Open jacobmpeters opened 10 months ago

jacobmpeters commented 10 months ago

From Jing @jeannewu in MS Teams:

[Thursday 12:53 PM] Wu, Jing (NIH/NCI) [C] HI, Nicole, and Jake, I have one question on the variable d_317093647 (How old was your mother when they were first told by a doctor or other health professional that they have or had esophageal cancer? as the conceptId.2) in module 1. This variable seems changed to be a nested variable, associated with age (206625031) or year (261863326) at diagnosis now based on the current master DD. However, neither of these nested variables has a solid entry in the module 1 version 2 or verison 1 data. All the entries by age or year are still under the variable d_317093647 in bother versions of module 1 data. Just here to make a note on this variable.

Originally posted by @jacobmpeters in https://github.com/Analyticsphere/bq2/issues/4#issuecomment-1870534028

To view problem:

SELECT d_317093647 
FROM `nih-nci-dceg-connect-prod-6d04.FlatConnect.module1_v2_JP` 
WHERE d_317093647 IS NOT NULL
jacobmpeters commented 10 months ago

I moved this one to a separate issue since it is slightly different from the one in #4.

jacobmpeters commented 10 months ago

When we confirm with @FrogGirl1123 that these should be partitioned, we can use the following SQL to do so:

SELECT
  -- Variable with mix of ages (e.g., 55) and years (e.g., 1987)
  D_317093647,

  -- Case when its an age
  CASE
    WHEN REGEXP_CONTAINS(D_317093647, r'^\d{1,2}$') 
      AND CAST(D_317093647 AS INT64) BETWEEN 0 AND 99 
        THEN CAST(D_317093647 AS INT64)
    ELSE
      NULL
  END AS D_317093647_D_206625031,

  -- Case when its a year
  CASE
    WHEN REGEXP_CONTAINS(D_317093647, r'^\d{4}$') 
      THEN CAST(D_317093647 AS INT64)
  ELSE
    NULL
  END AS D_317093647_D_261863326

FROM
  `nih-nci-dceg-connect-prod-6d04.FlatConnect.merged_module1`
WHERE
  D_317093647 IS NOT NULL

This assumes that ages are between 0-99. It will need to be tweaked if this is not the case. Some additional checks for years might also be appropriate here..

jeannewu commented 10 months ago

@jacobmpeters Thanks a lot. You have made such a nice patch on his issue of this variable I've just thought about. But I think the variables as 623218391 & 802622485 might be more updated than this pair of 206625031 and 261863326. How is your opinion?

jacobmpeters commented 10 months ago

@jeannewu Good catch! I agree. Thanks for looking it over so quickly.

I'll update those below:

SELECT
  -- Variable with mix of ages (e.g., 55) and years (e.g., 1987)
  D_317093647,

  -- Case when its an age
  CASE
    WHEN REGEXP_CONTAINS(D_317093647, r'^\d{1,2}$') 
      AND CAST(D_317093647 AS INT64) BETWEEN 0 AND 99 
        THEN CAST(D_317093647 AS INT64)
    ELSE
      NULL
  END AS D_317093647_D_623218391,

  -- Case when its a year
  CASE
    WHEN REGEXP_CONTAINS(D_317093647, r'^\d{4}$') 
      THEN CAST(D_317093647 AS INT64)
  ELSE
    NULL
  END AS D_317093647_D_802622485

FROM
  `nih-nci-dceg-connect-prod-6d04.FlatConnect.merged_module1`
WHERE
  D_317093647 IS NOT NULL
jeannewu commented 10 months ago

@jacobmpeters Just talked with Kelsey on this variable, Kelsey said she would let Tony check and fix this variable in Firestore. I also checked in the Connect.module1_v2: there are more entries under this variable recently. We need let Tony know this issue and stop and fix it at the beginning. And we can fix the current ones for our BQ2 if it is not too much burden. How is your opinion?

jacobmpeters commented 10 months ago

Thanks for the update @jeannewu.

It would be great if Tony can fix it so that this issue doesn't occur for new data. But, unless he also plans to correct the old data in Firestore, I think we will still need to correct this for BQ2. Can you or @KELSEYDOWLING7 update this issue after the conversation with Tony?

jacobmpeters commented 10 months ago

Notes from Nicole: