Closed FrogGirl1123 closed 10 months ago
@FrogGirl1123 May I add "token" in the csv file in case someone doesn't have his/her Connect_ID by accident?
@FrogGirl1123 @brotzmanmj may you please give me a folder in the box to save a copy of the birthday card csv file if possible? Thanks
@FrogGirl1123 @brotzmanmj may you please assign a folder in Box to test and review the birthday card csv file if possible? Thanks a lot
@FrogGirl1123 Do we want the name and address to be in a single field as a printable address label?
Example from dev:
@FrogGirl1123 @jeannewu
I set up a pipeline to deliver these data to Box in STG.
Please take a look and provide feedback:
Box file: https://nih.app.box.com/file/1379407696559
BigQuery SQL [modified from Jing's query]: https://console.cloud.google.com/bigquery?sq=810867956467:209fa7c3d822486e814623a310dcfe7a
Cloud Function for Delivery to Box: https://console.cloud.google.com/functions/details/us-central1/exportBirthdayCardInfoToBox?env=gen1&project=nih-nci-dceg-connect-stg-5519&tab=source
@jacobmpeters Nice! Thanks a lot.
Hi, Jake, the csv file name with Madhuri's suggestion would be like "CSV file for ",month," Birthday Card Recipients_2024.csv". If possible, would you like to help me update the name of the csv file? Thanks a lot.
Hi @jeannewu. Thank you for letting me know. I wasn't aware of this requirement. I can update the CSV name. I will replace the spaces with underscores.
Here is an updated CSV file with the new name: https://nih.app.box.com/file/1379431144150
Filename: CSV_file_for_December_2023_birthday_card_recipients_000000000000.csv
Note: The "_000000000000" is added by BigQuery. If the file size is too large, BigQuery will automatically clip the data into multiple CSV files and increment this tag by 1 for each additional file produced. We should not remove it until we know how this scales in production.
Jing mentioned that Madhuri and/or the Operations team would not to have spaces in the filename. If this is the case, it is possible to do this:
"CSV file for December 2023 Birthday Card Recipients_000000000000.csv" https://nih.app.box.com/file/1379437544841
Generally speaking it's not good practice to have spaces in filenames because different programs and operating systems handle spaces differently. It happens to work for my pipeline, but it might not work for the users'.
@jacobmpeters i think in that case it's fine to have "_" separating the words in the filename!
@mnataraj92 Ok! Sounds good. I'll switch it back :)
It should just be "CSV_filefor[insert month here]_Birthday_Card_Recipients_2024" so January would be "CSV_file_for_January_Birthday_Card_Recipients_2024"
Thanks!!
Thank you very much for your helpful comments. And right now, the code is set up at the corresponding month of each Birth month to pull out the list. Please let us know if the list would be prepared in advance (say one month earlier). It will be helpful for us to update the script. Thanks a lot.
It should just be "CSV_filefor[insert month here]_Birthday_Card_Recipients_2024" so January would be "CSV_file_for_January_Birthday_Card_Recipients_2024"
Thanks!!
Hi Madhuri. Here is the updated version: CSV_file_for_December_Birthday_Card_Recipients_2023_000000000000.csv
Hi, Jake, in case for the preparation of the Birthday Card in the coming month, I have updated my script as ,"SELECT Connect_ID, token, d_564964481, d_996038075,d_399159511, d_521824358,d_442166669,d_703385619,d_635101039,d_892050548 FROM nih-nci-dceg-connect-prod-6d04.FlatConnect.participants_JP
WHERE d_821247024 = '197316935' and cast(d_564964481 as int64) = EXTRACT(month FROM date_add(date_trunc(current_date(),month),interval+1 month)) and d_747006172 != '353358909' and d_987563196 != '353358909' and d_827220437 != '809703864'" If possible, please help us update the extraction month in the script (as above). Thanks a lot.
@FrogGirl1123 I talked with you this afternoon about the preparation for the January list of Birthday card. As we agreed, we might get the list one month earlier for the NORC birthday participants in the coming month, right?
Hi @jacobmpeters and @jeannewu I'll email NORC and get answers to your questions
@FrogGirl1123 I talked with you this afternoon about the preparation for the January list of Birthday card. As we agreed, we might get the list one month earlier for the NORC birthday participants in the coming month, right?
Hi @jeannewu and @FrogGirl1123, I incorperated Jing's suggestion. The query now exports next month's birthday list:
CSV_file_for_January_Birthday_Card_Recipients_2023_000000000000.csv
Hi @jacobmpeters and @jeannewu
Here is what NORC said in regards to how they'd like the addresses & when they would like the csv file by
1) Please send as columns: first_name, last_name, address_line_1, address_line_2, city, and zip_code 2) To get the cards out early each month, please provide the CSV files by the 22nd of each prior month a. For example, we’ll need the January file by December 22nd
@mnataraj92 Thanks for letting us know. This is very clear and doable!
I would like to set up an additional test folder for the production data tomorrow so that we can see how the CSV delivery scales up before we share to the NORC folder.
Could you please make a test folder for prod and invite Jing and me as "editors"? Could you also add AutomationUser_1960507_dBBoRFHiim@boxdevedition.com as a "viewer/uploader"?
@jacobmpeters done!
@mnataraj92 Thanks! I noticed that there are two NORC folks included on this test folder. I think I'd like to just test it with a folder with only a few Connect folks, so that we can make sure it's good-to-go before we share with NORC. What do you think?
I'd also like to review a test file from dev and then from stage before we run in prod. Madhuri and I will review the data in dev and stage. Please let us know when it's ready. For the initial review, we will need a way to ensure that all the requirements were met. Can you run frequency counts to indicate how many verified participants there were with birthdays in the month, how many were withdrawn and how many deceased? That will help us know that the number in the file produced is correct. Or if there is another way to show us this for confirmation, I'm open to that.
After we have done that, we should also share a test file from dev or stage with NORC before running it in prod so they can confirm they can consume the format we provide.
@jacobmpeters @mnataraj92 would you like to test directly with sql or R code? I will update and synch the R code to the git soon for the coming test.
@mnataraj92 in your previous message from the NORC feedback, the State of the participants' address would be included in the Birthday Card list, right?
Hi all,
I spoke with Michelle and Jake. @jacobmpeters I'll create a separate secure folder for the prod testing and will let you know when that is ready.
While you're testing that, Michelle and I can check the stage data once the frequency counts Michelle requested are generated. @jeannewu can you please generate the frequency counts that Michelle requested above? Once that is complete, please let me know, and Michelle and I will check the data to confirm accuracy. Jake mentioned that the most recent stage file currently has an extra column that he will need to remove with the full name and address in one string, but that should be fixed tomorrow.
We can then share a test file with NORC from stage to make sure everything looks good to them.
@brotzmanmj @mnataraj92 based on today's data, the NORC verified participants are counted in each database: prod, stg, dev listed |
birth_ID-prod | birth_ID_stage | birth_ID_dev |
---|---|---|---|
1 | 1350 | 62 | 104 |
2 | 1320 | 26 | 48 |
3 | 1462 | 38 | 51 |
4 | 1393 | 34 | 47 |
5 | 1479 | 34 | 32 |
6 | 1526 | 34 | 36 |
7 | 1562 | 38 | 38 |
8 | 1538 | 42 | 31 |
9 | 1623 | 41 | 31 |
10 | 1516 | 33 | 33 |
11 | 1325 | 34 | 42 |
12 | 1508 | 17 | 14 |
month code error | 1 | 0 | 0 |
total Birthday (verified, with consent, not deceased from NORC) | 17603 | 433 | 507 |
deceased (overall) | 2 (NORC) | 1 (NORC) | 9 (1 Uoc, 8 NORC including NIH) |
consent withdrawls (overall) | 253 (6 UoC, 247 NORC) | 20 (15 NORC + 5 UoC) | 30 (1 UoC, 29 NORC + NIH) |
Hope the counts of Birthday in NORC undeceived, verified participants without consent withdrawals make sense.
This file in stage complies with NORCS formatting requests and is ready to be checked. https://nih.app.box.com/file/1381312677259
The following filters were applied:
WHERE # FILTERS:
d_821247024 = '197316935' # Verifiation Status must be Verified
AND birth_month = next_month_as_integer # Birth Month must be Next Month
AND d_747006172 != '353358909' # Consent is NOT Withdrawn
AND d_987563196 != '353358909' # Participant is NOT Deceased
AND d_827220437 != '809703864' # Healthcare provider is NOT University of Chicago
@FrogGirl1123 Mentioned that University of Chicago should not be excluded, so I updated the filter.
This is the updated file: https://nih.app.box.com/file/1381327014282
WHERE # FILTERS:
d_821247024 = '197316935' # Verifiation Status must be Verified
AND birth_month = next_month_as_integer # Birth Month must be Next Month
AND d_747006172 != '353358909' # Consent is NOT Withdrawn
AND d_987563196 != '353358909' # Participant is NOT Deceased
Just confirmed birth month you're using is 564964481?
@brotzmanmj Correct. I'll share the whole query in case you have more questions about CIDs.
-- NORC Birthday Card List
-- Authors: Jing Wu and Jake Peters
-- Description:
-- Get the Connect_ID, name and address of participants who have a birthday this month.
-- Include only participants (1) that are verified, (2) that have not withdrawn consent, and
-- (3) that are not deceased.
# CSV_file_for_January_Birthday_Card_Recipients_2024
DECLARE next_month_as_string STRING DEFAULT FORMAT_DATE('%B', DATE_ADD(CURRENT_DATE, INTERVAL 1 MONTH));
DECLARE next_month_as_integer INT64 DEFAULT CAST(EXTRACT(month FROM DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH)) AS INT64);
DECLARE uri_with_date STRING DEFAULT CONCAT(
'gs://birthday_card_info/CSV_file_for_', next_month_as_string, '_Birthday_Card_Recipients_', FORMAT_DATE('%Y', CURRENT_DATE), '_boxfolder_237836684699_*.csv'
);
EXPORT DATA OPTIONS (uri = CAST(uri_with_date AS STRING), format = 'CSV', overwrite = true, header = true, field_delimiter = ',') AS
WITH AddressInfo AS(
SELECT
Connect_ID,
token,
CAST(d_564964481 AS INT64) AS birth_month,
d_996038075 AS last_name,
d_399159511 AS first_name,
d_521824358 AS address_line_1,
d_442166669 AS address_line_2,
d_703385619 AS city,
d_634434746 AS state,
d_892050548 AS zip_code,
d_821247024,
d_747006172,
d_987563196,
d_827220437
FROM
`nih-nci-dceg-connect-stg-5519.FlatConnect.participants_JP`)
SELECT
-- Connect_ID,
first_name,
last_name,
address_line_1,
address_line_2,
city,
state,
zip_code,
-- CONCAT(first_name, ' ', last_name, '\n',
-- address_line_1, IFNULL(CONCAT('\n', address_line_2), ''), '\n',
-- city, ', ', state, ' ', zip_code)
-- AS full_address_label,
FROM AddressInfo
WHERE # FILTERS:
d_821247024 = '197316935' # Verifiation Status must be Verified
AND birth_month = next_month_as_integer # Birth Month must be Next Month
AND d_747006172 != '353358909' # Consent is NOT Withdrawn
AND d_987563196 != '353358909' # Participant is NOT Deceased
;
Thanks for sharing!
@brotzmanmj CID Variable Label
Variable Name
@mnataraj92 @FrogGirl1123
I mentioned that BigQuery chops the Birthday card data into many CSV files when the data are exported via scheduled query from Prod. I set up a Cloud Function to concatenate the files and export them to a specified folder in Box.
The initial file has ben exported to a test folder set up by Madhuri (with only a few internal viewers).
https://nih.app.box.com/folder/238384827882
It is now ready to be independently checked @jeannewu , but @mnataraj92 will need to give you access first.
The scheduled query is configured to export data for the following month on the 22 of each month. When we confirm that the data look as expected, we just need to switch swap out the test_folder_id for the for the norc_folder_id.
I have given @jeannewu access
@jacobmpeters @mnataraj92 Thanks a lot. I have seen the birthday card data (file id =1387128776918) in the folder 238384827882. Just check This file is for the January Birthday Card (n=1618) of nine Connect Sites including UoC.
Lauren from NORC mentioned a few additional columns she'd like added:
There are three fields we need added to the CSV file:
@mnataraj92 I will add these back in and generate another test file. I'll let you know when this is done.
@mnataraj92 Just a clarification, PIN should just be the same value as CONNECT_ID, correct?
This is a little confusing because the participants table has a 'pin' field already.. So, I just wanted to double check.
An alternative would be to just have an empty field called PIN where they can generate the random ID number for tracking purposes.
hi @jacobmpeters I just emailed Lauren asking her preference (and cc'd you)
@jacobmpeters Lauren mentioned that the “$PIN” field should have the same values as CONNECT_ID. It will be manipulated when they upload the files into their case management system.
The updated CSV file for stage and prod have been exported to the appropriate box folders. Stage Testing: https://nih.app.box.com/file/1392302711509 Prod Testing: https://nih.app.box.com/file/1392309347706
Thanks!! I'll ask Lauren to take a look at the stage file to make sure everything looks good to her
The updated CSV file for stage and prod have been exported to the appropriate box folders. Stage Testing: https://nih.app.box.com/file/1392302711509 Prod Testing: https://nih.app.box.com/file/1392309347706
Oops. I have to add the leading zeros for January - September
Leading zeros are now implemented for DOBM for both stage and prod.
Lauren said the stage file looks good!
Great! After the punch list tomorrow, if everyone approves of the data, I'll point the delivery to the NORC folder so that it will deliver as scheduled on the 22nd of every month.
The Birthday Card CSV delivery pipeline is now pointed at the NORC-facing box folder (https://nih.app.box.com/folder/237836526066). It is scheduled to deliver at ~8:30 AM on the 22nd of every month. Once the January file is delivered successfully on Dec. 22 we can close this issue.
Starting a new issue separate from #736 as the API will be developed in the future
As decided through email and Punch List conversations on Wed., Nov. 29, we will put a hold on development of the API filter. Instead we will create a csv file with the with the following requirements to be shared with NORC via Box as we do for the BQ1 data for the sites: For all participants who are verified and not withdrawn and have birthdays in a given month (e.g. January for out first shared file). The csv file will contain: Connect ID Name (current first and last name from User Profile) Mailing address (current address from User Profile)
@brotzmanmj please add any additional requirements and let @mnataraj92 where to set up the folder for NORC. Also what would we like as the delivery date, first day of the month, end of the previous month?
@brotzmanmj, @sonyekere, @anthonypetersen should we break this into a new Issue since the API work will resume eventually?
@jeannewu I would like you to write the scheduled query for this table and @jacobmpeters to work on getting the resulting csv into Box
Updates from replies in #736: From @brotzmanmj add "not deceased" as a requirement