episphere / connect

Connect API for DCEG's Cohort Study

Stage data destruction stub record, remaining issues #658

Closed robertsamm closed 1 year ago

robertsamm commented 1 year ago

Reviewed the stub record in stage for Connect ID 1475895409 and found a few remaining issues. After the stub record was pushed, the same thing happened in the SMDB: I can't view the Participant Summary page for this participant anymore. I'm not sure whether that is intentional, but I don't see it in the SOP. Looking at this record in the SMDB and at the CIDs in the array Jessica sent me, here are the remaining issues I found:

  1. Preferred email should not be retained, needs to be removed (I see it in the SMDB and the list of CIDs in the array you sent)
  2. Preferred name should also not be retained, needs to be removed (I see this listed in the array)
  3. Participant demographic variables should not be retained (site reported age, site reported race/ethnicity, site reported sex). I see this in the SMDB but I don't see the CIDs listed in the array you sent. These are all vars sent by the site.
  4. Participant verification table variables for site match vars and campaign type should not be retained (first name match, last name match, DOB match, PIN match, token match, zip code match, age match, cancer status match). These are also all vars sent by the site and I see them filled in still on the SMDB. Should only need to retain verification status and time of verification.
  5. Participant Summary page unclickable after stub record created
  6. I can't confirm that the biospecimen variables were not retained without the Participant Summary page. Can someone double-check this data on the backend?

KELSEYDOWLING7 commented 1 year ago

@jhflorey If you can, that would be great! I'll double-check the second participant whenever you're done. The raw tables in BQ won't refresh until 4:30pm, so it doesn't need to be done right away.

jhflorey commented 1 year ago

@KELSEYDOWLING7 It's done for 3231286166. BTW, my PR is already merged into dev, so I guess you can check it out in dev.

KELSEYDOWLING7 commented 1 year ago

@jhflorey Great, thank you.

It looks like for the first participant, Connect_ID 3994600604, some stub record variables were deleted:

- consent middlename, consent suffix
- Date of withdrawal
- Who requested withdrawal
- Middle Name extracted from HIPAA revocation form
- Middle Name extracted from Data Destruction form

Checking with @brotzmanmj, deleting the date of withdrawal and who requested the withdrawal seems to make sense, but @kmazzilli confirmed that a middle name was entered, so it should not have been deleted. Do you mind looking into this?

jhflorey commented 1 year ago

@KELSEYDOWLING7 I think we should create a new participant and then test in dev again.

KELSEYDOWLING7 commented 1 year ago

@kmazzilli Was there another participant from the list I sent that we could use for additional testing?

kmazzilli commented 1 year ago

@KELSEYDOWLING7 Connect ID:2361618927 has samples collected but needs additional surveys filled out. I can fill out the surveys, add to the name fields, go through with the data destruction, and let you know when I am done, if that works.

KELSEYDOWLING7 commented 1 year ago

@kmazzilli That works for me!

jhflorey commented 1 year ago

@KELSEYDOWLING7 have you tested in dev?

KELSEYDOWLING7 commented 1 year ago

@jhflorey Kaitlyn is testing in DEV now. There's some delay, with the biospecimen shipment issue being resolved today. The deletion should happen tonight, and I can check the data tomorrow.

kmazzilli commented 1 year ago

I was still encountering issues with shipping today - it sounds like a resolution is coming soon and I'll try again tomorrow

Davinkjohnson commented 1 year ago

> @brotzmanmj @Davinkjohnson @KELSEYDOWLING7 I'll update the code for deleting bioSurvey_v1, clinicalBioSurvey_v1, covid19Survey_v1, menstrualSurvey_v1, module1_v1, module1_v2, module2_v1, module2_v2, module3_v1, module4_v1, ssn, biospecimen now. As for the boxes table, I will work on it once we have a final decision.

@jhflorey we previously did not discuss the notifications table. We need to make sure the records related to a data destroyed participant are also removed from notifications. (Use token on this table for the key to participants.)
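The token-based cleanup described above can be sketched as follows. This is a minimal illustration on plain in-memory records, not the actual Firestore code: the `token` field name comes from the comment, while the function name and the sample records are hypothetical.

```python
from typing import Dict, List


def drop_notifications_for(token: str, notifications: List[Dict]) -> List[Dict]:
    """Return the notifications that do NOT belong to the destroyed participant.

    Assumption: every notification record carries the participant's `token`.
    """
    return [n for n in notifications if n.get("token") != token]


# Hypothetical sample data, for illustration only.
notifications = [
    {"id": "n1", "token": "tok-A"},
    {"id": "n2", "token": "tok-B"},
    {"id": "n3", "token": "tok-A"},
]
remaining = drop_notifications_for("tok-A", notifications)
```

Only the record whose token differs from the destroyed participant's token is kept.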

KELSEYDOWLING7 commented 1 year ago

@kmazzilli Were we able to resolve the shipping issue? And are we planning to do a data destruction push tonight?

KELSEYDOWLING7 commented 1 year ago

@Davinkjohnson @brotzmanmj Would you be able to add Jing to this issue for the backend testing next week?

Also, please note that the biospecimen data has two documents. I'm not sure whether our data destruction code deletes all documents when there are multiple.

kmazzilli commented 1 year ago

@KELSEYDOWLING7 no, the shipping fix is still under review from the team and will not be ready by tonight according to Davin

jhflorey commented 1 year ago

@Davinkjohnson Just merged my code changes into dev.

jeannewu commented 1 year ago

Thanks, @jhflorey. I will help Kelsey check the data and let you know soon whether the data is deleted from these BQ tables in dev.

kmazzilli commented 1 year ago

Hi all - I was able to ship the samples in Box 76 and go ahead with the data destruction request for connect_id: 2361618927

jeannewu commented 1 year ago

@Davinkjohnson Thank you very much for adding me to this chat, Davin. I will be the backup for Kelsey's work next week. Please let me know if anything is needed on my side. I have just checked the data destruction progress in dev for Connect_ID 2361618927: the data have not been deleted yet. I will check all related data in dev again tomorrow and let you know whether the data are deleted. Thanks a lot.

brotzmanmj commented 1 year ago

Thanks Jing. @jhflorey please check the data tomorrow in Firestore. Jing will check it in BQ. @jeannewu when you check it, please check that all the data sources are gone (including both biospecimen records, all surveys, SSN, etc.), and also make sure all of the stub variables remain. Thanks.

jeannewu commented 1 year ago

@brotzmanmj Got it. I will. Thanks

jhflorey commented 1 year ago

Looks like my code changes do not work in dev. The data are still not deleted. Let me check it.

jeannewu commented 1 year ago

@jhflorey I have just checked the BQ data in dev: bioSurvey_v1_JP, module1_v1_JP, module2_v1_JP, and menstrualSurvey_v1_JP don't have data for Connect_ID=236168927. But in the other tables (shown below), this participant's data are still there.

Connect_ID = '2361618927'

Table                     n columns   rows
bioSurvey_v1_JP           380         0
biospecimen_J             318         2
clinicalBioSurvey_v1_JP   260         1
covid19Survey_v1_JP       193         1
module1_v1_JP             1506        0
module1_v2_JP             1902        1
module2_v1_JP             705         0
module2_v2_JP             739         1
module3_v1_JP             364         1
module4_v1_JP             1302        1

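A per-table check like the one in the table above could be generated with a small helper. This is only a sketch under assumptions: the project and dataset names are the ones quoted in this thread, the `_JP` tables are assumed to live in `FlatConnect`, `Connect_ID` is assumed to be stored as a string there (as the quoted filter suggests), and the helper names are hypothetical. It only builds query strings and does not call BigQuery.

```python
from typing import Dict, List

# Project and dataset names as they appear in this thread.
PROJECT = "nih-nci-dceg-connect-dev"
DATASET = "FlatConnect"  # assumption: the *_JP tables live here


def count_query(table: str, connect_id: str) -> str:
    """Build a query counting the rows a participant still has in one table."""
    if not connect_id.isdigit():
        # Only digits are interpolated, so the value is safe to embed.
        raise ValueError(f"Connect_ID must be numeric, got {connect_id!r}")
    return (
        "SELECT COUNT(*) AS n_rows "
        f"FROM `{PROJECT}.{DATASET}.{table}` "
        f"WHERE Connect_ID = '{connect_id}'"
    )


def count_queries(tables: List[str], connect_id: str) -> Dict[str, str]:
    """Map each table name to its row-count query."""
    return {t: count_query(t, connect_id) for t in tables}


queries = count_queries(["module1_v2_JP", "module2_v2_JP"], "2361618927")
```

After a successful data destruction run, each of these queries would be expected to return 0 for the survey and biospecimen tables.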
jhflorey commented 1 year ago

@jeannewu Yup, the stats are correct. I have a PR to fix it. I will let you know once it's merged into dev.

jhflorey commented 1 year ago

@brotzmanmj @jeannewu My code changes fixing the issue above are already merged into dev. Do you want me to trigger the job manually now, or will you wait for the job to run at 1am and test tomorrow?

jeannewu commented 1 year ago

@jhflorey Thank you very much for asking. I am quite flexible on my end. The BQ1 (flattened data) data in GCP are scheduled to be refreshed once a day (every morning around 10 am). If you would like me to check today, the original data .connect. in dev would be the better option. Otherwise, I will wait and recheck FlatConnect in dev tomorrow. How about you?

jhflorey commented 1 year ago

@jeannewu i will manually run the data destruction job for testing my code changes in dev.

jhflorey commented 1 year ago

@jeannewu The Firestore data was clean after running the job manually.

brotzmanmj commented 1 year ago

Thanks to you both. @jeannewu can you check the list of stub variables and make sure they are all still there? And @kmazzilli can you check MyConnect and also the SMDB and make sure only the stub variables remain, the forms are all still accessible with the correct signatures, and the notifications are gone from both the MyConnect and the SMDB?

brotzmanmj commented 1 year ago

And Kaitlyn, we should also check the biospecimen dashboard and see if the data are gone from there (and the Box data is not gone). We can look at that together if you like, i'm not sure exactly how we'll check but there are a couple of potential ways.

jeannewu commented 1 year ago

@brotzmanmj @jhflorey I will check them right now and let you know if data is deleted or not.

jhflorey commented 1 year ago

@brotzmanmj @jeannewu i noticed the Middle Name extracted from HIPAA revocation form and Middle Name extracted from Data Destruction form still exist. [image]

consent middlename and consent suffix do not exist. I do not know whether we had input for them or not.

jeannewu commented 1 year ago

@jhflorey I just checked that all the data of this participant are deleted from dev.Connect. But I think, as Kelsey told me before, some part of the participant data for this Connect_ID should be kept in the .Connect.participants table: "SELECT d_471168198,d_736251808,d_436680969,d_480305327,d_564964481,d_795827569,d_544150384,d_371067537,d_454205108,d_454445267,d_919254129, d_412000022,d_558435199,d_262613359,d_821247024, d_914594314, d_747006172, d_659990606, d_299274441.d_299274441,d_919699172,d_141450621,d_576083042,d_431428747,d_121430614,d_523768810,d_639172801,d_175732191,d_150818546,d_624030581,d_285488731,d_596510649,d_866089092,d_990579614,d_131458944,d_372303208,d_777719027,d_620696506, d_352891568,d_958588520,d_875010152,d_404289911,d_637147033,d_734828170,d_715390138,d_538619788,d_153713899, d_613641698,d_407743866,d_831041022,d_269050420,d_359404406,d_119449326,d_304438543,d_912301837,d_130371375.d_266600170.d_731498909, d_130371375.d_303552867.d_731498909, d_130371375.d_496823485.d_731498909, d_130371375.d_650465111.d_731498909, d_130371375.d_266600170.d_787567527, d_130371375.d_266600170.d_222373868, d_130371375.d_303552867.d_222373868, d_130371375.d_496823485.d_222373868, d_130371375.d_650465111.d_222373868, d_130371375.d_266600170.d_648936790, d_130371375.d_303552867.d_648936790, d_130371375.d_496823485.d_648936790, d_130371375.d_650465111.d_648936790, d_130371375.d_266600170.d_297462035, d_130371375.d_303552867.d_297462035, d_130371375.d_496823485.d_297462035, d_130371375.d_650465111.d_297462035, d_130371375.d_266600170.d_648228701, d_130371375.d_303552867.d_648228701, d_130371375.d_496823485.d_648228701, d_130371375.d_650465111.d_648228701, d_130371375.d_266600170.d_438636757, d_130371375.d_303552867.d_438636757, d_130371375.d_496823485.d_438636757, d_130371375.d_650465111.d_438636757,d_765336427,d_479278368,d_826240317,d_693626233,d_104278817,d_744604255, d_268665918,d_592227431,d_399159511,d_231676651,d_996038075,d_506826178,d_524352591.d_524352591, d_524352591.d_902332801, d_524352591.d_902332801,d_299274441.d_457532784,d_773707518,d_577794331,d_883668444 FROM nih-nci-dceg-connect-dev.Connect.participants WHERE Connect_ID= 236168927", right? If yes, is this part of the data for this Connect_ID not available now?

jeannewu commented 1 year ago

@brotzmanmj @kmazzilli Could you show me how to check the boxes table for the data destruction of this Connect_ID's biospecimen data? The boxes table is not linked by Connect_ID, but by the box number, which contains the biospecimen ID (if my description is correct)?

jhflorey commented 1 year ago

@jeannewu We don't remove any data in the boxes table, per this confirmation: https://github.com/episphere/connect/issues/658#issuecomment-1684289146

jhflorey commented 1 year ago

@jeannewu not sure about your process. this is my stub records list.

[ "query", "pin", "token", "state", "Connect_ID", "471168198", "736251808", "436680969", "480305327", "564964481", "795827569", "544150384", "371067537", "454205108", "454445267", "919254129", "412000022", "558435199", "262613359", "821247024", "914594314", "747006172", "659990606", "299274441", "919699172", "141450621", "576083042", "431428747", "121430614", "523768810", "639172801", "175732191", "150818546", "624030581", "285488731", "596510649", "866089092", "990579614", "131458944", "372303208", "777719027", "620696506", "352891568", "958588520", "875010152", "404289911", "637147033", "734828170", "715390138", "538619788", "153713899", "613641698", "407743866", "831041022", "269050420", "359404406", "119449326", "304438543", "912301837", "130371375", "765336427", "479278368", "826240317", "693626233", "104278817", "744604255", "268665918", "592227431", "399159511", "231676651", "996038075", "506826178", "524352591", "902332801", "457532784", "773707518", "577794331", "883668444", "827220437", "699625233", ]

Or you can refer to https://nih.app.box.com/file/1255095111396
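Reducing a participant document to a stub record with a keep-list like the one above can be sketched as follows. This is a minimal illustration, not the job's actual code: it assumes the keep-list applies to top-level keys only (so a kept key such as `130371375` retains its nested content), and the function name and sample document are hypothetical.

```python
from typing import Any, Dict, Iterable


def to_stub_record(participant: Dict[str, Any], stub_fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only the top-level fields named in the stub keep-list."""
    keep = set(stub_fields)
    return {key: value for key, value in participant.items() if key in keep}


# A few entries copied from the list above; the full list is longer.
STUB_FIELDS = ["pin", "token", "state", "Connect_ID", "471168198", "736251808"]

# Hypothetical participant document; "not_a_stub_field" stands in for any
# variable that is not on the keep-list.
participant = {
    "Connect_ID": 2361618927,
    "token": "tok-A",
    "471168198": "kept value",
    "not_a_stub_field": "dropped value",
}
stub = to_stub_record(participant, STUB_FIELDS)
```

A keep-list has the property that any field not explicitly listed is dropped, which matches how the retained variables are discussed in this thread.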

jeannewu commented 1 year ago

@jhflorey Here is what I checked: "SELECT * FROM nih-nci-dceg-connect-dev.Connect.participants WHERE Connect_ID= 236168927" returns "There is no data to display."

brotzmanmj commented 1 year ago

@jeannewu 'i noticed the Middle Name extracted from HIPAA revocation form and Middle Name extracted from Data Destruction form still exist.' These are supposed to still exist. They should be on the stub variables list.

kmazzilli commented 1 year ago

Hi all - I checked SMDB and there was one issue for connect_id: 2361618927. I could not download the original HIPAA and consent agreement forms. Instead I received an error message that said "An error has occured generating the pdf please contact support". Otherwise, everything else looked good - I was able to download the data destruction and HIPAA revocation forms and the signatures were correct on SMDB, I was able to access all 4 forms in MyConnect, the notifications are gone from both the MyConnect and the SMDB, and the correct variables were the only ones remaining in SMDB.

jeannewu commented 1 year ago

@brotzmanmj I think @jhflorey manually updated the Connect.participants table in dev, not the FlatConnect.participants_JP table. All the data of this participant in Connect are removed, including the ones in the participants table. But since the flattened tables in "FlatConnect" are not updated yet, the data of this participant there are still the original ones from before @jhflorey manually ran her code against Firestore.

kmazzilli commented 1 year ago

hi @jeannewu, since it's been more than an hour, are you able to confirm that the data in BQ has been updated?

jeannewu commented 1 year ago

@jhflorey I've just checked that the data for Connect_ID = 2361618927 is not in the nih-nci-dceg-connect-dev.Connect.###. datasets, but it is all still in nih-nci-dceg-connect-dev.FlatConnect. ###.

jeannewu commented 1 year ago

@kmazzilli I've just checked that the data for Connect_ID = 2361618927 is not in the nih-nci-dceg-connect-dev.Connect.###. datasets, but it is all still in nih-nci-dceg-connect-dev.FlatConnect. ###.

brotzmanmj commented 1 year ago

Hi Jing, can you explain what that means?

jeannewu commented 1 year ago

@brotzmanmj The data in nih-nci-dceg-connect-dev.FlatConnect. ###. are the ones updated this morning at 9:30-10am, as scheduled daily. But the data in nih-nci-dceg-connect-dev.Connect. ###. are synchronized with the ones in Firestore, which might be updated hourly. So after @jhflorey manually ran her data destruction code in Firestore this afternoon, all the data in "nih-nci-dceg-connect-dev.Connect. ###" reflect the changes her code made in Firestore.

jeannewu commented 1 year ago

@brotzmanmj All the data for participant Connect_ID = 2361618927 are deleted from the .connect. tables, including the ones which should not be deleted in the participants table, as Kelsey told me before. Is this what you want for the data destruction on this participant?

brotzmanmj commented 1 year ago

Thanks @jeannewu So you're saying that the data in BQ that get updated hourly have been deleted, but there are lingering data somewhere in BQ that get extracted/flattened once a day at ~9:30am and those we should expect will be deleted tomorrow morning?

jeannewu commented 1 year ago

@brotzmanmj Yes. But what about the information on HIPAA, refusal, withdrawal, etc. for this Connect_ID? Should it also be deleted from the participants table? "SELECT d_471168198,d_736251808,d_436680969,d_480305327,d_564964481,d_795827569,d_544150384,d_371067537,d_454205108,d_454445267,d_919254129, d_412000022,d_558435199,d_262613359,d_821247024, d_914594314, d_747006172, d_659990606, d_299274441.d_299274441,d_919699172,d_141450621,d_576083042,d_431428747,d_121430614,d_523768810,d_639172801,d_175732191,d_150818546,d_624030581,d_285488731,d_596510649,d_866089092,d_990579614,d_131458944,d_372303208,d_777719027,d_620696506, d_352891568,d_958588520,d_875010152,d_404289911,d_637147033,d_734828170,d_715390138,d_538619788,d_153713899, d_613641698,d_407743866,d_831041022,d_269050420,d_359404406,d_119449326,d_304438543,d_912301837,d_130371375.d_266600170.d_731498909, d_130371375.d_303552867.d_731498909, d_130371375.d_496823485.d_731498909, d_130371375.d_650465111.d_731498909, d_130371375.d_266600170.d_787567527, d_130371375.d_266600170.d_222373868, d_130371375.d_303552867.d_222373868, d_130371375.d_496823485.d_222373868, d_130371375.d_650465111.d_222373868, d_130371375.d_266600170.d_648936790, d_130371375.d_303552867.d_648936790, d_130371375.d_496823485.d_648936790, d_130371375.d_650465111.d_648936790, d_130371375.d_266600170.d_297462035, d_130371375.d_303552867.d_297462035, d_130371375.d_496823485.d_297462035, d_130371375.d_650465111.d_297462035, d_130371375.d_266600170.d_648228701, d_130371375.d_303552867.d_648228701, d_130371375.d_496823485.d_648228701, d_130371375.d_650465111.d_648228701, d_130371375.d_266600170.d_438636757, d_130371375.d_303552867.d_438636757, d_130371375.d_496823485.d_438636757, d_130371375.d_650465111.d_438636757,d_765336427,d_479278368,d_826240317,d_693626233,d_104278817,d_744604255, d_268665918,d_592227431,d_399159511,d_231676651,d_996038075,d_506826178,d_524352591.d_524352591, d_524352591.d_902332801, d_524352591.d_902332801,d_299274441.d_457532784,d_773707518,d_577794331,d_883668444 FROM nih-nci-dceg-connect-dev.Connect.participants WHERE Connect_ID= 236168927", right?

brotzmanmj commented 1 year ago

@jhflorey can you comment on that? are these the stub record variables?

kmazzilli commented 1 year ago

hi everyone - Michelle and I checked out the biospecimen dashboard for connect_id: 2361618927 and everything looked correct - we could see the Box information and under the participation search only the stub variables and a red x under the status

jeannewu commented 1 year ago

@jhflorey @kmazzilli @brotzmanmj @Davinkjohnson Thank you very much, @Davinkjohnson. I had a typo in the Connect_ID, which caused a lot of confusion. All the data on this Connect_ID have been correctly removed from the participant table. I will double-check them in the BQ tables again tomorrow.
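A small guard could catch this kind of typo before a query is run. The sketch below assumes that a Connect_ID always has exactly ten digits, which holds for every full Connect_ID quoted in this thread but is not confirmed as a general rule; the function name is hypothetical.

```python
import re

# Assumption: a valid Connect_ID consists of exactly ten digits.
_CONNECT_ID = re.compile(r"\d{10}")


def is_plausible_connect_id(value: str) -> bool:
    """Return True when the value looks like a ten-digit Connect_ID."""
    return _CONNECT_ID.fullmatch(value.strip()) is not None


ok = is_plausible_connect_id("2361618927")    # the intended ID
typo = is_plausible_connect_id("236168927")   # the mistyped ID, one digit short
```

Here the intended ID passes the check and the mistyped one does not, so an empty query result would not be mistaken for deleted data.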