Process for internally exposing real patient data

KCB13 commented 7 years ago

[ ] Request a SQL server (where real patient data will live) - School of Medicine

@empfff - please update issue if information is not accurate

empfff commented 7 years ago

Request made; solution will be much more complicated than just getting a SQL Server, though. Will need to update group at next meeting rather than attempting to describe here.

KCB13 commented 7 years ago

@StanAhalt decision needed to move forward with this issue.

StanAhalt commented 7 years ago

I am not clear on what needs to be decided. Seems like Emily needs a more fulsome discussion, correct?

KCB13 commented 7 years ago

@empfff @stevencox I think this issue can be closed and refer to #90. Agree?

stevencox commented 7 years ago

Hi Kira,

I think we want both -

#90 is about deploying SAML security.

#77 is more about a new set of machines and a network design Emily and team are deploying.

So there's an effort needed to deploy services within the new set of machines Emily's procured.

And a separate effort to build a prototype SAML interface for APIs.

Emily, let me know if you see that differently.

Thanks,

Steve

From: Kira Bradford notifications@github.com Sent: Monday, September 18, 2017 9:28 AM To: ResearchSoftwareInstitute/greendatatranslator Cc: Steven Cox; Mention Subject: Re: [ResearchSoftwareInstitute/greendatatranslator] Process for internally exposing real patient data (#77)


KCB13 commented 7 years ago

Meeting on October 19 with Jan Werner, @jameschump, @xu-hao, @stevencox, @rayi113. We discussed three primary goals:

- Serving HuSH+ data via an API [current state]
- Computing endotype machine learning models with real data
- Serving real patient data via an API

And developed a notional architecture extending Emily's.

- Convert Tweetsie to a computation node.
- Create a second virtual machine to host the current HuSH+ API.
- Develop the real patient data architecture adding SAML authentication and a second virtual machine.

Observations:

- Separation of function is one security consideration driving the use of separate machines.
- Tweetsie is overpowered for API serving
- With more disk, it could be useful for real patient data computation
- The timeframe for the activities below is post-hackathon

@empfff Emily, how does this look to you? And what should be our next steps before taking a plan to Ken?

Additional Info: Status Quo

Tweetsie's current configuration looks like this:

Basic

- Serial #: 91GKSW1
- OS: CentOS 7
- 256G RAM
- 32 cores

Services

- Clinical API endpoint
- apache
- tomcat
- postgres
- mariadb - not used

Storage

Current data for generating ML models:

- Exposures: 250G
- Clinical: 30G (at 16K patients, asthma only)

Future needs

- Exposures: current x 5
- Clinical: current x 10 (at 160K patients, asthma only)
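As a quick check on the storage figures above, here is a minimal sketch of the projection arithmetic. The sizes and multipliers come from the status quo and future needs lists; the dictionary and function names are made up for this illustration.

```python
# Current data sizes in GB, from the status-quo figures above.
CURRENT_GB = {"exposures": 250, "clinical": 30}

# Growth multipliers from the "Future needs" figures above.
GROWTH = {"exposures": 5, "clinical": 10}


def projected_storage_gb(current, growth):
    """Multiply each current size by its growth factor."""
    return {name: size * growth[name] for name, size in current.items()}


projected = projected_storage_gb(CURRENT_GB, GROWTH)
total_gb = sum(projected.values())

print(projected)  # {'exposures': 1250, 'clinical': 300}
print(total_gb)   # 1550
```

This covers only the raw data for generating ML models; it does not account for working space, indexes, or backups, which would need to be estimated separately.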

Draft Proposal

HuSH: Create a new virtual machine to serve the current HuSH+ and later Real data (app)

- Emily: Do we want to run this on the med center's virtual machine infrastructure?
- Specs like these should work: (James note I bumped down cores as VMs tend to perform poorly with many)
  - RAM: 16GB
  - CPU: 4
  - Storage: 4TB of disk
  - OS: CentOS7
  - Firewall: Open port 443 (SSL) and 22 (SSH)
  - IP Restriction: Restrict access to requests from translator.ncats.io (as with Tweetsie)

Compute: Make Tweetsie a Compute Node

- Design a usage protocol describing
  - How users will interact with the machine
  - How much time will be spent on the machine
  - Data egress and access (what kinds of data and by what mechanisms)
- Submit this to Emily, Ken Langley, et al
- Upgrade Tweetsie Hardware
  - Install 2 x 6TB @ RAID-1 disks
- Restrict access by IP address to Hao and Kimberly
- Eliminate outbound internet connectivity
- Use scp for file transfer, ssh for shell access. Only port 22 should be open
- Request two factor authentication via Duo and ssh on that machine

Real: Develop an architecture to serve real data

- Create a second virtual machine (db). Again, Emily, do we want to run this at the med center?
- Secure appropriate approval from security team
- Configuration
  - Employ same controls as on first virtual machine:
    - Control access via IP Address
    - Disk encryption
  - Open ports required for 5432 (Postgres) and 22 (SSH)
- Services
  - On db: Postgres
  - On app: A new instance of the J2EE Spring clinical application, pointing at db's Postgres instance.
  - On app: Create a new virtual host for serving real data

Configure SAML

- Onboard contractor in November
- Configure the SAML SP
- Connect the SP to the test UNC IdP
- Develop a programmatic prototype of invoking the clinical API via the SP
- Validate the SAML pipeline against the HuSH+ endpoint for testing purposes
- Move SAML SP configuration to final destination (app or a separate machine)
- Obtain approval to test from an external IP Address
- Test
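The open-port rules in the draft proposal can be written down as a small lookup, which may be handy when reviewing the plan with the security team. This is only a sketch: the host labels `app`, `tweetsie`, and `db`, the function names, and the observed ports in the usage example are assumptions made for illustration. It does not model the IP restrictions or the outbound-connectivity rule.

```python
# Inbound ports the draft proposal says should be open on each machine.
# Host labels are chosen for this sketch, not actual hostnames.
ALLOWED_PORTS = {
    "app": {443, 22},      # SSL for the API, SSH for administration
    "tweetsie": {22},      # compute node: scp/ssh only
    "db": {5432, 22},      # Postgres plus SSH
}


def is_port_allowed(host: str, port: int) -> bool:
    """Return True if the proposal lists this port as open on the host."""
    return port in ALLOWED_PORTS.get(host, set())


def unexpected_ports(host: str, observed: set) -> list:
    """Return observed open ports that the proposal does not allow, sorted."""
    return sorted(observed - ALLOWED_PORTS.get(host, set()))


# Hypothetical scan result for Tweetsie after conversion to a compute node.
print(unexpected_ports("tweetsie", {22, 80, 5432}))  # [80, 5432]
```

Any port returned by `unexpected_ports` would be a candidate for closing, or for an explicit exception in the usage protocol.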

stevencox commented 7 years ago

The formatting was the best part of that email and now it's gone. Please see the attached doc for easier reading.

From: Kira Bradford notifications@github.com Sent: Friday, October 20, 2017 9:07 AM To: ResearchSoftwareInstitute/greendatatranslator Cc: Steven Cox; Mention Subject: Re: [ResearchSoftwareInstitute/greendatatranslator] Process for internally exposing real patient data (#77)

