In this tutorial we will use PySyft to study heart disease, and by doing so we will try to answer the following question:
Can we run Machine Learning experiments on multiple and distributed medical datasets, without seeing the data?
We are going to learn how! All you need to get started is PySyft and a Jupyter notebook!
Download the tutorial repository using the git command from the terminal:
$ git clone https://github.com/openmined/syft-heart-disease-tutorial
or by clicking on Code >> Local >> Download ZIP on the repository main page.
The repository includes a requirements.txt file listing all the Python packages required to work with the notebooks.
You can install all these dependencies using pip:
$ pip install -r requirements.txt
Please refer to the Quick Install guide to learn how to install PySyft.
Note: It is recommended to install PySyft and all the dependencies within a dedicated Python virtual environment (using the virtual environment manager of your choice, e.g. Miniconda, pyenv).
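After installing, it can help to confirm that the active environment actually sees the packages. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_packages` is hypothetical, and the list of names to check is an assumption (only `syft`, the import name of PySyft, is shown here; extend it with the import names of the packages in requirements.txt).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumption: "syft" is the only package checked here; add the import names
# of the other dependencies listed in requirements.txt as needed.
missing = missing_packages(["syft"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All checked packages are available.")
```

If a package is reported as missing, the most common cause is that pip installed it into a different environment than the one running the notebook.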
Set up and launch the PySyft Datasites using the launch_datasites.py script included in the repository. From the command line:
$ python launch_datasites.py
Note: Please keep the terminal open, as this keeps all the servers running in the background. You can stop all the servers and terminate the program by pressing Ctrl+C.
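To check from a notebook or a second terminal that the servers are up, one option is a plain TCP connection test. This is a rough sketch that does not use PySyft at all; the helper `is_listening` is hypothetical, and the port numbers are placeholders, not the ports the script actually uses. Replace them with the ports your Datasites are running on.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Placeholder ports for illustration only: substitute the real ones.
DATASITE_PORTS = [8081, 8082, 8083, 8084]

for port in DATASITE_PORTS:
    status = "up" if is_listening("localhost", port) else "not reachable"
    print(f"localhost:{port} -> {status}")
```

A successful connection only shows that something is listening on that port; it does not confirm that the process behind it is a PySyft Datasite.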
4x more medical data in training.
We will use the full version of the Heart Disease dataset, as available on UCI ML.
This database is the result of a study for the diagnosis of coronary artery disease, as presented in this paper.
The full dataset contains the data as collected from patients in four different hospitals, in 1988:
Each hospital will correspond to a single PySyft Datasite, hosting its own version of the Heart Study Data.
This dataset is popular and well known in the data science and machine learning community. However, only the Cleveland database has been used by ML researchers to date [1]. The "target" field refers to the presence of heart disease in the patient. It is integer-valued, from 0 (no presence) to 4. In our machine learning experiments we will treat this as a binary (presence vs absence) classification problem.
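The binary framing described above can be sketched in a few lines of plain Python. This is an illustration, not the tutorial's actual preprocessing code: the function name `binarize_target` is hypothetical, the sample values are made up, and it assumes that every non-zero value (1 to 4) counts as presence.

```python
def binarize_target(value: int) -> int:
    """Map the original 0-4 target to 0 (absence) or 1 (presence).

    Assumption: any value from 1 to 4 is treated as presence of heart disease.
    """
    if value not in range(5):
        raise ValueError(f"target must be an integer in 0..4, got {value!r}")
    return 1 if value > 0 else 0


# Made-up target values, for illustration only.
targets = [0, 2, 1, 0, 4, 3]
print([binarize_target(t) for t in targets])  # [0, 1, 1, 0, 1, 1]
```

Rejecting values outside 0 to 4 makes unexpected entries in the target column fail loudly instead of being silently mapped to a class.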
The authors of the dataset have requested that any use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
If you spot any error or mistake, please feel free to reach out to me directly via email, or open an Issue on the repository.
Any feedback will be very much appreciated! Thank you!
For any technical question, clarification, or request for assistance with PySyft, please consider joining the OpenMined Slack and posting your question in the #support channel.
Author: Valerio Maggio (@leriomaggio), Researcher, SSI Fellow, and Education Team @ OpenMined.
All the code material is distributed under the terms of the Apache License. See the LICENSE file for additional details.
All the instructional materials in this repository are free to use, and made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.
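You are free to:
Share --- copy and redistribute the material in any medium or format.
Adapt --- remix, transform, and build upon the material
for any purpose, even commercially.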
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution --- You must give appropriate credit, provide a link to the license, and indicate if changes were made.
You may do so in any reasonable manner, but not in any way that suggests the
licensor endorses you or your use.
No additional restrictions --- You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.