OpenMined / syft-heart-disease-tutorial

Apache License 2.0

Study Heart Disease using PySyft

Welcome!

In this tutorial we will use PySyft to study heart disease, and by doing so we will try to answer the following question:

Can we run Machine Learning experiments on multiple and distributed medical datasets, without seeing the data?

We are going to learn how! All you need to get started is PySyft and a Jupyter notebook! 🚀

Related posts on OpenMined Blog:

  1. Need more medical data? A Python package and an email is all you need!.

  2. Federated Learning in 10 lines of Code, with PySyft.

Table of Content

Getting Started

1. Download the code locally

Using the git command from the terminal:

$ git clone https://github.com/openmined/syft-heart-disease-tutorial

or by clicking on Code >> Local >> Download ZIP on the repository main page.

2. Install PySyft and ML Packages

The repository includes a requirements.txt file with the list of all the Python packages required to work with the notebooks. You can install all these dependencies using pip:

$ pip install -r requirements.txt

Please refer to the Quick Install guide to learn how to install PySyft.

Note: It is recommended to install PySyft and all its dependencies in a dedicated Python virtual environment, using the environment manager of your choice (e.g. Miniconda, pyenv).

3. Launch the Datasites

Set up and launch the PySyft Datasites using the launch_datasites.py script included in the repository. From the command line:

$ python launch_datasites.py

Note: Please keep the terminal open, as this keeps all the servers running in the background. You can stop all the servers and terminate the program by pressing Ctrl+C.
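If you want to confirm that the servers are up before opening the notebooks, a small reachability check can help. The following is only a sketch using the Python standard library: `is_port_open` is a hypothetical helper, and the port numbers in the example are placeholders (assumptions, not values taken from this repository), so replace them with the ports your Datasites actually listen on.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError or a
        # timeout) when nothing is accepting connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder ports: these are NOT taken from launch_datasites.py.
    hypothetical_ports = (8081, 8082, 8083, 8084)
    for port in hypothetical_ports:
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"localhost:{port} is {status}")
```

A successful TCP connection only shows that something is listening on the port. It does not prove that the process is a PySyft Datasite, so treat it as a quick sanity check.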


Data Description

We will use the full version of the Heart Disease dataset, as available on UCI ML.

This database is the result of a study for the diagnosis of coronary artery disease, as presented in this paper.

The full dataset contains the data collected from patients in four different hospitals, in 1988:

Each hospital will correspond to a single PySyft Datasite, hosting its own version of the Heart Study Data.
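To give an intuition for what "one Datasite per hospital" means for an analysis, here is a toy sketch in plain Python. It does not use the PySyft API: `ToySite` and `pooled_mean` are hypothetical names, and the numbers are made up for illustration, not taken from the Heart Disease dataset. The point is that each site keeps its rows to itself and only returns aggregates, which the researcher then combines.

```python
from typing import Iterable, List, Tuple


class ToySite:
    """Stand-in for one hospital: rows stay inside, only aggregates come out."""

    def __init__(self, name: str, ages: List[float]) -> None:
        self.name = name
        self._ages = list(ages)  # private by convention; never returned as-is

    def count_and_sum(self) -> Tuple[int, float]:
        """Return the number of rows and the sum of the values, not the rows."""
        return len(self._ages), float(sum(self._ages))


def pooled_mean(sites: Iterable[ToySite]) -> float:
    """Combine per-site (count, sum) pairs into one mean over all sites."""
    total_count = 0
    total_sum = 0.0
    for site in sites:
        count, partial_sum = site.count_and_sum()
        total_count += count
        total_sum += partial_sum
    if total_count == 0:
        raise ValueError("no data available across sites")
    return total_sum / total_count


if __name__ == "__main__":
    # Made-up values, used only to exercise the functions above.
    sites = [ToySite("site_a", [50, 60]), ToySite("site_b", [40])]
    print(pooled_mean(sites))
```

PySyft adds much more on top of this idea (for example, reviewing and approving the code that runs on the data), so read the sketch as motivation rather than as a description of how the library works internally.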

Notes

This dataset is quite popular and well known in the data science and machine learning community. However, the Cleveland database is the only one that has actually been used by ML researchers to date [1].

The "target" field refers to the presence of heart disease in the patient. It is integer valued, from 0 (no presence) to 4. In our machine learning experiments we will treat this as a binary (presence vs. absence) classification problem.
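The binary recoding of the target can be sketched in a few lines of plain Python. This is only an illustration under the description above (0 means no presence, 1 to 4 mean presence): `binarize_target` is a hypothetical helper name, not a function from PySyft or from the tutorial notebooks.

```python
from typing import Iterable, List


def binarize_target(values: Iterable[int]) -> List[int]:
    """Map the 0-4 target to 0 (absence) or 1 (presence of heart disease)."""
    binary = []
    for value in values:
        level = int(value)
        if not 0 <= level <= 4:
            # The dataset description only defines the values 0 to 4.
            raise ValueError(f"unexpected target value: {value!r}")
        binary.append(1 if level > 0 else 0)
    return binary


if __name__ == "__main__":
    print(binarize_target([0, 1, 2, 3, 4]))
```

Collapsing the values 1 to 4 into a single class discards the graded information in the original target, which is the trade-off accepted when framing the task as presence vs. absence.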

Acknowledgments

The authors of the dataset have requested that any use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:

Feedback and Support

If you spot any error or mistake, please feel free to reach out directly to me via email, or to open an Issue on the repository.

Any feedback will be very much appreciated! Thank you! 🙏

Any question about PySyft?

For any technical question, clarification, or request for assistance with PySyft, please consider joining the OpenMined Slack and posting your question in the #support channel.

Colophon

Author: Valerio Maggio (@leriomaggio), Researcher, SSI Fellow, and Education Team @ OpenMined.

All the code material is distributed under the terms of the Apache License. See the LICENSE file for additional details.

All the instructional materials in this repository are free to use, and made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.

You are free:

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms: