# swarm-learning-hpe

Experimental repository for the ODELIA project, based on the HPE Swarm Learning platform. This repo contains multiple models for histopathology and radiology training.


Swarm learning based on the HPE platform; experiments were performed with HPE Swarm Learning version 2.2.0.

This repository contains:

  1. Swarm Learning for histopathology and radiology image analysis.
  2. A workflow to help keep track of what is in progress.
  3. An issue section where people can share ideas and raise questions encountered while using this repo.
  4. A working version of marugoto_mri for the attention-MIL-based model, originally suited to histopathology images but modified by Marta to work with MRI images.
  5. A working version of odelia-breast-mri for the 3D-CNN model by @Gustav.

## Table of Contents

- [Background](#background)
- [Install](#install)
- [Usage](#usage)
- [Workflow](#workflow)
- [Node list](#node-list)
- [Models implemented](#models-implemented)
- [Maintainers](#maintainers)
- [Milestone](#milestone)
- [NotionPage](#notionpage)
- [Contributing](#contributing)
- [Credits](#credits)
- [License](#license)

## Background

### Brief description of the HPE platform

A course on Swarm Learning, explained in a generally accessible way: https://learn.software.hpe.com/swarm-learning-essentials

HPE Swarm Learning extends the concept of federated learning to decentralized learning by adding functionality that obviates the need for a central leader. It combines the use of AI, edge computing, and blockchain.

HPE Swarm Learning is a decentralized, privacy-preserving Machine Learning (ML) framework. The Swarm Learning framework uses the computing power at, or near, the distributed data sources to run the ML algorithms that train the models. It uses the security of a blockchain platform to share learnings with peers safely and securely. In Swarm Learning, training of the model occurs at the edge, where data is most recent and where prompt, data-driven decisions are most necessary. In this decentralized architecture, only the insights learned are shared with the collaborating ML peers, not the raw data. This tremendously enhances data security and privacy.

The following image provides an overview of the Swarm Learning framework. Only the model parameters (learnings) are exchanged between the various edge nodes, not the raw data. This ensures that the privacy of data is preserved.

![Overview of the Swarm Learning framework](img.png)

This is the Swarm Learning node structure:

![Swarm Learning node structure](sl_node_structure.png)
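To make the parameter-exchange idea concrete, here is a minimal conceptual sketch of peer-weight merging in PyTorch. This is only an illustration of the principle, not the HPE Swarm Learning API; the function and variable names are hypothetical.

```python
# Conceptual sketch only; NOT the HPE Swarm Learning API.
# Each peer trains locally on its own data, then only the model
# parameters are merged across peers; raw data never leaves a node.
import torch

def merge_peer_parameters(peer_state_dicts: list[dict]) -> dict:
    """Average the parameters of several locally trained models."""
    merged = {}
    for name in peer_state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in peer_state_dicts]
        ).mean(dim=0)
    return merged

# Each node would load the merged weights and continue local training:
# model.load_state_dict(merge_peer_parameters([sd_tud, sd_uka, sd_vhio]))
```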

## Install

### Prerequisites

- Hardware recommendations
- Operating System

### Upgrade the Swarm Learning Environment from an Older Version

  1. Run the following command to upgrade the Swarm Learning Environment from 1.x.x to 2.x.x:

    sh workspace/automate_scripts/server_setup/cleanup_old_sl.sh

    Then proceed to step 1 (Prerequisite) in [Setting up the Swarm Learning Environment](#setting-up-the-swarm-learning-environment).

### Setting up the user and repository

  1. Create a user named "swarm" and add it to the sudoers group, then log in as "swarm":

    sudo adduser swarm
    sudo usermod -aG sudo swarm
    sudo su - swarm
  2. Add the "swarm" user to the docker group:

    sudo usermod -aG docker swarm

    After running this command, you will need to log out and log back in for the changes to take effect, or you can use the newgrp command like so:

    newgrp docker
  3. Run the following commands to set up the repository:

cd / && sudo mkdir -p /opt/hpe && cd /opt/hpe && sudo chmod -R 777 /opt/hpe
git clone https://github.com/KatherLab/swarm-learning-hpe.git && cd swarm-learning-hpe
  4. Install the CUDA environment and NVIDIA drivers. As soon as the following command produces correct output, you may proceed:

    nvidia-smi

    Please disable Secure Boot. On some systems, Secure Boot can prevent unsigned kernel modules (like NVIDIA's) from loading.

    • Check loaded kernel modules. To see whether the NVIDIA kernel module is loaded:

      lsmod | grep nvidia

    • Review system logs. System logs can provide insight into issues with the GPU or driver:

      dmesg | grep -i nvidia

    • Manually load the NVIDIA kernel module with the modprobe command:

      sudo modprobe nvidia

    Requirements and dependencies will be installed automatically by the script described in the following section.

### Setting up the Swarm Learning Environment

PLEASE REPLACE EACH `<PLACEHOLDER>` WITH THE CORRESPONDING VALUE!

`<sentinel_ip>` = 172.24.4.67. This is currently the IP assigned by the VPN server to the TUD host.

`<host_index>` = your institute's name. For the ODELIA project it should be chosen from: TUD, Ribera, VHIO, Radboud, UKA, Utrecht, Mitera, Cambridge, Zurich.

`<workspace_name>` = the name of the workspace you want to work on. You can find the available modules under the workspace/ folder; each module corresponds to a radiology model. Currently we suggest using odelia-breast-mri or marugoto_mri here.

Please only proceed to the next step after observing "... is done successfully" in the log.

  1. Optional: download preprocessed datasets. Please refer to the Data Preparation section for more details.

  2. Prerequisite: Runs scripts that check for required software and open/exposed ports.

    sh workspace/automate_scripts/automate.sh -a
  3. Server setup: Runs scripts that set up the swarm learning environment on a server.

    sh workspace/automate_scripts/automate.sh -b -s <sentinel_ip> -d <host_index>
  4. Final setup: Runs scripts that finalize the setup of the swarm learning environment. The `<workspace_name>`, `<sentinel_ip>`, and `<host_index>` values are required; the `[-n num_peers]` and `[-e num_epochs]` flags are optional.

    sh workspace/automate_scripts/automate.sh -c -w <workspace_name> -s <sentinel_ip> -d <host_index> [-n num_peers] [-e num_epochs]
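For illustration, assuming the UKA host sets up the `odelia-breast-mri` workspace against the TUD sentinel (example values taken from the placeholder list above; substitute your own), the final setup call would look like `sh workspace/automate_scripts/automate.sh -c -w odelia-breast-mri -s 172.24.4.67 -d UKA`.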

Optional step 5: reconnect to the VPN.

sh workspace/automate_scripts/server_setup/setup_vpntunnel.sh

If your machine was restarted or lost the VPN connection, here is the guide to reconnect: VPN connect guide. The `.ovpn` file is the config file that TUD assigned to you.

If a problem is encountered, please follow this README.md step by step. Specific instructions are given on how to run the commands. All processes are automated, so you can simply run the commands above and wait for them to finish.

If any problem occurs, please first try to figure out which step went wrong, search for solutions online, and check Troubleshooting.md. Then contact the maintainer of the Swarm Learning Environment and document the error in the Troubleshooting.md file.

## Usage

### Ensuring Dataset Structure

To ensure proper organization of your dataset, please follow the steps outlined below:

  1. Directory Location

    Place your dataset under the specified path:

    /workspace/odelia-breast-mri/user/data-and-scratch/data

    Within this path, create a folder named `multi_ext`. Your directory structure should then resemble:

    /opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri/user/data-and-scratch/data
    └── multi_ext
        ├── datasheet.csv                            # Your clinical tabular data
        ├── test                                     # External validation dataset
        ├── train_val                                # Your own site training data
        └── segmentation_metadata_unilateral.csv     # External validation table

  2. Data Organization

Inside the train_val or test directories, place folders that directly contain NIfTI files. The folders should be named according to the following convention:

    <patient_id>_right
    <patient_id>_left

Here, `<patient_id>` should correspond to the patient ID in your tables (`datasheet.csv` and `segmentation_metadata_unilateral.csv`). This convention links the imaging data with the respective clinical information efficiently.

#### Summary

- **Step 1:** Ensure your dataset is placed within `/workspace/odelia-breast-mri/user/data-and-scratch/data/multi_ext`.
- **Step 2:** Organize your clinical tabular data, external validation dataset, your own site training data, and external validation table as described.
- **Step 3:** Name folders within `train_val` and `test` as `<patient_id>_right` or `<patient_id>_left`, matching the patient IDs in your datasheets.

Following these steps will keep the dataset well organized, easing data management and processing in your projects. A layout validation sketch is given at the end of this section.

### Running Swarm Learning Nodes

Run the nodes in this order: Swarm Network (SN) node -> Swarm SWOP node -> Swarm SWCI node. Please open a separate terminal for each node. Observe the following commands:

#### SN

- To run a Swarm Network (or sentinel) node:

```sh
./workspace/automate_scripts/launch_sl/run_sn.sh -s <sentinel_ip> -d <host_index>
```

or

```sh
runsn
```

#### SWOP

- To run a Swarm SWOP node:

```sh
./workspace/automate_scripts/launch_sl/run_swop.sh -w <workspace_name> -s <sentinel_ip> -d <host_index>
```

or

```sh
runswop
```

#### SWCI

- To run a Swarm SWCI node (the SWCI node is used to generate training task runners; it can be initiated by any host, but currently we suggest that only the sentinel host initiates it):

```sh
./workspace/automate_scripts/launch_sl/run_swci.sh -w <workspace_name> -s <sentinel_ip> -d <host_index>
```

or

```sh
runswci
```

- To check the logs from training:

```sh
./workspace/automate_scripts/launch_sl/check_latest_log.sh
```

or

```sh
cklog [--ml] [--swci] [--swop] [--sn]
```

- To stop the Swarm Learning nodes (`--[node_type]` is optional; if not specified, all nodes will be stopped; otherwise specify e.g. `--sn` or `--swop`):

```sh
./workspace/swarm_learning_scripts/stop-swarm --[node_type]
```

or

```sh
stopswarm [--node_type]
```

- To view results, see the logs under `workspace/<workspace_name>/user/data-and-scratch/scratch`.

Please observe [Troubleshooting.md](Troubleshooting.md) section 10 for the logs of successfully running SWOP and SN nodes. The processes keep running and displaying logs, so you will need to open a new terminal to run other commands.

* Typical run time is around 3 hours for experiments trained on the DUKE dataset with ResNet50-3D and three nodes involved.
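Before launching the nodes, you can sanity-check the dataset layout described above. The following is a minimal, hypothetical sketch (not part of this repo): it assumes the datasheet's patient-ID column is named `patient_id`; adjust the column name and paths to match your site.

```python
# Hypothetical layout check for the multi_ext structure described above.
# Assumption: datasheet.csv has a patient-ID column named "patient_id";
# adjust it to your actual datasheet.
import csv
from pathlib import Path

DATA_ROOT = Path("/opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri"
                 "/user/data-and-scratch/data/multi_ext")

def check_layout(id_column: str = "patient_id") -> None:
    with open(DATA_ROOT / "datasheet.csv", newline="") as f:
        known_ids = {row[id_column] for row in csv.DictReader(f)}
    for split in ("train_val", "test"):
        for folder in sorted((DATA_ROOT / split).iterdir()):
            if not folder.is_dir():
                continue
            # Folder names must end in _right or _left ...
            if not folder.name.endswith(("_right", "_left")):
                print(f"bad folder name: {split}/{folder.name}")
                continue
            # ... and the prefix must match a patient ID from the datasheet.
            if folder.name.rsplit("_", 1)[0] not in known_ids:
                print(f"unknown patient ID: {split}/{folder.name}")
            # Each folder should directly contain NIfTI files.
            if not any(p.name.endswith((".nii", ".nii.gz"))
                       for p in folder.iterdir()):
                print(f"no NIfTI files in: {split}/{folder.name}")

if __name__ == "__main__":
    check_layout()
```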
## Workflow

![Workflow.png](assets%2FWorkflow.png)
![Swarm model training protocol .png](assets%2FSwarm%20model%20training%20protocol%20.png)

## Node list

Nodes will be added to the VPN and will be able to communicate with each other after setting up the Swarm Learning Environment as described in [Install](#install).

| Project | Node Name | Location | Hostname | Data | Maintainer |
| ------- | --------- | -------- | -------- | ---- | ---------- |
| Sentinel node | TUD | Dresden, Germany | swarm | | [@Jeff](https://github.com/Ultimate-Storm) |
| ODELIA | VHIO | Madrid, Spain | radiomics | | [@Adrià](adriamarcos@vhio.net) |
| | UKA | Aachen, Germany | swarm | | [@Gustav](gumueller@ukaachen.de) |
| | RADBOUD | Nijmegen, Netherlands | swarm | | [@Tianyu](t.zhang@nki.nl) |
| | MITERA | Paul, Greece | | | |
| | RIBERA | Lopez, Spain | | | |
| | UTRECHT | | | | |
| | CAMBRIDGE | Nick, Britain | | | |
| | ZURICH | Sreenath, Switzerland | | | |
| SWAG | | | swarm | | |
| DECADE | | | swarm | | |
| Other nodes | UCHICAGO | Chicago, USA | swarm | | [@Sid](Siddhi.Ramesh@uchospitals.edu) |

## Models implemented

TUD benchmarking on the Duke breast MRI dataset:

![TUD experiments result.png](assets%2FTUD%20experiments%20result.png)

Report: [Swarm learning report.pdf](assets%2FSwarm%20learning%20report.pdf)

## Maintainers

TUD Swarm learning team [@Jeff](https://github.com/Ultimate-Storm).

Want 24-hour support? Configure your TeamViewer with the following steps and contact me through Slack. Thanks [@Adrià](adriamarcos@vhio.net) for the instructions.

1. Enable remote control in the Ubuntu settings ![ubuntu_remote_control.png](assets%2Fubuntu_remote_control.png)
2. Install TeamViewer and log in with username `adriamarcos@vhio.net` and password `2wuHih4qC5tEREM`
3. Add the computer to the account so that it can be controlled. You have it here: [link](https://community.teamviewer.com/English/kb/articles/4464-assign-a-device-to-your-accountuthorize) ![TV add device.png](assets%2FTV%20add%20device.png)
4. I'd advise you to set the computer to never enter sleep mode or darken the screen, just in case. Also, if you want to use different users, remember this has to be done for all of them, and the TeamViewer session needs to be signed in on all of them as well.

## Milestone

See this [link](https://github.com/KatherLab/swarm-learning-hpe/milestones)

## NotionPage

See this [link](https://www.notion.so/SWARM-Learning-87a7b920c88e445d81420573afb0e8ab)

## Contributing

Feel free to dive in! [Open an issue](https://github.com/KatherLab/swarm-learning-hpe/issues) or submit PRs.

Before creating a pull request, please take some time to look at our [wiki page](https://github.com/KatherLab/swarm-learning-hpe/wiki) to ensure good code quality and sufficient documentation. You don't need to follow all of the guidelines at this point, but it would be really helpful!

### Contributors

This project exists thanks to all the people who contribute.
[@Oliver](https://github.com/oliversaldanha25)
[@Kevin](https://github.com/kevinxpfeiffer)

## Credits

This project uses the platform from the following repository:

- [HewlettPackard/swarm-learning](https://github.com/HewlettPackard/swarm-learning): Created by [HewlettPackard](https://github.com/HewlettPackard)

## License

[MIT](LICENSE)