KatherLab / swarm-learning-hpe

Experimental repo for Odelia project based on HPE platform. This repo contains multiple models for histopathology and radiology training.
MIT License

send error issue for sl node #25

Closed by Ultimate-Storm 1 year ago

Ultimate-Storm commented 1 year ago

The SN node is up and running:

eef04c6908d647aa2d3f21063f2f7ec71bff987be8e0110b7087db583c91660f
######################################################################
##                    HPE SWARM LEARNING SN NODE                    ##
######################################################################
## © Copyright 2019-2022 Hewlett Packard Enterprise Development LP  ##
######################################################################
2023-03-02 12:16:21,395 : swarm.blCnt : INFO : Setting up blockchain layer for the swarm node: START
2023-03-02 12:16:22,639 : swarm.blCnt : INFO : Creating Autopass License Provider
2023-03-02 12:16:23,228 : swarm.blCnt : INFO : Creating license server
2023-03-02 12:16:23,228 : swarm.blCnt : INFO : Setting license servers
2023-03-02 12:16:23,288 : swarm.blCnt : INFO : Acquiring floating license 1100000380:1
2023-03-02 12:16:35,702 : swarm.SN : INFO : SMLETHNode: Starting GETH ... 
2023-03-02 12:16:45,759 : swarm.SN : WARNING : SMLETHNode: Enode list is empty: Node is standalone
2023-03-02 12:19:01,043 : swarm.SN : INFO : SMLETHNode: Started I-am-Alive thread
2023-03-02 12:19:01,043 : swarm.blCnt : INFO : Setting up blockchain layer for the swarm node: FINISHED
2023-03-02 12:19:01,675 : swarm.blCnt : INFO : Starting SWARM-API-SERVER on port: 30304

The SL node is then started with this script:

#!/bin/bash

set -euo pipefail

# Get the directory containing this script
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

# Remove any stopped containers
docker rm $(docker ps --filter status=exited -q) || true

# Define a help function
show_help() {
  echo "Usage: $(basename "$0") [-w WORKSPACE]"
  echo ""
  echo "Launch a swarm learning container"
  echo ""
  echo "Options:"
  echo "  -w WORKSPACE    The name of the workspace directory to use (default: none)"
  echo "  -h              Show this help message"
  echo ""
  exit 0
}

# Process command line options
while getopts ":w:h" opt; do
  case ${opt} in
    w ) workspace=${OPTARG} ;;
    h ) show_help ;;
    \? ) show_help ;;
    : ) echo "Option -$OPTARG requires an argument." >&2; exit 1 ;;
  esac
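done
# Fail early with a clear message if -w was omitted ("set -u" would abort later anyway)
: "${workspace:?Option -w WORKSPACE is required}"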
ip_addr=$(ip addr show tun0 | grep 'inet ' | awk '{print $2}' | cut -f1 -d'/')

# Launch the swarm learning container with the specified options
"$script_dir"/../../swarm_learning_scripts/run-sl \
  --name=sl1 \
  --host-ip="$ip_addr" \
  --sn-ip="$ip_addr" \
  --sn-api-port=30304 \
  --sl-fs-port=16000 \
  --key=/opt/hpe/swarm-learning-hpe/cert/sl-TUD-key.pem \
  --cert=/opt/hpe/swarm-learning-hpe/cert/sl-TUD-cert.pem \
  --capath=/opt/hpe/swarm-learning-hpe/cert/ca/capath \
  --ml-it \
  --ml-image=user-env-marugoto-swop \
  --ml-name=ml1 \
  --ml-w=/tmp/test \
  --ml-entrypoint=python3 \
  --ml-cmd=model/main.py \
  --ml-v=workspace/"$workspace"/model:/tmp/test/model \
  --ml-e MODEL_DIR=model \
  --ml-e MAX_EPOCHS=5 \
  --ml-e MIN_PEERS=2 \
  --ml-e https_proxy= \
  --apls-ip="$ip_addr"

user-env-marugoto-swop is the Docker image that was created earlier when running the SWCI tasks. The SL node fails with the following error message:

b1e02b74e9a5
c024d14cc58e
f6b4be54f238198f8445ae3b82d8471ce9bb2028f5ad2c18ac01b6f43920ba3b
9b4cd1e1b9142dff749c7f44571f676194e9d2b92a5cc82135c816d069282944
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-0il7218r because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
/tmp/test/model/mil/helpers.py:56: FutureWarning: this interface is deprecated and will be removed in the future.  For training from the command line, please use `marugoto.mil.train`.
  warn(
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/platform/scratch/2023_03_02_123111_40-30-10-20_swarm_learning'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/platform/scratch'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/test/model/main.py", line 33, in <module>
    train_categorical_model_(
  File "/tmp/test/model/mil/helpers.py", line 63, in train_categorical_model_
    output_path.mkdir(exist_ok=True, parents=True)
  File "/opt/conda/lib/python3.9/pathlib.py", line 1327, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/opt/conda/lib/python3.9/pathlib.py", line 1327, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/opt/conda/lib/python3.9/pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: '/platform'

Ultimate-Storm commented 1 year ago

Fixed with https://github.com/KatherLab/swarm-learning-hpe/blob/1453a6a20f5c261a62560c875d15f41a241ca26e/workspace/automate_scripts/launch_sl/run_sl.sh#L78
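
For readers who hit the same traceback, here is a minimal sketch of the failure mode, not the fix at the linked line (that line is not reproduced here). `Path.mkdir(parents=True)` first tries the leaf directory and, on `FileNotFoundError`, walks up to each parent. The `PermissionError` is therefore reported for the top-most missing directory (`/platform` in the log above), presumably because the container user cannot create directories under `/`. The helper names `first_missing_ancestor` and `ensure_output_dir`, and the idea of a fallback directory, are hypothetical and only for illustration.

```python
from pathlib import Path
from typing import Optional


def first_missing_ancestor(path: Path) -> Optional[Path]:
    """Return the top-most component of `path` that does not exist yet.

    This is the directory that mkdir(parents=True) has to create first,
    and therefore the one named in a PermissionError like the one above.
    Returns None if `path` already exists.
    """
    missing = None
    # Walk from the leaf towards the root until something exists.
    for candidate in [path, *path.parents]:
        if candidate.exists():
            break
        missing = candidate
    return missing


def ensure_output_dir(preferred: Path, fallback_root: Path) -> Path:
    """Create `preferred`; if that is not permitted, use a directory of the
    same name under `fallback_root` instead (hypothetical fallback policy)."""
    try:
        preferred.mkdir(parents=True, exist_ok=True)
        return preferred
    except PermissionError:
        fallback = fallback_root / preferred.name
        fallback.mkdir(parents=True, exist_ok=True)
        return fallback


if __name__ == "__main__":
    # Path taken from the traceback; this only inspects the filesystem
    # and does not create anything.
    out = Path("/platform/scratch/2023_03_02_123111_40-30-10-20_swarm_learning")
    print("first directory mkdir would need to create:", first_missing_ancestor(out))
```

Running the check inside the ML container shows which directory needs to exist and be writable, for example through a volume mount or a different output path, before `model/main.py` starts.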