KatherLab / swarm-learning-hpe

Experimental repo for the Odelia project, based on the HPE platform. This repo contains multiple models for histopathology and radiology training.
MIT License

SL nodes haven't been terminated before running the next local compare task #42

Open Ultimate-Storm opened 1 year ago

Ultimate-Storm commented 1 year ago

Port 16000 for the SL node shows as in use because the previous SL node has not been stopped, so the next task fails to start its containers.
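As a quick way to confirm the symptom before starting the next task, the port can be probed from the host. This is only a minimal sketch: `port_is_free` is a hypothetical helper, not part of the HPE Swarm Learning tooling, and it only detects listeners that are visible to the host's socket layer (for example Docker's userland proxy), so a `True` result is not a guarantee that Docker will be able to publish the port.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration; the probe socket is closed
    again as soon as the check is done.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something is still holding the port.
            return False
    return True
```

Calling `port_is_free(16000)` on the affected host before launching the next local compare task would be one way to fail early instead of waiting for the container start to fail.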


```
2023-03-15 13:01:08,895 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_task_local_compare_20230315_133712 , opId : 13485391426936112013 - Begins
2023-03-15 13:01:11,950 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_task_local_compare_20230315_133712 , opId : 13485391426936112013 - Ends
2023-03-15 13:01:12,102 : swarm.swop : INFO : SWOPRunTask: Profile validated
2023-03-15 13:01:15,146 : swarm.swop : INFO : SWOPRunTask: APLS configured with non-default port : 5000
2023-03-15 13:01:15,147 : swarm.swop : INFO : SWOPRunTask: SL Image Name : hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl:1.2.0
2023-03-15 13:01:15,242 : swarm.swop : INFO : SWOPRunTask: Arguments passed to User container idx : 0
2023-03-15 13:01:15,243 : swarm.swop : INFO : {'entrypoint': 'python3', 'detach': True, 'auto_remove': False, 'name': 'demo-swarm_task_local_compare_20230315_133712-u-0-cc4936c40dfcbd00', 'hostname': 'user-marugoto_mri-172.24.40.65', 'network': 'host-net', 'ports': {}, 'mounts': [{'Target': '/tmp/hpe-swarm', 'Source': 'swop-demo-swif-0', 'Type': 'volume', 'ReadOnly': False}, {'Target': '/tmp/test/model', 'Source': '/opt/hpe/swarm-learning-hpe/workspace/marugoto_mri/model', 'Type': 'bind', 'ReadOnly': False}, {'Target': '/tmp/test/data-and-scratch', 'Source': '/opt/hpe/swarm-learning-hpe/workspace/marugoto_mri/user/data-and-scratch', 'Type': 'bind', 'ReadOnly': False}], 'environment': {'DATA_DIR': 'data-and-scratch/data', 'SCRATCH_DIR': 'data-and-scratch/scratch', 'MODEL_DIR': 'model', 'MAX_EPOCHS': 100, 'MIN_PEERS': 5, 'LOCAL_COMPARE_FLAG': True, 'USE_ADAPTIVE_SYNC': False, 'SYNC_FREQUENCY': 32, 'MODEL_TYPE': 'transformer', 'SL_REQUEST_CHANNEL': '/tmp/hpe-swarm/demo.0.request.pipe', 'SL_RESPONSE_CHANNEL': '/tmp/hpe-swarm/demo.0.response.pipe'}, 'working_dir': '/tmp/test', 'user': '0:0', 'dns': [], 'labels': None, 'device_requests': [{'Driver': '', 'Count': 0, 'DeviceIDs': ['all'], 'Capabilities': [['gpu']], 'Options': {}}], 'shm_size': '16G'}
2023-03-15 13:01:15,243 : swarm.swop : INFO : SWOPRunTask: USER Image Name : user-env-marugoto-swop
2023-03-15 13:01:18,573 : swarm.swop : INFO : SWOPRunTask: failed to remove : POD : 0 , TYPE : SL , CID :6315ec5fe5a5cc1572cfa11539da519b0e2c876f9740873f4abadf638f87e481
2023-03-15 13:01:18,573 : swarm.swop : INFO : 500 Server Error for http+docker://localhost/v1.41/containers/6315ec5fe5a5cc1572cfa11539da519b0e2c876f9740873f4abadf638f87e481/start: Internal Server Error ("driver failed programming external connectivity on endpoint demo-swarm_task_local_compare_20230315_133712-s-0-cc4936c40dfcbd00 (b9ff2b414f8a2bd5080919ab0946d6004160d4573cbf39c8d22abd41c3933438): Bind for 0.0.0.0:16000 failed: port is already allocated")
2023-03-15 13:01:19,309 : swarm.swop : INFO : SWOPRunTask: Failed to start containers...Stopping Task Execution
2023-03-15 13:01:19,310 : swarm.swop : INFO : SWOPRunTask: Stopping Task
```
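The Docker error in the log names the blocked port, which could be used to look up the container that still holds it. This is a rough sketch, not how SWOP handles the failure: `parse_bind_failure` and `docker_ps_command_for_port` are hypothetical helpers, the regular expression is derived only from the one error line shown here, and the lookup assumes the leftover SL container published the port through Docker so that `docker ps --filter publish=<port>` can find it.

```python
import re
from typing import List, Optional, Tuple

# Pattern derived from the single Docker error line in the log above;
# other Docker versions may word the message differently.
_BIND_FAILURE = re.compile(
    r"Bind for (?P<host>[0-9.]+):(?P<port>[0-9]+) failed: port is already allocated"
)


def parse_bind_failure(message: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) from a 'port is already allocated' error, else None."""
    match = _BIND_FAILURE.search(message)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def docker_ps_command_for_port(port: int) -> List[str]:
    """Build (but do not run) a docker CLI call listing containers publishing `port`."""
    return [
        "docker", "ps",
        "--filter", f"publish={port}",
        "--format", "{{.ID}} {{.Names}}",
    ]
```

The command list can be passed to `subprocess.run` on the affected host. Any container it reports would be a candidate for `docker stop` before the task is retried.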