HewlettPackard / swarm-learning

A simplified library for decentralized, privacy preserving machine learning
Apache License 2.0

Error with Enrolled ID when custom MNIST example #210

Closed PNg-HA closed 6 months ago

PNg-HA commented 9 months ago

Issue description

SN1: image

SN2: image

Swarm Learning Version: 2.1.0

OS and ML Platform

Additional notes

htjain commented 9 months ago

Did you restart sl/ml manually (docker start/restart)?

PNg-HA commented 9 months ago

No. I didn't touch sl/ml.

iArpanPatel commented 9 months ago

@PNg-HA, thanks for providing the logs and details about the issue. I can see that you updated the SWOP profile on both hosts to add additional SL nodes.

Looking at the SL logs, each SL node exposes its file server on port 30305, which is the default port. Both SL nodes on a host therefore use the same port, and training gets stuck in the Merging stage. For details on exposed ports, see: https://github.com/HewlettPackard/swarm-learning/blob/master/docs/Install/Exposed_port_numbers.md

Host 1 SL1:

2023-12-09 18:18:02,213 : swarm.initSL : INFO : Setting up SL Container :  START
2023-12-09 18:18:02,254 : swarm.initSL : INFO : Instantiating file services for SL Container 192.168.120.190:30305
localIp for fs service = 172.25.0.5

Host 1 SL2:

2023-12-09 18:18:02,905 : swarm.initSL : INFO : Setting up SL Container :  START
2023-12-09 18:18:03,078 : swarm.initSL : INFO : Instantiating file services for SL Container 192.168.120.190:30305
localIp for fs service = 172.25.0.7
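As a rough way to spot this kind of collision without reading every log by hand, the file-service lines can be scanned with a small script. This is only a sketch: the function name `find_fs_collisions` and the labels `host1-sl1` / `host1-sl2` are made up for illustration, and the regular expression assumes the log line looks exactly like the two excerpts above.

```python
import re
from collections import defaultdict

# Matches the file-service line as it appears in the SL log excerpts above.
FS_LINE = re.compile(
    r"Instantiating file services for SL Container "
    r"(?P<ip>\d+(?:\.\d+){3}):(?P<port>\d+)"
)


def find_fs_collisions(logs):
    """Return {(ip, port): [labels]} for addresses advertised by more than one log.

    `logs` maps a caller-chosen label (hypothetical, not produced by
    Swarm Learning) to the text of one SL container log.
    """
    seen = defaultdict(list)
    for label, text in logs.items():
        for match in FS_LINE.finditer(text):
            seen[(match["ip"], int(match["port"]))].append(label)
    # Keep only addresses that show up in two or more logs.
    return {addr: labels for addr, labels in seen.items() if len(labels) > 1}


logs = {
    "host1-sl1": "2023-12-09 18:18:02,254 : swarm.initSL : INFO : "
                 "Instantiating file services for SL Container 192.168.120.190:30305",
    "host1-sl2": "2023-12-09 18:18:03,078 : swarm.initSL : INFO : "
                 "Instantiating file services for SL Container 192.168.120.190:30305",
}
collisions = find_fs_collisions(logs)
print(collisions)
```

With the two lines quoted above, both labels end up under the same `192.168.120.190:30305` address, which is the situation described in this comment.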

To solve this, provide a different free port in the SWOP profile for each SL node on a host. On both hosts, update the "slport: null" field for each SL node: replace 'null' with a distinct port number. Refer to the SWOP profile in the existing mnist example.
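Before restarting, the planned port assignment can be sanity-checked with something like the sketch below. It does not read a real SWOP profile; the `(host, node, slport)` tuples and the helper `duplicate_slports` are hypothetical, and it assumes that a `slport: null` entry falls back to the default port 30305 mentioned above.

```python
from collections import defaultdict

# Default file-server port mentioned in this thread (assumed fallback for null).
DEFAULT_SL_FS_PORT = 30305


def duplicate_slports(nodes):
    """Return {(host, port): [node names]} for ports used more than once on a host.

    `nodes` is a list of (host, node_name, slport) tuples. slport=None stands
    for a `slport: null` entry and is treated as the default port.
    """
    by_host_port = defaultdict(list)
    for host, name, slport in nodes:
        port = DEFAULT_SL_FS_PORT if slport is None else slport
        by_host_port[(host, port)].append(name)
    return {key: names for key, names in by_host_port.items() if len(names) > 1}


# Two SL nodes on one host, both left at the default: reported as a clash.
before = [("host1", "sl1", None), ("host1", "sl2", None)]
# The same nodes with distinct ports: nothing reported.
after = [("host1", "sl1", 16000), ("host1", "sl2", 17000)]

print(duplicate_slports(before))
print(duplicate_slports(after))
```

The check is per host on purpose: the same port number can be reused on a different host, but two SL nodes on one host need distinct values.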

htjain commented 9 months ago

Strange, I have seen this issue when I restarted sl/ml manually.

iArpanPatel commented 9 months ago

Hi @PNg-HA, did you try the suggested updates?

PNg-HA commented 9 months ago

> Hi @PNg-HA, did you try suggested updates?

Hi @iArpanPatel. Thank you for your detailed guide. I kept the default slport (~) for one SL node in each SWOP profile, so one SL container on each host does not export the port (my lazy mistake for not changing all the ports as your guide says). I will change all the ports and let you know if there are still errors.

PNg-HA commented 7 months ago

I hit a new error in the sentinel SN (host 1): image

I have updated slport on both hosts. On host 1, sl1 uses port 16000 and sl2 uses port 17000. On host 2, sl3 uses port 1600 and sl4 uses port 17000. sn2 is normal: image

Here are the logs of the other containers: custom_mnist_2.zip

PNg-HA commented 6 months ago

The new error is fixed in Swarm Learning version 2.2.0. Thank you very much!