Status: Closed. PNg-HA closed this issue 6 months ago.
Did you restart sl/ml manually (docker start/restart)?
No. I didn't touch sl/ml.
@PNg-HA, thanks for providing all the logs and details about the issue. As I can see, you have updated the SWOP profile on both hosts to add additional SL nodes.
Looking at the SL logs, each SL exposes its file server on port 30305, which is the default port, and the run gets stuck in the Merging stage. For details on exposed ports, see: https://github.com/HewlettPackard/swarm-learning/blob/master/docs/Install/Exposed_port_numbers.md
Host 1 SL1:
2023-12-09 18:18:02,213 : swarm.initSL : INFO : Setting up SL Container : START
2023-12-09 18:18:02,254 : swarm.initSL : INFO : Instantiating file services for SL Container 192.168.120.190:30305
localIp for fs service = 172.25.0.5
Host 1 SL2:
2023-12-09 18:18:02,905 : swarm.initSL : INFO : Setting up SL Container : START
2023-12-09 18:18:03,078 : swarm.initSL : INFO : Instantiating file services for SL Container 192.168.120.190:30305
localIp for fs service = 172.25.0.7
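To repeat this check on other SL containers, the file-service endpoint can be pulled out of the logs with a small filter. This is only a sketch: the function name `fs_endpoints` is made up for illustration, and it assumes the log line keeps the exact wording shown in the excerpts above.

```shell
#!/usr/bin/env bash
# fs_endpoints (hypothetical helper): read SL log text on stdin and print the
# ip:port from every "Instantiating file services" line.
fs_endpoints() {
    sed -n 's/.*Instantiating file services for SL Container \([0-9.]*:[0-9]*\).*/\1/p'
}

# Demo using the two Host 1 SL1 log lines quoted above.
printf '%s\n' \
    '2023-12-09 18:18:02,254 : swarm.initSL : INFO : Instantiating file services for SL Container 192.168.120.190:30305' \
    'localIp for fs service = 172.25.0.5' | fs_endpoints
```

Against a running container it could be used as `docker logs <sl-container-name> 2>&1 | fs_endpoints`, where the container name is whatever SWOP assigned on that host.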
To solve this, provide a different free port in the SWOP profile for each SL node on a host. Update the "slport: null" field for each SL node in the SWOP profile on both hosts, replacing 'null' with distinct port numbers. Refer to the existing mnist example SWOP profile.
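Before restarting SWOP, a quick sanity check can confirm that no two SL nodes on the same host were given the same port. The following is a minimal sketch, not part of Swarm Learning: `find_port_collisions` is a hypothetical helper, and the `name:port` pairs are typed in by hand rather than read from the SWOP profile.

```shell
#!/usr/bin/env bash
# find_port_collisions (hypothetical helper): take "name:port" pairs for the SL
# nodes of ONE host and print every port that is assigned to more than one node.
find_port_collisions() {
    printf '%s\n' "$@" |
        awk -F: '{ count[$2]++; names[$2] = names[$2] " " $1 }
                 END { for (p in count) if (count[p] > 1) print p ":" names[p] }' |
        sort
}

# Host 1 as seen in the logs: both SL nodes fall back to the default port.
find_port_collisions sl1:30305 sl2:30305   # prints "30305: sl1 sl2"

# Host 1 with distinct ports: nothing is printed.
find_port_collisions sl1:16000 sl2:17000
```

The check only covers duplicates within the list it is given. Whether a chosen port is actually free on the host still has to be verified separately.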
Strange, I have seen this issue when I restarted sl/ml manually.
Hi @PNg-HA, did you try the suggested updates?
Hi @iArpanPatel. Thank you for your detailed guide. I kept the default slport (~) for one SL in each SWOP profile, so one SL container on each host does not export the port (my lazy mistake for not changing all the ports as your guide says). I will change all the ports and let you know if there are still errors.
I ran into a new error in the sentinel SN (host 1):
I have updated slport on both hosts. On host 1, sl1 uses port 16000 and sl2 uses port 17000. On host 2, sl3 uses port 1600 and sl4 uses port 17000. sn2 is normal:
Here are the logs of the other containers: custom_mnist_2.zip
The new error is fixed in Swarm Learning version 2.2.0. Thank you very much!
Issue description
issue description: When I configure the MNIST example so that each host has 2 SL nodes, the SN nodes return an error related to the enrolled ID.
occurrence - consistent or rare: consistent
error messages: SLBlackBoardObj : errUidNotMatchingWithEnroll:UID PASSED WITH CHECKIN IS NOT MATCHING WITH ENROLLED UID
commands used for starting SN1:
./scripts/bin/run-sn -d --rm --name=sn1 \
    --network=host-1-net --host-ip=${HOST_1_IP} \
    --sentinel --sn-p2p-port=${SN_P2P_PORT} \
    --sn-api-port=${SN_API_PORT} \
    --key=workspace/mnist/cert/sn-1-key.pem \
    --cert=workspace/mnist/cert/sn-1-cert.pem \
    --capath=workspace/mnist/cert/ca/capath \
    --apls-ip=${APLS_IP}
commands used for starting SN2:
./scripts/bin/run-sn -d --rm --name=sn2 \
    --network=host-2-net --host-ip=${HOST_2_IP} \
    --sentinel-ip=${SN_1_IP} --sn-p2p-port=${SN_P2P_PORT} \
    --sn-api-port=${SN_API_PORT} --key=workspace/mnist/cert/sn-2-key.pem \
    --cert=workspace/mnist/cert/sn-2-cert.pem \
    --capath=workspace/mnist/cert/ca/capath \
    --apls-ip=${APLS_IP}
commands used for starting SWOP1:
./scripts/bin/run-swop -d --name=swop1 --network=host-1-net \
    --sn-ip=${SN_1_IP} --sn-api-port=${SN_API_PORT} \
    --usr-dir=workspace/mnist/swop --profile-file-name=swop1_profile.yaml \
    --key=workspace/mnist/cert/swop-1-key.pem \
    --cert=workspace/mnist/cert/swop-1-cert.pem \
    --capath=workspace/mnist/cert/ca/capath -e SWOP_KEEP_CONTAINERS=True -e http_proxy= -e https_proxy= \
    --apls-ip=${APLS_IP}
command for SWOP2:
./scripts/bin/run-swop -d --name=swop2 --network=host-2-net \
    --sn-ip=${SN_2_IP} --sn-api-port=${SN_API_PORT} \
    --usr-dir=workspace/mnist/swop --profile-file-name=swop2_profile.yaml \
    --key=workspace/mnist/cert/swop-2-key.pem \
    --cert=workspace/mnist/cert/swop-2-cert.pem \
    --capath=workspace/mnist/cert/ca/capath -e SWOP_KEEP_CONTAINERS=True -e http_proxy= -e https_proxy= \
    --apls-ip=${APLS_IP}
command for SWCI in host 1:
./scripts/bin/run-swci --name=swci1 --network=host-1-net \
    --usr-dir=workspace/mnist/swci --init-script-name=swci-init \
    --key=workspace/mnist/cert/swci-1-key.pem \
    --cert=workspace/mnist/cert/swci-1-cert.pem \
    --capath=workspace/mnist/cert/ca/capath \
    -e http_proxy= -e https_proxy= --apls-ip=${APLS_IP}
docker logs [APLS, SPIRE, SN, SL, SWCI]:
Host 1:
Host 2:
APLS:
SN1:
SN2:
Swarm Learning Version: 2.1.0
OS and ML Platform
Additional notes