Open a terminal window. Run: bacalhau serve --node-type=requester
Open a second terminal window. Run: bacalhau --repo=~/.bacalhau_compute serve --port=1236 --node-type=compute --network=nats --orchestrators=nats://0.0.0.0:4222
Assert the compute node and requester node are connected:
bacalhau node list
ID          TYPE       APPROVAL  STATUS     LABELS                                     CPU     MEMORY     DISK      GPU
n-a98459de  Compute    APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=8192-MiB   25.6 /  37.6 GB /  1.2 TB /  1 /
                                            GPU-0=NVIDIA-GeForce-RTX-2080-SUPER        25.6    37.6 GB    1.2 TB    1
                                            Operating-System=linux git-lfs=true
n-c68688ce  Requester  APPROVED             Architecture=amd64 Operating-System=linux
Kill the requester node.
The compute node prints logs that report "nats: timeout", then eventually "nats: connection closed":
09:32:10.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
09:32:10.54 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:32:40.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: timeout" [NodeID:n-a98459de]
09:32:40.54 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:32:42.541 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
09:32:42.541 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:33:10.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
09:33:10.54 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:33:40.541 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: timeout" [NodeID:n-a98459de]
09:33:40.541 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:33:42.542 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
09:33:42.542 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:34:08.539 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: connection closed" [NodeID:n-a98459de]
09:34:08.539 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: connection closed" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
09:34:08.539 | ERR pkg/compute/management_client.go:142 > heartbeat failed sending sequence 11 error="nats: connection closed" [NodeID:n-a98459de]
09:34:23.539 | ERR pkg/compute/management_client.go:142 > heartbeat failed sending sequence 12 error="nats: connection closed" [NodeID:n-a98459de]
09:34:38.539 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: connection closed" [NodeID:n-a98459de]
09:34:38.539 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: connection closed" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
Restart the requester node.
Observe that the compute node keeps printing the log messages above and that the requester continues to report the compute node as disconnected:
bacalhau node list
ID          TYPE       APPROVAL  STATUS        LABELS                                     CPU     MEMORY     DISK      GPU
n-a98459de  Compute    APPROVED  DISCONNECTED  Architecture=amd64 GPU-0-Memory=8192-MiB   25.6 /  37.6 GB /  1.2 TB /  1 /
                                               GPU-0=NVIDIA-GeForce-RTX-2080-SUPER        25.6    37.6 GB    1.2 TB    1
                                               Operating-System=linux git-lfs=true
n-c68688ce  Requester  APPROVED                Architecture=amd64 Operating-System=linux
Note that when the compute node is restarted, it reconnects to the requester node as expected. This may indicate a bug in the reconnection logic on the compute node.