bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Compute nodes fail to reconnect to requester after the requester goes offline. #4114

Closed frrist closed 6 days ago

frrist commented 1 week ago

Steps to reproduce:

  1. Open a terminal window. Run: bacalhau serve --node-type=requester
  2. Open a second terminal window. Run: bacalhau --repo=~/.bacalhau_compute serve --port=1236 --node-type=compute --network=nats --orchestrators=nats://0.0.0.0:4222
  3. Assert the compute node and requester node are connected:
    bacalhau node list
    ID          TYPE       APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
    n-a98459de  Compute    APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=8192-MiB            25.6 /  37.6 GB /   1.2 TB /     1 /  
                                             GPU-0=NVIDIA-GeForce-RTX-2080-SUPER                 25.6    37.6 GB     1.2 TB       1    
                                             Operating-System=linux git-lfs=true                                                       
    n-c68688ce  Requester  APPROVED             Architecture=amd64 Operating-System=linux                                                 
  4. Kill the requester node.
  5. The compute node prints logs that indicate connection timeouts, then eventually a closed connection:
    09:32:10.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:32:10.54 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:32:40.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:32:40.54 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:32:42.541 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:32:42.541 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:33:10.54 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:33:10.54 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:33:40.541 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:33:40.541 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:33:42.542 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: timeout" [NodeID:n-a98459de]
    09:33:42.542 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: timeout" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:34:08.539 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateResources/v1 error="nats: connection closed" [NodeID:n-a98459de]
    09:34:08.539 | WRN pkg/compute/management_client.go:136 > failed to send resource update to requester node error="failed to send response to update resources request: nats: connection closed" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateResources","line":"93","source":"management_proxy.go"},{"func":"(*ManagementClient).updateResources","line":"134","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"181","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
    09:34:08.539 | ERR pkg/compute/management_client.go:142 > heartbeat failed sending sequence 11 error="nats: connection closed" [NodeID:n-a98459de]
    09:34:23.539 | ERR pkg/compute/management_client.go:142 > heartbeat failed sending sequence 12 error="nats: connection closed" [NodeID:n-a98459de]
    09:34:38.539 | WRN pkg/nats/proxy/management_proxy.go:118 > error sending request to subject node.management.n-a98459de-37d1-43bc-853b-b4c4408e42d7.UpdateNodeInfo/v1 error="nats: connection closed" [NodeID:n-a98459de]
    09:34:38.539 | ERR pkg/compute/management_client.go:115 > failed to send update info to requester node error="failed to send response to update info request: nats: connection closed" [NodeID:n-a98459de] [stack:[{"func":"(*ManagementProxy).UpdateInfo","line":"76","source":"management_proxy.go"},{"func":"(*ManagementClient).deliverInfo","line":"111","source":"management_client.go"},{"func":"(*ManagementClient).Start","line":"178","source":"management_client.go"},{"func":"goexit","line":"1650","source":"asm_amd64.s"}]]
  6. Restart the requester node.
  7. Observe that the log messages above continue to print and that the requester still lists the compute node as DISCONNECTED:
    bacalhau node list
    ID          TYPE       APPROVAL  STATUS        LABELS                                              CPU     MEMORY      DISK         GPU  
    n-a98459de  Compute    APPROVED  DISCONNECTED  Architecture=amd64 GPU-0-Memory=8192-MiB            25.6 /  37.6 GB /   1.2 TB /     1 /  
                                                GPU-0=NVIDIA-GeForce-RTX-2080-SUPER                 25.6    37.6 GB     1.2 TB       1    
                                                Operating-System=linux git-lfs=true                                                       
    n-c68688ce  Requester  APPROVED                Architecture=amd64 Operating-System=linux                                                 

frrist commented 1 week ago

Note that restarting the compute node lets it reconnect to the requester node as expected. This suggests the bug may be in the reconnection logic on the compute node.