bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

`bacalhau node list` returns error `failed request: invalid node type: nodeTypeUndefined` #4024

Closed frrist closed 1 month ago

frrist commented 1 month ago

Bug Description

See title

Expected Behavior

It lists the nodes

Steps to Reproduce

  1. install main
  2. run a server
  3. list nodes
  4. see error

frrist commented 1 month ago

I am working from the staging cluster, which contains 4 nodes. It appears there is an extra (5th) node somewhere in the store with an undefined type, and it is causing these errors:

curl http://bootstrap.staging.bacalhau.org:1234/api/v1/orchestrator/nodes
{
  "NextToken": "",
  "Nodes": [
    {
      "Info": {
        "NodeID": "",
        "NodeType": "nodeTypeUndefined",
        "Labels": null,
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "DISCONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "s3",
            "local",
            "noop",
            "ipfs"
          ],
          "StorageSources": [
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "noop",
            "ipfs",
            "s3",
            "local"
          ],
          "StorageSources": [
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "GPU-0": "Tesla-T4",
          "GPU-0-Memory": "15360-MiB",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "local",
            "noop",
            "ipfs",
            "s3"
          ],
          "StorageSources": [
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload",
            "inline",
            "repoclone"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "s3",
            "local",
            "noop",
            "ipfs"
          ],
          "StorageSources": [
            "ipfs",
            "urldownload",
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "APPROVED",
      "Connection": "CONNECTED"
    }
  ]
}
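
A quick way to spot the bad entry in a response like the one above is to decode only the fields of interest and flag any node with an empty NodeID or a NodeType of nodeTypeUndefined. This is a minimal sketch: the `nodeList` type and `invalidNodeIndexes` helper are hypothetical names for illustration, not part of the Bacalhau codebase, and the sample body is a trimmed copy of the response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList mirrors only the response fields this check needs.
// It is a hypothetical, trimmed-down shape, not a Bacalhau type.
type nodeList struct {
	Nodes []struct {
		Info struct {
			NodeID   string
			NodeType string
		}
		Connection string
	}
}

// invalidNodeIndexes returns the positions of entries whose NodeID is empty
// or whose NodeType is the undefined value seen in the response.
func invalidNodeIndexes(body []byte) ([]int, error) {
	var list nodeList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, fmt.Errorf("decoding node list: %w", err)
	}
	var bad []int
	for i, n := range list.Nodes {
		if n.Info.NodeID == "" || n.Info.NodeType == "nodeTypeUndefined" {
			bad = append(bad, i)
		}
	}
	return bad, nil
}

func main() {
	// Trimmed copy of the first two entries of the response.
	body := []byte(`{"Nodes":[
		{"Info":{"NodeID":"","NodeType":"nodeTypeUndefined"},"Connection":"DISCONNECTED"},
		{"Info":{"NodeID":"QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K","NodeType":"Compute"},"Connection":"CONNECTED"}
	]}`)
	bad, err := invalidNodeIndexes(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(bad) // prints [0]
}
```
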
frrist commented 1 month ago

I believe this is related to previous state in the cluster's node store, since filtering for connected nodes works but filtering for disconnected nodes does not:

export BACALHAU_API_HOST=bootstrap.staging.bacalhau.org
frrist@cypress ~> bacalhau node list --filter-status=connected
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute            CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
Error: failed request: invalid node type: nodeTypeUndefined
Usage:
  bacalhau node list [flags]

Flags:
      --filter-approval string   Filter nodes by approval. One of: ["approved" "pending" "rejected"]
      --filter-status string     Filter nodes by status. One of: ["connected" "disconnected"]
  -h, --help                     help for list
      --hide-header              do not print the column headers.
      --labels string            Filter nodes by labels. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more information.
      --limit uint32             Limit the number of results returned
      --next-token string        Next token to use for pagination
      --no-style                 remove all styling from table output.
      --order-by string          Order results by a field. Valid fields are: id, type, available_cpu, available_memory, available_disk, available_gpu, status
      --order-reversed           Reverse the order of the results
      --output format            The output format for the command (one of ["table" "csv" "json" "yaml"]) (default table)
      --pretty                   Pretty print the output. Only applies to json and yaml output formats.
      --show strings             What column groups to show. Zero or more of: ["labels" "version" "features" "capacity"] (default [labels,capacity])
      --wide                     Print full values in the table results

Global Flags:
      --api-host string         The host for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_HOST environment variable is set. (default "bootstrap.production.bacalhau.org")
      --api-port int            The port for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_PORT environment variable is set. (default 1234)
      --cacert string           The location of a CA certificate file when self-signed certificates
                                    are used by the server
      --insecure                Enables TLS but does not verify certificates
      --log-mode logging-mode   Log format: 'default','station','json','combined','event' (default default)
      --repo string             path to bacalhau repo (default "/home/frrist/.bacalhau")
      --tls                     Instructs the client to use TLS

failed request: invalid node type: nodeTypeUndefined
frrist commented 1 month ago

I have performed the following operations on each node in the cluster to remove the invalid node from the state of the requester node:

bacalhau-vm-stage-0 (requester+compute)

systemctl stop bacalhau
rm /data/compute_store/QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv.registration.lock
rm /data/orchestrator_store/nats-store
systemctl start bacalhau

bacalhau-vm-stage-1 (compute)

systemctl stop bacalhau
rm compute_store/QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-2 (compute)

systemctl stop bacalhau
rm compute_store/QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-3 (compute)

systemctl stop bacalhau
rm compute_store/Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw.registration.lock
systemctl start bacalhau

The node list command is now working as expected:

 frrist@cypress ~> bacalhau node list
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=connected
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
 ID  TYPE  APPROVAL  STATUS  LABELS  CPU  MEMORY  DISK  GPU 
frrist commented 1 month ago

The cause of this issue is a change in the state stored in the NodeStore (NATS KV store) between v1.3.0 and v1.3.1-rc1.

In v1.3.0 the NodeStore operates over, and contains, NodeInfo: https://github.com/bacalhau-project/bacalhau/blob/b09858fc4d7659dcf0a7229e97f3900694991c47/pkg/routing/kvstore/kvstore.go#L70-L82

https://github.com/bacalhau-project/bacalhau/blob/b09858fc4d7659dcf0a7229e97f3900694991c47/pkg/model/nodeinfo.go#L25-L31

In v1.3.1-rc1 the NodeStore operates over, and contains, NodeState: https://github.com/bacalhau-project/bacalhau/blob/3b3d8a8a7e4ddd79f38a4cd4b731885237e45245/pkg/routing/kvstore/kvstore.go#L70-L82

https://github.com/bacalhau-project/bacalhau/blob/3b3d8a8a7e4ddd79f38a4cd4b731885237e45245/pkg/models/node_state.go#L5-L9

A NodeInfo cannot be unmarshaled into a NodeState, which is why the list shows a node with undefined fields: it is data from v1.3.0, still in the store, that no longer meets the requirements of v1.3.1-rc1.

How did we get here? After v1.3.0 was released, several changes were made to the NodeInfo type:

  1. An Approval field was added to track node membership
  2. A State field was added to track a node's connection state
  3. Shortly after, a bug was discovered in the logic of the changes mentioned in 1. and 2.
  4. A fix was created and merged to address the bug: https://github.com/bacalhau-project/bacalhau/pull/3785
    • The fix was validated for compatibility at the protocol level, meaning:
      • v1.3.0 Requester communicating with a v1.3.0 Compute
      • v1.3.0 Requester communicating with a v1.3.1-rc1 Compute
      • v1.3.1-rc1 Requester communicating with a v1.3.0 Compute
      • v1.3.1-rc1 Requester communicating with a v1.3.1-rc1 Compute

The problem is that we never validated that a v1.3.1-rc1 Requester could open a v1.3.0 Requester store. The fix appears to be one of:

  1. Implement a repo migration that deletes the KV store from the requester and removes the sentinel file compute nodes use to track registration. This forces the requester to create a new node store and ensures compute nodes previously connected to it re-register. https://github.com/bacalhau-project/bacalhau/pull/4030
  2. Write a migration for the requester node's NodeInfo store. Because a NATS transport must be available to access the store, this solution is more complicated and ends up being pretty ugly in practice. https://github.com/bacalhau-project/bacalhau/pull/4029
  3. Tell users to manually delete their requester NodeStore, remove their compute node registration file, and then restart their compute nodes (no one is going to love this)