bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
Apache License 2.0

`bacalhau node list` returns error `failed request: invalid node type: nodeTypeUndefined` #4024

Closed frrist closed 1 month ago

frrist commented 1 month ago

Bug Description

See title

Expected Behavior

It lists the nodes

Steps to Reproduce

  1. install main
  2. run a server
  3. list nodes
  4. see error

frrist commented 1 month ago

I am working from the staging cluster, which contains 4 nodes. It appears there is an extra (5th) node somewhere in the store whose type is undefined, and it is causing these errors:

  "NextToken": "",
  "Nodes": [
      "Info": {
        "NodeID": "",
        "NodeType": "nodeTypeUndefined",
        "Labels": null,
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
      "Membership": "",
      "Connection": "DISCONNECTED"
      "Info": {
        "NodeID": "QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        "ComputeNodeInfo": {
          "ExecutionEngines": [
          "Publishers": [
          "StorageSources": [
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
      "Membership": "",
      "Connection": "CONNECTED"
      "Info": {
        "NodeID": "QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        "ComputeNodeInfo": {
          "ExecutionEngines": [
          "Publishers": [
          "StorageSources": [
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
      "Membership": "",
      "Connection": "CONNECTED"
      "Info": {
        "NodeID": "Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "GPU-0": "Tesla-T4",
          "GPU-0-Memory": "15360-MiB",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        "ComputeNodeInfo": {
          "ExecutionEngines": [
          "Publishers": [
          "StorageSources": [
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
      "Membership": "",
      "Connection": "CONNECTED"
      "Info": {
        "NodeID": "QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        "ComputeNodeInfo": {
          "ExecutionEngines": [
          "Publishers": [
          "StorageSources": [
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
      "Membership": "APPROVED",
      "Connection": "CONNECTED"
frrist commented 1 month ago

I believe this is related to previous state in the cluster's node store, since filtering for connected nodes works but filtering for disconnected nodes does not:

frrist@cypress ~> bacalhau node list --filter-status=connected
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute            CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
Error: failed request: invalid node type: nodeTypeUndefined
  bacalhau node list [flags]

      --filter-approval string   Filter nodes by approval. One of: ["approved" "pending" "rejected"]
      --filter-status string     Filter nodes by status. One of: ["connected" "disconnected"]
  -h, --help                     help for list
      --hide-header              do not print the column headers.
      --labels string            Filter nodes by labels. See for more information.
      --limit uint32             Limit the number of results returned
      --next-token string        Next token to use for pagination
      --no-style                 remove all styling from table output.
      --order-by string          Order results by a field. Valid fields are: id, type, available_cpu, available_memory, available_disk, available_gpu, status
      --order-reversed           Reverse the order of the results
      --output format            The output format for the command (one of ["table" "csv" "json" "yaml"]) (default table)
      --pretty                   Pretty print the output. Only applies to json and yaml output formats.
      --show strings             What column groups to show. Zero or more of: ["labels" "version" "features" "capacity"] (default [labels,capacity])
      --wide                     Print full values in the table results

Global Flags:
      --api-host string         The host for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_HOST environment variable is set. (default "")
      --api-port int            The port for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_PORT environment variable is set. (default 1234)
      --cacert string           The location of a CA certificate file when self-signed certificates
                                    are used by the server
      --insecure                Enables TLS but does not verify certificates
      --log-mode logging-mode   Log format: 'default','station','json','combined','event' (default default)
      --repo string             path to bacalhau repo (default "/home/frrist/.bacalhau")
      --tls                     Instructs the client to use TLS

failed request: invalid node type: nodeTypeUndefined
frrist commented 1 month ago

I have performed the following operations on each node in the cluster to remove the invalid node from the state of the requester node:

bacalhau-vm-stage-0 (requester+compute)

systemctl stop bacalhau
rm /data/compute_store/QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv.registration.lock
rm /data/orchestrator_store/nats-store
systemctl start bacalhau

bacalhau-vm-stage-1 (compute)

systemctl stop bacalhau
rm compute_store/QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-2 (compute)

systemctl stop bacalhau
rm compute_store/QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-3 (compute)

systemctl stop bacalhau
rm compute_store/Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw.registration.lock
systemctl start bacalhau

The node list command is now working as expected:

 frrist@cypress ~> bacalhau node list
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=connected
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
frrist commented 1 month ago

The cause of this issue is a change in the state held in the NodeStore (NATS KV store) between v1.3.0 and v1.3.1-rc1.

In v1.3.0, the NodeStore operates over, and contains, NodeInfo.

In v1.3.1-rc1, the NodeStore operates over, and contains, NodeState.

NodeInfo cannot be unmarshaled into a NodeState type, which is why the list shows a node with undefined fields. It is data from v1.3.0, still held in the store, that no longer meets the requirements of v1.3.1-rc1.
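To make the failure mode concrete, here is a minimal Go sketch of what probably happens when a v1.3.0-style payload is decoded as the newer type. The struct definitions are simplified, hypothetical stand-ins based only on the fields visible in the JSON dump above, and `decodeState` and `looksLegacy` are illustrative names, not Bacalhau functions. In this sketch `NodeType` is a plain string, so its zero value is an empty string rather than `nodeTypeUndefined`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeInfo is a simplified, hypothetical stand-in for the v1.3.0 record.
type NodeInfo struct {
	NodeID   string
	NodeType string
}

// NodeState is a simplified, hypothetical stand-in for the v1.3.1-rc1 record,
// which wraps NodeInfo and adds membership and connection fields.
type NodeState struct {
	Info       NodeInfo
	Membership string
	Connection string
}

// decodeState decodes a raw store value as a NodeState. encoding/json ignores
// keys it does not recognise, so a NodeInfo payload decodes without an error
// and simply leaves every NodeState field at its zero value.
func decodeState(raw []byte) (NodeState, error) {
	var s NodeState
	err := json.Unmarshal(raw, &s)
	return s, err
}

// looksLegacy reports whether a decoded entry has no node ID, which is how a
// NodeInfo payload decoded as a NodeState presents itself in this sketch.
func looksLegacy(s NodeState) bool {
	return s.Info.NodeID == ""
}

func main() {
	// A v1.3.0-style value: the NodeInfo fields sit at the top level.
	legacy := []byte(`{"NodeID":"QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv","NodeType":"Compute"}`)

	s, err := decodeState(legacy)
	fmt.Printf("err=%v legacy=%v state=%+v\n", err, looksLegacy(s), s)
}
```

The point of the sketch is that the decode succeeds silently. Nothing fails at read time, and the empty record only surfaces later, when something validates the node type.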

How did we get here? After v1.3.0 was released, several changes were made to the NodeInfo type:

  1. An Approval field was added to track node membership.
  2. A State field was added to track a node's connection state.
  3. Shortly after, a bug was discovered in the logic of the changes mentioned in 1. and 2.
  4. A fix was created and merged to address the bug:
    • The fix was validated for compatibility at the protocol level, meaning:
      • v1.3.0 Requester communicating with a v1.3.0 Compute.
      • v1.3.0 Requester communicating with a v1.3.1-rc1 Compute.
      • v1.3.1-rc1 Requester communicating with a v1.3.0 Compute.
      • v1.3.1-rc1 Requester communicating with a v1.3.1-rc1 Compute.

The problem is that nobody validated that a v1.3.1-rc1 Requester could open a store written by a v1.3.0 Requester. The fix appears to be one of:

  1. Implement a repo migration that deletes the KV store from the requester and removes the sentinel file that compute nodes use to track registration. This forces the requester to create a new node store and ensures that compute nodes previously connected to it re-register.
  2. Write a migration for the requester node's NodeInfo store. Given that a NATS transport must be available to access the store, this solution is more complicated and ends up being pretty ugly in practice.
  3. Tell users to manually delete their requester NodeStore, remove their compute node registration file, and then restart their compute nodes (no one is going to love this).