Closed frrist closed 1 month ago
I am working from the staging cluster, which contains 4 nodes. It appears there is an extra (5th) node somewhere in the store that is of type undefined, causing these errors:
curl http://bootstrap.staging.bacalhau.org:1234/api/v1/orchestrator/nodes
{
"NextToken": "",
"Nodes": [
{
"Info": {
"NodeID": "",
"NodeType": "nodeTypeUndefined",
"Labels": null,
"BacalhauVersion": {
"GitVersion": "",
"GitCommit": "",
"BuildDate": "0001-01-01T00:00:00Z",
"GOOS": "",
"GOARCH": ""
}
},
"Membership": "",
"Connection": "DISCONNECTED"
},
{
"Info": {
"NodeID": "QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K",
"NodeType": "Compute",
"Labels": {
"Architecture": "amd64",
"Operating-System": "linux",
"git-lfs": "false",
"owner": "bacalhau"
},
"ComputeNodeInfo": {
"ExecutionEngines": [
"docker",
"wasm"
],
"Publishers": [
"s3",
"local",
"noop",
"ipfs"
],
"StorageSources": [
"inline",
"repoclone",
"repoclonelfs",
"s3",
"ipfs",
"urldownload"
],
"MaxCapacity": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83047314227
},
"QueueCapacity": {},
"AvailableCapacity": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83047314227
},
"MaxJobRequirements": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83047314227
},
"RunningExecutions": 0,
"EnqueuedExecutions": 0
},
"BacalhauVersion": {
"GitVersion": "",
"GitCommit": "",
"BuildDate": "0001-01-01T00:00:00Z",
"GOOS": "",
"GOARCH": ""
}
},
"Membership": "",
"Connection": "CONNECTED"
},
{
"Info": {
"NodeID": "QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ",
"NodeType": "Compute",
"Labels": {
"Architecture": "amd64",
"Operating-System": "linux",
"git-lfs": "false",
"owner": "bacalhau"
},
"ComputeNodeInfo": {
"ExecutionEngines": [
"docker",
"wasm"
],
"Publishers": [
"noop",
"ipfs",
"s3",
"local"
],
"StorageSources": [
"inline",
"repoclone",
"repoclonelfs",
"s3",
"ipfs",
"urldownload"
],
"MaxCapacity": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83046763724
},
"QueueCapacity": {},
"AvailableCapacity": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83046763724
},
"MaxJobRequirements": {
"CPU": 3.2,
"Memory": 13406204723,
"Disk": 83046763724
},
"RunningExecutions": 0,
"EnqueuedExecutions": 0
},
"BacalhauVersion": {
"GitVersion": "",
"GitCommit": "",
"BuildDate": "0001-01-01T00:00:00Z",
"GOOS": "",
"GOARCH": ""
}
},
"Membership": "",
"Connection": "CONNECTED"
},
{
"Info": {
"NodeID": "Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw",
"NodeType": "Compute",
"Labels": {
"Architecture": "amd64",
"GPU-0": "Tesla-T4",
"GPU-0-Memory": "15360-MiB",
"Operating-System": "linux",
"git-lfs": "false",
"owner": "bacalhau"
},
"ComputeNodeInfo": {
"ExecutionEngines": [
"docker",
"wasm"
],
"Publishers": [
"local",
"noop",
"ipfs",
"s3"
],
"StorageSources": [
"repoclonelfs",
"s3",
"ipfs",
"urldownload",
"inline",
"repoclone"
],
"MaxCapacity": {
"CPU": 3.2,
"Memory": 12560636313,
"Disk": 32934753075,
"GPU": 1,
"GPUs": [
{
"Index": 0,
"Name": "Tesla T4",
"Vendor": "NVIDIA",
"Memory": 15360,
"PCIAddress": ""
}
]
},
"QueueCapacity": {},
"AvailableCapacity": {
"CPU": 3.2,
"Memory": 12560636313,
"Disk": 32934753075,
"GPU": 1,
"GPUs": [
{
"Index": 0,
"Name": "Tesla T4",
"Vendor": "NVIDIA",
"Memory": 15360,
"PCIAddress": ""
}
]
},
"MaxJobRequirements": {
"CPU": 3.2,
"Memory": 12560636313,
"Disk": 32934753075,
"GPU": 1,
"GPUs": [
{
"Index": 0,
"Name": "Tesla T4",
"Vendor": "NVIDIA",
"Memory": 15360,
"PCIAddress": ""
}
]
},
"RunningExecutions": 0,
"EnqueuedExecutions": 0
},
"BacalhauVersion": {
"GitVersion": "",
"GitCommit": "",
"BuildDate": "0001-01-01T00:00:00Z",
"GOOS": "",
"GOARCH": ""
}
},
"Membership": "",
"Connection": "CONNECTED"
},
{
"Info": {
"NodeID": "QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv",
"NodeType": "Compute",
"Labels": {
"Architecture": "amd64",
"Operating-System": "linux",
"git-lfs": "false",
"owner": "bacalhau"
},
"ComputeNodeInfo": {
"ExecutionEngines": [
"docker",
"wasm"
],
"Publishers": [
"s3",
"local",
"noop",
"ipfs"
],
"StorageSources": [
"ipfs",
"urldownload",
"inline",
"repoclone",
"repoclonelfs",
"s3"
],
"MaxCapacity": {
"CPU": 3.2,
"Memory": 13406208000,
"Disk": 79441883955
},
"QueueCapacity": {},
"AvailableCapacity": {
"CPU": 3.2,
"Memory": 13406208000,
"Disk": 79441883955
},
"MaxJobRequirements": {
"CPU": 3.2,
"Memory": 13406208000,
"Disk": 79441883955
},
"RunningExecutions": 0,
"EnqueuedExecutions": 0
},
"BacalhauVersion": {
"GitVersion": "",
"GitCommit": "",
"BuildDate": "0001-01-01T00:00:00Z",
"GOOS": "",
"GOARCH": ""
}
},
"Membership": "APPROVED",
"Connection": "CONNECTED"
}
]
}
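A quick way to spot the bad entry in a response like the one above is to look for nodes with an empty NodeID. The following is a minimal sketch, not part of Bacalhau: the `nodeEntry` and `listResponse` types and the `invalidEntries` helper are hypothetical and mirror only the fields visible in the JSON above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeEntry is a hypothetical type that mirrors only the fields of the
// response above that are needed to spot an invalid entry.
type nodeEntry struct {
	Info struct {
		NodeID   string
		NodeType string
	}
	Membership string
	Connection string
}

// listResponse is a hypothetical wrapper for the "Nodes" array.
type listResponse struct {
	Nodes []nodeEntry
}

// invalidEntries returns the indexes of entries whose NodeID is empty.
func invalidEntries(body []byte) ([]int, error) {
	var resp listResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var bad []int
	for i, n := range resp.Nodes {
		if n.Info.NodeID == "" {
			bad = append(bad, i)
		}
	}
	return bad, nil
}

func main() {
	// Trimmed copy of the first two entries from the response above.
	body := []byte(`{"NextToken":"","Nodes":[
	  {"Info":{"NodeID":"","NodeType":"nodeTypeUndefined"},"Membership":"","Connection":"DISCONNECTED"},
	  {"Info":{"NodeID":"QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K","NodeType":"Compute"},"Membership":"","Connection":"CONNECTED"}
	]}`)
	bad, err := invalidEntries(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("entries with empty NodeID:", bad)
}
```

In the response above, only the first entry has an empty NodeID.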
I believe this may be an issue related to previous state in the cluster's node store, as filtering for nodes that are connected works, but filtering for disconnected nodes does not:
export BACALHAU_API_HOST=bootstrap.staging.bacalhau.org
frrist@cypress ~> bacalhau node list --filter-status=connected
ID TYPE APPROVAL STATUS LABELS CPU MEMORY DISK GPU
QmRr9qPT Compute CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
QmVHCeiL Compute CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
Qma5yQAk Compute CONNECTED Architecture=amd64 GPU-0-Memory=15360-MiB 3.2 / 11.7 GB / 30.7 GB / 1 /
GPU-0=Tesla-T4 Operating-System=linux 3.2 11.7 GB 30.7 GB 1
git-lfs=false owner=bacalhau
QmafZ9oC Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 74.0 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 74.0 GB 0
frrist@cypress ~> bacalhau node list --filter-status=disconnected
Error: failed request: invalid node type: nodeTypeUndefined
Usage:
bacalhau node list [flags]
Flags:
--filter-approval string Filter nodes by approval. One of: ["approved" "pending" "rejected"]
--filter-status string Filter nodes by status. One of: ["connected" "disconnected"]
-h, --help help for list
--hide-header do not print the column headers.
--labels string Filter nodes by labels. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more information.
--limit uint32 Limit the number of results returned
--next-token string Next token to use for pagination
--no-style remove all styling from table output.
--order-by string Order results by a field. Valid fields are: id, type, available_cpu, available_memory, available_disk, available_gpu, status
--order-reversed Reverse the order of the results
--output format The output format for the command (one of ["table" "csv" "json" "yaml"]) (default table)
--pretty Pretty print the output. Only applies to json and yaml output formats.
--show strings What column groups to show. Zero or more of: ["labels" "version" "features" "capacity"] (default [labels,capacity])
--wide Print full values in the table results
Global Flags:
--api-host string The host for the client and server to communicate on (via REST).
Ignored if BACALHAU_API_HOST environment variable is set. (default "bootstrap.production.bacalhau.org")
--api-port int The port for the client and server to communicate on (via REST).
Ignored if BACALHAU_API_PORT environment variable is set. (default 1234)
--cacert string The location of a CA certificate file when self-signed certificates
are used by the server
--insecure Enables TLS but does not verify certificates
--log-mode logging-mode Log format: 'default','station','json','combined','event' (default default)
--repo string path to bacalhau repo (default "/home/frrist/.bacalhau")
--tls Instructs the client to use TLS
failed request: invalid node type: nodeTypeUndefined
I have performed the following operations on each node in the cluster to remove the invalid node from the state of the requester node:
bacalhau-vm-stage-0 (requester+compute)
systemctl stop bacalhau
rm /data/compute_store/QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv.registration.lock
rm /data/orchestrator_store/nats-store
systemctl start bacalhau
bacalhau-vm-stage-1 (compute)
systemctl stop bacalhau
rm compute_store/QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ.registration.lock
systemctl start bacalhau
bacalhau-vm-stage-2 (compute)
systemctl stop bacalhau
rm compute_store/QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K.registration.lock
systemctl start bacalhau
bacalhau-vm-stage-3 (compute)
systemctl stop bacalhau
rm compute_store/Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw.registration.lock
systemctl start bacalhau
The node list command is now working as expected:
frrist@cypress ~> bacalhau node list
ID TYPE APPROVAL STATUS LABELS CPU MEMORY DISK GPU
QmRr9qPT Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
QmVHCeiL Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
Qma5yQAk Compute APPROVED CONNECTED Architecture=amd64 GPU-0-Memory=15360-MiB 3.2 / 11.7 GB / 30.7 GB / 1 /
GPU-0=Tesla-T4 Operating-System=linux 3.2 11.7 GB 30.7 GB 1
git-lfs=false owner=bacalhau
QmafZ9oC Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 74.0 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 74.0 GB 0
frrist@cypress ~> bacalhau node list --filter-status=connected
ID TYPE APPROVAL STATUS LABELS CPU MEMORY DISK GPU
QmRr9qPT Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
QmVHCeiL Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 77.3 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 77.3 GB 0
Qma5yQAk Compute APPROVED CONNECTED Architecture=amd64 GPU-0-Memory=15360-MiB 3.2 / 11.7 GB / 30.7 GB / 1 /
GPU-0=Tesla-T4 Operating-System=linux 3.2 11.7 GB 30.7 GB 1
git-lfs=false owner=bacalhau
QmafZ9oC Compute APPROVED CONNECTED Architecture=amd64 Operating-System=linux 3.2 / 12.5 GB / 74.0 GB / 0 /
git-lfs=false owner=bacalhau 3.2 12.5 GB 74.0 GB 0
frrist@cypress ~> bacalhau node list --filter-status=disconnected
ID TYPE APPROVAL STATUS LABELS CPU MEMORY DISK GPU
The cause of this issue relates to changes in the state contained within the NodeStore (NATS KV store) between v1.3.0 and v1.3.1-rc1.
In v1.3.0 the NodeStore operates over, and contains, NodeInfo: https://github.com/bacalhau-project/bacalhau/blob/b09858fc4d7659dcf0a7229e97f3900694991c47/pkg/routing/kvstore/kvstore.go#L70-L82
In v1.3.1-rc1 the NodeStore operates over, and contains, NodeState: https://github.com/bacalhau-project/bacalhau/blob/3b3d8a8a7e4ddd79f38a4cd4b731885237e45245/pkg/routing/kvstore/kvstore.go#L70-L82
NodeInfo cannot be unmarshaled into a NodeState type, which is why the list shows a node with undefined fields. It is data from v1.3.0 contained in the store that no longer meets the requirements of v1.3.1-rc1.
How did we get here? After v1.3.0 was released, several changes were made to the NodeInfo type:
The problem here is that it was never validated that a v1.3.1-rc Requester could open a v1.3.0 Requester store. The fix here appears to be one of:
Bug Description
See title
Expected Behavior
It lists the nodes
Steps to Reproduce
Bacalhau Versions
v1.3.1-rc1
Host Environment