akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

Inventory Operator : Error when discovery pod name exceeds 63 characters #233

Open 88plug opened 3 months ago

88plug commented 3 months ago

Describe the bug

The inventory operator log shows the following warning:

metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]

Example pod name created by hardware discovery (66 characters, 3 over the limit):

operator-inventory-hardware-discovery-bdl-computer-wildponyexpress

The inventory will not show for the provider.

To Reproduce

Add a node with a long hostname.

Expected behavior

The inventory operator could use a shorter pod name prefix, such as akash-discovery-$hostname rather than operator-inventory-hardware-discovery-$hostname.
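To make the difference concrete, here is a small sketch that compares the two prefixes for the hostname in this report. It assumes the pod name is simply the prefix joined to the node hostname with a hyphen, as the example name above suggests; `pod_name` is a hypothetical helper, not a function from the operator.

```python
DNS_LABEL_MAX = 63  # maximum length of a DNS label (RFC 1123)


def pod_name(prefix: str, hostname: str) -> str:
    """Hypothetical helper: join a prefix and a node hostname with a hyphen."""
    return f"{prefix}-{hostname}"


hostname = "bdl-computer-wildponyexpress"
current = pod_name("operator-inventory-hardware-discovery", hostname)
proposed = pod_name("akash-discovery", hostname)

# Print the length of each candidate name and whether it fits in a DNS label.
for name in (current, proposed):
    status = "ok" if len(name) <= DNS_LABEL_MAX else "too long"
    print(f"{len(name):3d} {status:8s} {name}")
```

With the shorter prefix, this hostname yields a 44-character name, and hostnames of up to 47 characters would still fit within the 63-character limit.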


chainzero commented 2 months ago

@88plug - could you please verify: when this issue is encountered, have you experienced functional issues with the inventory operator? Or has the name length warning been observed without any functional issue?

I built a couple of test providers/clusters with names that provoke the name length warning. While the warning is present in the logs, it has had no functional impact on the inventory operator, as shown in the captures from an example provider build below. I want to ensure we fully understand the severity and observed impact.

Details of testing conducted

kubectl get nodes
NAME                           STATUS   ROLES                       AGE    VERSION
bdl-computer-wildponyexpress   Ready    control-plane,etcd,master   119m   v1.29.6+k3s2
kubectl logs operator-inventory-84f87b58bb-c88ml -n akash-services
I[2024-07-24|17:58:11.545] using in cluster kube config                 cmp=provider
INFO    nodes.node.monitor  starting    {"node": "bdl-computer-wildponyexpress"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "bdl-computer-wildponyexpress"}
INFO    rancher    ADDED monitoring StorageClass    {"name": "local-path"}
W0724 17:58:13.596387       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:14.603637       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:15.608134       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "bdl-computer-wildponyexpress"}
INFO    nodes.node.monitor  started {"node": "bdl-computer-wildponyexpress"}
grpcurl -insecure provider.akashtesting.xyz:8444 akash.provider.v1.ProviderRPC.GetStatus
{
  "cluster": {
    "leases": {},
    "inventory": {
      "cluster": {
        "nodes": [
          {
            "name": "bdl-computer-wildponyexpress",
            "resources": {
              "cpu": {
                "quantity": {
                  "allocatable": {
                    "string": "16"
                  },
                  "allocated": {
                    "string": "2050m"
                  }
                },
                "info": [
                  {
                    "id": "0",
                    "vendor": "GenuineIntel",
                    "model": "Intel(R) Xeon(R) CPU @ 2.30GHz",
                    "vcores": 16
                  }
                ]
              },
              "memory": {
                "quantity": {
                  "allocatable": {
                    "string": "63185473536"
                  },
                  "allocated": {
                    "string": "998Mi"
                  }
                }
              },
              "gpu": {
                "quantity": {
                  "allocatable": {
                    "string": "1"
                  },
                  "allocated": {
                    "string": "0"
                  }
                },
                "info": [
                  {
                    "vendor": "nvidia",
                    "name": "t4",
                    "modelid": "1eb8",
                    "interface": "PCIe",
                    "memorySize": "16Gi"
                  }
                ]
              },

88plug commented 2 months ago

That is correct: the name length warning has been observed, but no functional issue presents itself. However, the warning was persistent and prominent enough to warrant raising this issue.

this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]

Now in addition, today I got:

ERROR   nodes.node.monitor  couldn't apply patches for node "akash-node12"  {"error": "Node \"akash-node12\" is invalid: metadata.labels: Invalid value: \"akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie\": name part must be no more than 63 characters"}

The same 63-character limit is now causing a full ERROR when the operator tries to label a node.

chainzero commented 2 months ago

While these two issues certainly appear identical or very similar, they are quite different in that:

1). Original issue is a DNS name length warning.

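2). Second issue is a Kubernetes label limitation. In the example, the long model name rtx2070super makes the name part of the label key, capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie, longer than 63 characters, which fails due to the K8s label limitations:

* Label Key: The name part (after the optional prefix and slash) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).

* Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.

The name part in this example is one character too long at 64 characters.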

For the sake of clarity, would you mind opening a new issue for the K8s label length matter? We will keep this issue open for the DNS warning. In the meantime, I will ensure the core team is aware that the GPU type in the example, or any long model name, will provoke this error. The fix would seem to be shortening the stock text in the labels, i.e. shorten capabilities and/or interface to leave more characters for the model name.
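The arithmetic behind the one-character overflow can be sketched as follows. The label key is copied from the error message in this thread; the variable names are illustrative only.

```python
LABEL_NAME_MAX = 63  # limit on the name part of a label key (after "prefix/")

# Label key taken from the error message above.
key = "akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie"
prefix, name = key.split("/", 1)
model = "rtx2070super"

fixed_text = len(name) - len(model)   # characters that do not come from the model name
budget = LABEL_NAME_MAX - fixed_text  # characters left over for the model name

print(f"name part: {len(name)} chars, fixed text: {fixed_text}, model budget: {budget}")
```

The stock text uses 52 of the 63 available characters, which leaves 11 for the model name, while rtx2070super needs 12. Every character trimmed from the stock text adds one character to that budget.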

88plug commented 2 months ago

> For the sake of clarity, would you mind opening a new issue for the K8s label length matter? We will keep this issue open for the DNS warning. In the meantime, I will ensure the core team is aware that the GPU type in the example, or any long model name, will provoke this error. The fix would seem to be shortening the stock text in the labels, i.e. shorten capabilities and/or interface to leave more characters for the model name.

Looks like the warning did its job, followed soon enough by a real error, albeit on a different object (a label). The fundamental lesson is that Kubernetes enforces a 63-character limit in many places, as documented for most object types:

Labels and Selectors, Object Names and IDs, Annotations, Jobs

There is a helpful Medium post detailing the intricacies of the issue.

> While these two issues certainly appear identical or very similar, they are quite different in that:
>
> 1). Original issue is a DNS name length warning.

The limit documented in Object Names and IDs was triggered.

> 2). Second issue is a Kubernetes label limitation. In the example, the long model name rtx2070super makes the name part of the label key, capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie, longer than 63 characters, which fails due to the K8s label limitations:
>
> * Label Key: The name part (after the optional prefix and slash) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).
>
> * Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.
>
> The name part in this example is one character too long at 64 characters.

The limit documented in Labels and Selectors was triggered.
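For anyone who wants to check a label key before it reaches the API server, the following is a rough sketch of the key rules as I understand them from the Kubernetes documentation: an optional DNS subdomain prefix of at most 253 characters, a slash, and a name part of at most 63 characters. `label_key_errors` is a hypothetical helper, not part of Kubernetes or the Akash operator, and its messages only approximate the server's wording.

```python
import re

# Name part: alphanumeric at both ends, with '-', '_', '.' allowed in between.
NAME_RE = re.compile(r"([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]")
# Prefix part: a lowercase DNS subdomain (dot-separated DNS labels).
SUBDOMAIN_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def label_key_errors(key: str) -> list[str]:
    """Return a list of rule violations for a label key (empty if it looks valid)."""
    errors = []
    prefix, sep, name = key.rpartition("/")
    if sep and (len(prefix) > 253 or not SUBDOMAIN_RE.fullmatch(prefix)):
        errors.append("prefix part must be a DNS subdomain of at most 253 characters")
    if len(name) > 63:
        errors.append("name part must be no more than 63 characters")
    if not NAME_RE.fullmatch(name):
        errors.append("name part must start and end with an alphanumeric character")
    return errors


# The key from the error above, and the same pattern with the shorter t4 model
# name (an assumed label for illustration, not one shown in the logs).
long_key = "akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie"
short_key = "akash.network/capabilities.gpu.vendor.nvidia.model.t4.interface.pcie"

print(label_key_errors(long_key))
print(label_key_errors(short_key))
```

Only the length rule fails for the rtx2070super key; its character set is fine.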

> The fix would seem to be shortening the stock text in the labels, i.e. shorten capabilities and/or interface to leave more characters for the model name.

This was the fix I proposed in the Expected behavior section of this issue.

Next Steps?

  1. Keep this issue and reduce the length of the pod names used by the inventory operator
  2. Open a new issue for fixing GPU labeling, including the label and selector requirements
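For step 1, a shorter prefix helps but does not bound the name, since a long enough hostname can still exceed the limit. One possible approach, sketched below, is to truncate over-long names and append a short hash so they stay unique and deterministic. This is only an illustration under my own assumptions: `short_pod_name` is hypothetical, and nothing here reflects how the operator actually builds names.

```python
import hashlib

DNS_LABEL_MAX = 63  # maximum length of a DNS label


def short_pod_name(prefix: str, hostname: str, limit: int = DNS_LABEL_MAX) -> str:
    """Hypothetical helper: build '<prefix>-<hostname>', truncated to fit the limit.

    Names that already fit are returned unchanged. Longer names are cut and
    suffixed with 8 hex characters of a SHA-256 digest of the full name, so
    the same input always produces the same output.
    """
    full = f"{prefix}-{hostname}"
    if len(full) <= limit:
        return full
    digest = hashlib.sha256(full.encode()).hexdigest()[:8]
    # Leave room for '-' plus the digest, and avoid ending the head on a separator.
    head = full[: limit - len(digest) - 1].rstrip("-.")
    return f"{head}-{digest}"


long_name = short_pod_name(
    "operator-inventory-hardware-discovery", "bdl-computer-wildponyexpress"
)
print(len(long_name), long_name)
print(short_pod_name("akash-discovery", "node1"))
```

The trade-off is that a truncated name no longer contains the full hostname, so the operator would need to find the node through a label or annotation rather than by parsing the pod name.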