Inventory Operator : Error when discovery pod name exceeds 63 characters

Describe the bug

Inventory operator log shows:

metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]

Pod name example created by hardware discovery:

operator-inventory-hardware-discovery-bdl-computer-wildponyexpress

The inventory will not show for the provider.

To Reproduce

Add a node with a long hostname.

Expected behavior

The inventory operator could use a shorter pod name, such as akash-discovery-$hostname rather than operator-inventory-hardware-discovery-$hostname.
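To make the difference concrete, here is a minimal sketch that compares the two naming schemes against the 63-character DNS label limit. The pod_name helper is hypothetical and only mirrors the prefix-hostname pattern visible in the example pod name; it is not the operator's actual code.

```python
DNS_LABEL_MAX = 63  # limit quoted in the warning message


def pod_name(prefix: str, hostname: str) -> str:
    """Join a prefix and a node hostname as '<prefix>-<hostname>' (hypothetical helper)."""
    return f"{prefix}-{hostname}"


hostname = "bdl-computer-wildponyexpress"
current = pod_name("operator-inventory-hardware-discovery", hostname)
proposed = pod_name("akash-discovery", hostname)

# Print the length of each candidate and whether it fits in a DNS label.
for name in (current, proposed):
    status = "ok" if len(name) <= DNS_LABEL_MAX else "too long"
    print(f"{len(name):3d} {status:8s} {name}")
```

With the current prefix, only 25 characters remain for the hostname; the shorter prefix leaves 47.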

@88plug - could you please verify - when this issue is encountered have you experienced inventory operator functional issues? Or has the name length warning been observed but no functional issue presents itself?

Built a couple of test providers/clusters with names that provoke the name length warning. While the warning is present in the logs, it has had no functional impact on the inventory operator, as shown by the captures from an example provider build below. Want to ensure we fully understand the severity and observed impact.

Details of testing conducted

Results are shown from a single provider, but the same observations hold across multiple clusters.
Node names used:

kubectl get nodes
NAME                           STATUS   ROLES                       AGE    VERSION
bdl-computer-wildponyexpress   Ready    control-plane,etcd,master   119m   v1.29.6+k3s2

Warnings present in inventory operator logs due to the inventory discovery pod name being over 63 characters:

kubectl logs operator-inventory-84f87b58bb-c88ml -n akash-services
I[2024-07-24|17:58:11.545] using in cluster kube config                 cmp=provider
INFO    nodes.node.monitor  starting    {"node": "bdl-computer-wildponyexpress"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "bdl-computer-wildponyexpress"}
INFO    rancher    ADDED monitoring StorageClass    {"name": "local-path"}
W0724 17:58:13.596387       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:14.603637       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:15.608134       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "bdl-computer-wildponyexpress"}
INFO    nodes.node.monitor  started {"node": "bdl-computer-wildponyexpress"}

Despite the DNS name warning, there has been no observed impact on the inventory operator or the bid process. Example snippet of gRPC output from the provider/inventory operator showing correct GPU and other specs:

grpcurl -insecure provider.akashtesting.xyz:8444 akash.provider.v1.ProviderRPC.GetStatus
{
  "cluster": {
    "leases": {},
    "inventory": {
      "cluster": {
        "nodes": [
          {
            "name": "bdl-computer-wildponyexpress",
            "resources": {
              "cpu": {
                "quantity": {
                  "allocatable": {
                    "string": "16"
                  },
                  "allocated": {
                    "string": "2050m"
                  }
                },
                "info": [
                  {
                    "id": "0",
                    "vendor": "GenuineIntel",
                    "model": "Intel(R) Xeon(R) CPU @ 2.30GHz",
                    "vcores": 16
                  }
                ]
              },
              "memory": {
                "quantity": {
                  "allocatable": {
                    "string": "63185473536"
                  },
                  "allocated": {
                    "string": "998Mi"
                  }
                }
              },
              "gpu": {
                "quantity": {
                  "allocatable": {
                    "string": "1"
                  },
                  "allocated": {
                    "string": "0"
                  }
                },
                "info": [
                  {
                    "vendor": "nvidia",
                    "name": "t4",
                    "modelid": "1eb8",
                    "interface": "PCIe",
                    "memorySize": "16Gi"
                  }
                ]
              },

The name length warning has been observed but no functional issue presents itself: that is correct. However, the warning was persistent and prominent enough to raise this issue.

this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]

Now in addition, today I got:

ERROR   nodes.node.monitor  couldn't apply patches for node "akash-node12"  {"error": "Node \"akash-node12\" is invalid: metadata.labels: Invalid value: \"akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie\": name part must be no more than 63 characters"}

This same issue (63 characters) is now causing a full ERROR when trying to label a node.

While these two issues certainly appear identical/very similar they are quite different in that:

1). Original issue is a DNS name length warning.

2). Second issue is a Kubernetes label limitation. In the example the long model name of rtx2070super is causing the entire label of capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie to be greater than 63 characters which fails due to K8s label limitations of:

Label Key: The name part (after the optional prefix and slash) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).
Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.

The label in this example is one character too long at 64 characters.
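As a rough illustration of the rule, the sketch below checks the name part of a label key the way the error message describes it. The check_label_key function and its messages are assumptions made for this example, not part of Kubernetes or the inventory operator.

```python
import re

NAME_MAX = 63
# Name part: alphanumeric at both ends; '-', '_' and '.' allowed in between.
NAME_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$")


def check_label_key(key: str) -> list[str]:
    """Return a list of problems found in the name part of a label key."""
    # An optional prefix is separated from the name part by a single '/'.
    name = key.rsplit("/", 1)[-1]
    problems = []
    if len(name) > NAME_MAX:
        problems.append(f"name part is {len(name)} characters (max {NAME_MAX})")
    if not NAME_RE.match(name):
        problems.append("name part has invalid characters")
    return problems


key = "akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie"
print(check_label_key(key))
```

Only the part after akash.network/ counts toward the 63-character limit, which is why the label from the error above comes out at 64.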

For the sake of clarity - would you mind opening a new issue regarding the K8s max label matter? We will keep this issue open for the DNS warning. In the meantime, I will ensure the core team is aware that the GPU type in this example, or any long model name, will provoke this error. The fix would seem to be shortening the stock text in the labels, i.e. shortening capabilities and/or interface to leave more characters for the model name.
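To get a feel for how much room shortening the stock text would free up, here is a small sketch that computes the character budget left for the model name. The current template is inferred from the label in the error message; the shortened template (caps, iface) is purely hypothetical and only illustrates the idea.

```python
NAME_MAX = 63


def model_budget(template: str, vendor: str, interface: str) -> int:
    """Characters left for the model name once the fixed text is accounted for."""
    fixed = template.format(vendor=vendor, model="", interface=interface)
    return NAME_MAX - len(fixed)


# Template inferred from the failing label in the error message.
current = "capabilities.gpu.vendor.{vendor}.model.{model}.interface.{interface}"
# Hypothetical shortened form, for illustration only.
shortened = "caps.gpu.vendor.{vendor}.model.{model}.iface.{interface}"

for template in (current, shortened):
    print(model_budget(template, "nvidia", "pcie"), template)
```

For nvidia and pcie, the current wording leaves 11 characters for the model name, one fewer than rtx2070super needs; the hypothetical shortened wording would leave 23.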

Looks like the warning did its job - followed by a real error soon enough, albeit on a different object (label). The fundamental lesson is that Kubernetes enforces a 63-character limit in many places, as documented for most objects.

Labels and Selectors, Object Names and IDs, Annotations, Jobs

There is a helpful Medium post detailing the intricacies of the issue.

While these two issues certainly appear identical/very similar they are quite different in that:

1). Original issue is a DNS name length warning.

Object Names and IDs was triggered.

2). Second issue is a Kubernetes label limitation. In the example the long model name of rtx2070super is causing the entire label of capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie to be greater than 63 characters which fails due to K8s label limitations of:
* Label Key: The name part (after the optional prefix and slash) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).

* Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.

The label in this example is one character too long at 64 characters.

Labels and Selectors was triggered.

The fix would seem to be shortening the stock text in the labels, i.e. shortening capabilities and/or interface to leave more characters for the model name.

This was the fix I proposed in the Expected behavior section of this issue.

Next Steps?

* Keep this issue and reduce the length of the pod names used by the inventory operator
* Open a new issue for fixing GPU labeling, including the label and selector requirements
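For the first step, a shorter prefix alone may not be enough if a hostname is itself very long. One possible approach, sketched here under assumptions, is to truncate the hostname and append a short hash so names stay unique. The discovery_pod_name function, the akash-discovery prefix, and the 8-character digest are illustrative choices, not the operator's implementation.

```python
import hashlib

DNS_LABEL_MAX = 63


def discovery_pod_name(hostname: str, prefix: str = "akash-discovery") -> str:
    """Build '<prefix>-<hostname>', shortening the hostname only when needed."""
    name = f"{prefix}-{hostname}"
    if len(name) <= DNS_LABEL_MAX:
        return name
    # Keep names unique after truncation by appending a short hash of the
    # full hostname.
    digest = hashlib.sha256(hostname.encode()).hexdigest()[:8]
    keep = DNS_LABEL_MAX - len(prefix) - len(digest) - 2  # two '-' separators
    # Strip trailing separators so the name never contains '--' or '.-'.
    return f"{prefix}-{hostname[:keep].rstrip('-.')}-{digest}"


print(discovery_pod_name("bdl-computer-wildponyexpress"))
print(discovery_pod_name("n" * 60))
```

Hostnames that already fit are passed through unchanged, so existing short node names would keep readable pod names.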

akash-network / support

Inventory Operator : Error when discovery pod name exceeds 63 characters #233