Open 88plug opened 3 months ago
@88plug - could you please verify - when this issue is encountered have you experienced inventory operator functional issues? Or has the name length warning been observed but no functional issue presents itself?
Built a couple of test providers/clusters with names that provoke the name length warning. While the warning is present in the logs, it has had no functional impact on the inventory operator, as the captures from an example provider build below show. I want to ensure we understand the severity and observed impact fully.
Details of testing conducted
Results shown from single provider but same observations across multiple clusters
Node names used:
kubectl get nodes
NAME                           STATUS   ROLES                       AGE    VERSION
bdl-computer-wildponyexpress   Ready    control-plane,etcd,master   119m   v1.29.6+k3s2
kubectl logs operator-inventory-84f87b58bb-c88ml -n akash-services
I[2024-07-24|17:58:11.545] using in cluster kube config cmp=provider
INFO nodes.node.monitor starting {"node": "bdl-computer-wildponyexpress"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "bdl-computer-wildponyexpress"}
INFO rancher ADDED monitoring StorageClass {"name": "local-path"}
W0724 17:58:13.596387 7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:14.603637 7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
W0724 17:58:15.608134 7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
INFO nodes.node.discovery started hardware discovery pod {"node": "bdl-computer-wildponyexpress"}
INFO nodes.node.monitor started {"node": "bdl-computer-wildponyexpress"}
grpcurl -insecure provider.akashtesting.xyz:8444 akash.provider.v1.ProviderRPC.GetStatus
{
  "cluster": {
    "leases": {},
    "inventory": {
      "cluster": {
        "nodes": [
          {
            "name": "bdl-computer-wildponyexpress",
            "resources": {
              "cpu": {
                "quantity": {
                  "allocatable": {
                    "string": "16"
                  },
                  "allocated": {
                    "string": "2050m"
                  }
                },
                "info": [
                  {
                    "id": "0",
                    "vendor": "GenuineIntel",
                    "model": "Intel(R) Xeon(R) CPU @ 2.30GHz",
                    "vcores": 16
                  }
                ]
              },
              "memory": {
                "quantity": {
                  "allocatable": {
                    "string": "63185473536"
                  },
                  "allocated": {
                    "string": "998Mi"
                  }
                }
              },
              "gpu": {
                "quantity": {
                  "allocatable": {
                    "string": "1"
                  },
                  "allocated": {
                    "string": "0"
                  }
                },
                "info": [
                  {
                    "vendor": "nvidia",
                    "name": "t4",
                    "modelid": "1eb8",
                    "interface": "PCIe",
                    "memorySize": "16Gi"
                  }
                ]
              },
That is correct: the name length warning has been observed, but no functional issue presents itself. However, the warning was persistent and prominent enough to justify raising this issue:
this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
Now in addition, today I got:
ERROR nodes.node.monitor couldn't apply patches for node "akash-node12" {"error": "Node \"akash-node12\" is invalid: metadata.labels: Invalid value: \"akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie\": name part must be no more than 63 characters"}
This same issue (63 characters) is now causing a full ERROR when trying to label a node.
While these two issues certainly appear identical/very similar they are quite different in that:
1). Original issue is a DNS name length warning.
2). Second issue is a Kubernetes label limitation. In the example, the long model name rtx2070super causes the label name capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie to exceed 63 characters, which fails due to the K8s label limitations:
Label Key: The name segment (the part after the optional prefix and slash, here akash.network/) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).
Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.
The label in this example is one character too long at 64 characters.
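To make the count concrete, here is a minimal Python sketch of the length check. It is not the operator's code; the helper names `label_name_part` and `name_part_overflow` are made up for illustration, and only the 63-character rule and the label key come from this thread.

```python
# Sketch of the length rule for a Kubernetes label key (not the operator's code).
MAX_NAME_PART = 63  # limit on the name segment, as quoted in the error message


def label_name_part(key: str) -> str:
    """Return the segment after the optional '<prefix>/' of a label key."""
    return key.rsplit("/", 1)[-1]


def name_part_overflow(key: str) -> int:
    """Characters by which the name segment exceeds the limit (0 if it fits)."""
    return max(0, len(label_name_part(key)) - MAX_NAME_PART)


key = "akash.network/capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie"
print(len(label_name_part(key)), name_part_overflow(key))  # 64 1
```

The same check applied to the t4 model from the test provider above passes, since the shorter model name leaves the name segment well under the limit.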
For the sake of clarity, would you mind opening a new issue for the K8s max label length matter? We will keep this issue open for the DNS warning. In the meantime I will make sure the core team is aware that the GPU type in the example, or any long model name, will provoke this error. The fix would seem to be shortening the stock text in the labels, i.e. shorten capabilities and/or interface to leave more characters for the part derived from the model name.
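As a rough illustration of how much room the stock text leaves for the model name, the following Python sketch computes the character budget. The `shortened` template is a hypothetical wording chosen only for the arithmetic, not a proposed naming scheme, and `model_budget` is an illustrative helper.

```python
MAX_NAME_PART = 63  # limit on the label key's name segment


def model_budget(template: str) -> int:
    """Characters left for the model name once the fixed text is counted."""
    fixed = len(template.format(model=""))
    return MAX_NAME_PART - fixed


# Label layout taken from the error message in this thread.
current = "capabilities.gpu.vendor.nvidia.model.{model}.interface.pcie"
# Hypothetical shortened wording, for illustration only.
shortened = "caps.gpu.vendor.nvidia.model.{model}.iface.pcie"

print(model_budget(current))    # 11, so the 12-character "rtx2070super" does not fit
print(model_budget(shortened))  # 23
```

Note that the vendor and interface segments also vary in length, so the real budget depends on the specific GPU being labeled.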
Looks like the warning did its job: it was followed soon enough by a real error, albeit on a different object (a label). The fundamental lesson is that Kubernetes enforces a 63-character limit in many places, as documented for most objects:
Labels and Selectors, Object Names and IDs, Annotations, Jobs
There is a helpful Medium post detailing the intricacies of the issue.
While these two issues certainly appear identical/very similar they are quite different in that:
1). Original issue is a DNS name length warning.
Object Names and IDs was triggered.
2). Second issue is a Kubernetes label limitation. In the example, the long model name rtx2070super causes the label name capabilities.gpu.vendor.nvidia.model.rtx2070super.interface.pcie to exceed 63 characters, which fails due to the K8s label limitations:
* Label Key: The name segment (the part after the optional prefix and slash, here akash.network/) must be 63 characters or less, beginning and ending with a letter or number, and containing only letters, numbers, dashes (-), underscores (_), and dots (.).
* Label Value: Must be 63 characters or less, following the same character constraints as the key name. Values can be empty.
The label in this example is one character too long at 64 characters.
Labels and Selectors was triggered.
The fix would seem to be shortening the stock text in the labels, i.e. shorten capabilities and/or interface to leave more characters for the part derived from the model name.
This was the fix I proposed in the Expected behavior section of this issue.
Next Steps?
Describe the bug
Inventory operator log shows:
metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
Pod name example created by hardware discovery:
operator-inventory-hardware-discovery-bdl-computer-wildponyexpress
The inventory will not show for the provider.
To Reproduce
Add a node with a long hostname.
Expected behavior
Inventory operator can use a shorter pod name like akash-discovery-$hostname, rather than operator-inventory-hardware-discovery-$hostname.
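The effect of the shorter prefix can be checked with a small Python sketch. The `discovery_pod_name` helper is illustrative and simply mirrors the `<prefix>-<hostname>` pattern shown above; it is not taken from the operator source.

```python
MAX_DNS_LABEL = 63  # DNS label limit that the warning refers to


def discovery_pod_name(prefix: str, hostname: str) -> str:
    """Build a pod name following the '<prefix>-<hostname>' pattern from this issue."""
    return f"{prefix}-{hostname}"


hostname = "bdl-computer-wildponyexpress"
for prefix in ("operator-inventory-hardware-discovery", "akash-discovery"):
    name = discovery_pod_name(prefix, hostname)
    # Print the name, its length, and whether it fits in a DNS label.
    print(name, len(name), len(name) <= MAX_DNS_LABEL)
```

With the current prefix the example name is 66 characters, which triggers the warning; with the proposed prefix it is 44. Put differently, the current prefix leaves 25 characters for the hostname, while the shorter one leaves 47.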