Closed: yingca1 closed this issue 1 year ago.
looks like this target somehow ended up with an invalid cluster map - happens when you play with deployment and redeployment, so try to clean up its metadata and restart
Separately: pub port 51081 - is it on purpose? Suggest checking everything related to networking configuration, including the MTU - look up "jumbo frames".
looks like this target somehow ended up with an invalid cluster map - happens when you play with deployment and redeployment, so try to clean up its metadata and restart
As it is a pod in Kubernetes, if the health check fails, it will keep restarting and repeating the error. Could you please clarify which files or operations "clean up its metadata" refers to?
Separately: pub port 51081 - is it on purpose? Suggest checking everything related to networking configuration, including the MTU - look up "jumbo frames".
"pub port 51081" refers to the code here. I'm not sure if it's correct or why it's configured as 51081. Thank you very much for your suggestion; I'll try configuring the MTU right away.
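As a sanity check, here is a quick way to list the effective MTU of every interface on a node or inside a pod (a minimal Go sketch; whether jumbo frames, i.e. MTU ~9000, are in effect depends entirely on the environment):

```go
// mtucheck.go - print the MTU of every network interface, to verify
// whether jumbo frames (MTU ~9000) are actually in effect.
package main

import (
	"fmt"
	"net"
)

func main() {
	ifaces, err := net.Interfaces()
	if err != nil {
		panic(err)
	}
	for _, ifc := range ifaces {
		fmt.Printf("%-12s MTU=%d\n", ifc.Name, ifc.MTU)
	}
}
```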
@yingca1 do you mind elaborating a bit on your setup and how we can reproduce this error?
I would be interested in knowing:
@saiprashanth173
Where you are deploying your aistore cluster, e.g. minikube, baremetal K8s cluster or a public cloud K8s offering?
I launched and tested it on GKE using Terraform.
Do you see these logs in all your targets?
Does it happen when you create the AIStore cluster for the first time? Or do you see it when you upgrade/patch an existing cluster?
This error occurs often. For example, when initializing with 3 proxies and 3 targets, manually scaling the targets to 5 is very likely to result in this error. Once, when I deleted all Kubernetes resources and redeployed, all targets reported similar errors.
Question: Is there a way to manually add a target to the aistore cluster in a Kubernetes cluster? For example, if I increase the liveness timeout, can I manually fix the issue of the target not being able to join the cluster during that time?
we are adding and removing targets all the time, it's part of the story since day one
by way of background info: mountpath
I guess in your case the smap, i.e. cluster map, from your previous deployment is causing your target to fail.
One option would be to clean up the metadata directory /etc/ais (see) on each node before you try to deploy a new aistore cluster and test it. Maybe it also makes sense to delete your old PV(C)s. Unfortunately, we don't have any helpers to do that on a GKE setup, so you might have to do it manually (a sketch follows below). Here is the ansible role we use for a baremetal setup, if you would like to take a look as a reference.
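For illustration only - a minimal Go sketch of what that cleanup could look like when run on each node. The /etc/ais path is an assumption based on the default setup; adjust it to wherever your deployment actually stores the metadata:

```go
// cleanup_metadata.go - remove leftover aistore metadata on a node before
// redeploying. Run on each Kubernetes node. The path below is an assumption:
// the operator derives the actual host path from the AIStore spec
// (hostpathPrefix/namespace/name/daemonType).
package main

import (
	"fmt"
	"os"
)

const metadataDir = "/etc/ais" // placeholder - adjust to your deployment

func main() {
	if err := os.RemoveAll(metadataDir); err != nil {
		fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", metadataDir, err)
		os.Exit(1)
	}
	fmt.Printf("removed %s - safe to redeploy\n", metadataDir)
}
```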
Currently, we mostly use GKE in our CI, where we set up our GKE cluster using terraform, deploy aistore, and run tests, which also include scale-up (adding targets), scale-down (removing targets), etc. After we run all our tests, we destroy the entire cluster, so we never needed such utilities.
For reference, some test cases that run similar scenarios: https://github.com/NVIDIA/ais-k8s/blob/master/operator/tests/integration/cluster_test.go#L164 https://github.com/NVIDIA/ais-k8s/blob/master/operator/tests/integration/cluster_test.go#L240
Question: Is there a way to manually add a target to the aistore cluster in a Kubernetes cluster? For example, if I increase the liveness timeout, can I manually fix the issue of the target not being able to join the cluster during that time?
Please let us know if cleaning metadata directory fixes your issue.
Could you elaborate on how you plan to manually fix the issue? In general, if your aistore cluster is being managed by the ais operator, I would prefer using the scale-up feature to add a new target instead of trying to add it manually.
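For illustration, scaling up could look something like this - a minimal Go sketch using client-go's dynamic client. The group/version ais.nvidia.com/v1beta1, the resource aistores, the spec.size field, and the CR name/namespace are my assumptions about a typical operator deployment; adjust to yours:

```go
// scale_targets.go - bump the size of an operator-managed cluster by
// patching the AIStore custom resource. CR name/namespace are placeholders.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// GVR of the AIStore CRD installed by the ais-k8s operator (assumed).
	gvr := schema.GroupVersionResource{
		Group:    "ais.nvidia.com",
		Version:  "v1beta1",
		Resource: "aistores",
	}
	// Merge-patch the desired cluster size, e.g. from 3 to 5.
	patch := []byte(`{"spec":{"size":5}}`)
	_, err = client.Resource(gvr).Namespace("ais-operator-system").
		Patch(context.Background(), "aistore", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("patched AIStore CR: size=5")
}
```

The point of going through the CR is that the operator then performs the join/rebalance handshake for you, rather than a pod trying to enter the cluster on its own.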
Here's what I did and it seems like the problem hasn't occurred again:
Upgraded GKE from version 1.25.8-gke.1000 (default) to 1.26.3-gke.1000 (I was originally planning to change the MTU, but it was mentioned that GKE version 1.26.1 and later can inherit the MTU from the primary interface of the node).
I only upgraded the Kubernetes cluster version here, without changing the MTU.
Added a sysctls configuration in Terraform, following this guide. The commented-out entries are not supported by Terraform:
```hcl
...
node_config {
  linux_node_config {
    sysctls = {
      "net.core.somaxconn"          = "100000"
      "net.ipv4.tcp_tw_reuse"       = "1"
      # "net.ipv4.ip_local_port_range" = "2048 65535"
      # "net.ipv4.tcp_max_tw_buckets"  = "1440000"
      # "net.ipv4.ip_forward"          = "1"
      "net.core.rmem_max"           = "268435456"
      "net.core.wmem_max"           = "268435456"
      # "net.core.rmem_default"        = "25165824"
      "net.core.optmem_max"         = "25165824"
      "net.core.netdev_max_backlog" = "250000"
      "net.ipv4.tcp_wmem"           = "4096 12582912 268435456"
      "net.ipv4.tcp_rmem"           = "4096 12582912 268435456"
      # "net.ipv4.tcp_adv_win_scale"   = "1"
      # "net.ipv4.tcp_mtu_probing"     = "2"
      # "net.ipv4.tcp_slow_start_after_idle" = "0"
      # "net.ipv4.tcp_low_latency"     = "1"
      # "net.ipv4.tcp_timestamps"      = "0"
      # "vm.vfs_cache_pressure"        = "50"
      # "net.ipv4.tcp_max_syn_backlog" = "100000"
      # "net.ipv4.tcp_rfc1337"         = "1"
      # "vm.swappiness"                = "10"
      # "vm.min_free_kbytes"           = "262144"
    }
  }
  ...
}
```
Destroyed the old Kubernetes cluster and reconfigured it.
Could you elaborate on how you plan to manually fix the issue? In general, if your aistore cluster is being managed by the ais operator, I would prefer using the scale-up feature to add a new target instead of trying to add it manually.
@saiprashanth173 To be honest, after creating a complete aistore cluster using the operator, I dumped all the resources of the ais-operator-system namespace in Kubernetes into a YAML file. This allowed me to quickly debug by directly modifying the YAML resource file.
This issue has come up again, but I think I've found the root cause.
TODO: Please wait for me to update here later
Update
To solve the error in the issue, it's clear that deleting the metadata in /etc/ais is an effective method.
The root cause of my problem was the following actions, taken without deleting the Kubernetes cluster:
All of these situations can easily cause this problem.
The operator can deploy either proxy or target pods, both of which cache their state in the state-mount: the pod's /etc/ais mount maps to the Kubernetes node path path.Join(ais.Spec.HostpathPrefix, ais.Namespace, ais.Name, daeType) - see the sketch below:
proxy:
target:
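A minimal sketch of how that node path is computed; all concrete values here are placeholders for illustration:

```go
// statepaths.go - where the operator keeps each daemon's cached state on the
// node, mirroring path.Join(ais.Spec.HostpathPrefix, ais.Namespace, ais.Name,
// daeType). All values below are placeholders.
package main

import (
	"fmt"
	"path"
)

func main() {
	hostpathPrefix := "/etc/ais"       // ais.Spec.HostpathPrefix (placeholder)
	namespace := "ais-operator-system" // ais.Namespace (placeholder)
	name := "aistore"                  // ais.Name (placeholder)
	for _, daeType := range []string{"proxy", "target"} {
		// e.g. /etc/ais/ais-operator-system/aistore/proxy
		fmt.Println(path.Join(hostpathPrefix, namespace, name, daeType))
	}
}
```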
@saiprashanth173 To be honest, after creating a complete aistore cluster using the operator, I dumped all the resources of the ais-operator-system namespace in Kubernetes into a YAML file. This allowed me to quickly debug by directly modifying the YAML resource file.
I am a bit curious how fixing the YAML resources fixed this issue. To me, from the error log you shared, this seems to be a storage target issue and has less to do with the Kubernetes resources themselves.
E 19:59:43.963996 htrun.go:429 FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
It would be helpful if you can also share what you had to change in the YAMLs to be able to debug and fix this issue. We can use that to decide if we need to fix/extend the K8s operator.
This issue has come up again, but I think I've found the root cause.
TODO: Please wait for me to update here later
That is amazing! Will wait for your update :slightly_smiling_face:. Thanks!
Maybe a little bit more info on the original error - though I'm not sure you need it at this point:
I 19:59:42.787654 daemon.go:175 Version 3.17.66420ae, build time 2023-06-08T21:39:50+0000, debug false
First, there's the version. It's a not-yet-released v3.18, which is in progress, changing fast, and a bit risky to use. We typically release aistore and the ais operator more or less at the same time, as a package.
Secondly, there's the question of persistence:
FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
On the one hand, the cluster map (aka Smap) is stored on the host under /etc/ais. On the other hand, the target ID is stored at the root of each mountpath, on data drives that Kubernetes calls "persistent volumes". One would hope they are persistent enough and have the right affinity, etc. Something to maybe check - see the sketch below.
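If you'd like to verify, here's a minimal Go sketch that could be run inside a target pod. The metadata file names (.ais.smap, .ais.vmd) and the example mountpath /ais/sda are assumptions - adjust to your configuration:

```go
// checkmeta.go - list the persisted metadata a target reads at startup: the
// cluster map under the config dir and the volume metadata at each mountpath
// root. File names and paths below are assumptions - adjust as needed.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	candidates := []string{
		filepath.Join("/etc/ais", ".ais.smap"), // cluster map (assumed name)
		filepath.Join("/ais/sda", ".ais.vmd"),  // volume metadata at a mountpath root (assumed)
	}
	for _, p := range candidates {
		fi, err := os.Stat(p)
		if err != nil {
			fmt.Printf("%-30s missing (%v)\n", p, err)
			continue
		}
		fmt.Printf("%-30s %d bytes, modified %s\n",
			p, fi.Size(), fi.ModTime().Format("2006-01-02 15:04:05"))
	}
}
```

If the Smap survives a redeployment while the target IDs on the persistent volumes do not (or vice versa), you end up with exactly the "not present in the loaded Smap" failure above.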
deploy method: ais-k8s operator
docker image:
error log:
How do I appropriately handle this type of issue?