NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

The aistore cluster's target node is not stable and is easily rejected by the cluster. #134

Closed yingca1 closed 1 year ago

yingca1 commented 1 year ago

deploy method: ais-k8s operator

docker image:

error log:

```
aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=target -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false
W 19:59:42.787163 config.go:1253 control and data share one intra-cluster network (aistore1-target-2.aistore1-target.ais.svc.cluster.local)
I 19:59:42.787625 config.go:1796 log.dir: "/var/log/ais"; l4.proto: tcp; pub port: 51081; verbosity: 3
I 19:59:42.787637 config.go:1798 config: "/etc/ais/.ais.conf"; stats_time: 10s; authentication: false; backends: [aws gcp]
I 19:59:42.787654 daemon.go:175 Version 3.17.66420ae, build time 2023-06-08T21:39:50+0000, debug false
I 19:59:42.787660 daemon.go:183 CPUs(4, runtime=4), containerized
I 19:59:42.787824 util.go:39 Verifying type of deployment (HOSTNAME: "aistore1-target-2", K8S_NODE_NAME: "")
I 19:59:42.807851 util.go:76 Successfully got node name "gke-ais1-ais1-5aa6ad88-p393", assuming Kubernetes deployment
I 19:59:42.807926 utils.go:149 Found only one IPv4: 10.72.7.9, MTU 1460
W 19:59:42.807935 utils.go:151 IPv4 10.72.7.9 MTU size is small: 1460
I 19:59:42.807947 htrun.go:346 PUBLIC (user) access: [{10.72.7.9 51081 http://10.72.7.9:51081 10.72.7.9:51081}]
I 19:59:42.807955 utils.go:112 Selecting one of the configured IPv4 addresses: [aistore1-target-2.aistore1-target.ais.svc.cluster.local]...
W 19:59:42.807959 utils.go:119 failed to parse IP for hostname "aistore1-target-2.aistore1-target.ais.svc.cluster.local"
W 19:59:42.808205 utils.go:132 Selected IPv4 10.72.7.9 from the configuration file
I 19:59:42.808221 htrun.go:359 INTRA-CONTROL access: [{aistore1-target-2.aistore1-target.ais.svc.cluster.local 51082 http://aistore1-target-2.aistore1-target.ais.svc.cluster.local:51082 aistore1-target-2.aistore1-target.ais.svc.cluster.local:51082}] (config: aistore1-target-2.aistore1-target.ais.svc.cluster.local)
I 19:59:42.808227 utils.go:112 Selecting one of the configured IPv4 addresses: [aistore1-target-2.aistore1-target.ais.svc.cluster.local]...
W 19:59:42.808232 utils.go:119 failed to parse IP for hostname "aistore1-target-2.aistore1-target.ais.svc.cluster.local"
W 19:59:42.808283 utils.go:132 Selected IPv4 10.72.7.9 from the configuration file
I 19:59:42.808292 htrun.go:372 INTRA-DATA access: [{aistore1-target-2.aistore1-target.ais.svc.cluster.local 51083 http://aistore1-target-2.aistore1-target.ais.svc.cluster.local:51083 aistore1-target-2.aistore1-target.ais.svc.cluster.local:51083}] (config: aistore1-target-2.aistore1-target.ais.svc.cluster.local)
I 19:59:42.808492 init.go:213 fHUWONyl.gmm[(used 4GiB, free 4GiB, buffcache 3GiB, actfree 7GiB), (min-free 2GiB, low-wm 4GiB), pressure 'low'] started
I 19:59:42.808599 init.go:213 fHUWONyl.smm[(used 4GiB, free 4GiB, buffcache 3GiB, actfree 7GiB), (min-free 2GiB, low-wm 4GiB), pressure 'low'] started
I 19:59:42.813384 dutils_linux.go:109 /dev/nvme0n2: map[nvme0n2:512]
I 19:59:42.813425 vinit.go:79 VMD v1(fHUWONyl, [/ais1])
I 19:59:42.813496 Using Prometheus
I 19:59:43.813746 fshc.go:60 Starting fshc
I 19:59:43.813770 collect.go:48 Intra-cluster networking: fasthttp client
I 19:59:43.813791 collect.go:49 Starting stream_collector
W 19:59:43.963008 bucketmeta.go:376 initializing new BMD v0
I 19:59:43.963036 etlmeta.go:221 initializing new EtlMD v0(0)
E 19:59:43.963996 htrun.go:429 FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
cat: /var/log/ais/aisnode.INFO: No such file or directory
cat: /var/log/ais/aisnode.ERROR: No such file or directory
cat: /var/log/ais/aisnode.WARNING: No such file or directory
```

What is the appropriate way to handle this type of failure?

alex-aizman commented 1 year ago

Looks like this target somehow ended up with an invalid cluster map - this happens when you play with deployment and redeployment, so try to clean up its metadata and restart.
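
For a single stuck target, that cleanup might look roughly like this - a sketch assuming the pod/namespace names from the log above and the default /etc/ais metadata location, not an official procedure:

```bash
# Remove the stale cluster-map/BMD metadata inside the target pod
# (pod and namespace names are taken from this thread's logs).
kubectl -n ais exec aistore1-target-2 -- sh -c 'rm -f /etc/ais/.ais.smap /etc/ais/.ais.bmd*'

# Delete the pod so the statefulset recreates it and it rejoins with a fresh map.
kubectl -n ais delete pod aistore1-target-2
```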

alex-aizman commented 1 year ago

Separately: pub port 51081 - is it on purpose? Suggest checking everything related to the networking configuration, including MTU - look up "jumbo".

yingca1 commented 1 year ago

> Looks like this target somehow ended up with an invalid cluster map - this happens when you play with deployment and redeployment, so try to clean up its metadata and restart.

Since it is a pod in Kubernetes, when the health check fails it keeps restarting and hitting the same error. Could you please clarify which files or operations "clean up its metadata" refers to?

yingca1 commented 1 year ago

> Separately: pub port 51081 - is it on purpose? Suggest checking everything related to the networking configuration, including MTU - look up "jumbo".

pub port 51081 refers to the code here. I'm not sure whether it's correct or why it's configured as 51081. Thank you very much for the suggestion - I'll try configuring the MTU right away.
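
If it helps, here is one way to check the effective MTU and probe for fragmentation between pods - a sketch assuming the pod names above, an eth0 interface, and iputils ping available in the image:

```bash
# Read the interface MTU from inside a target pod (the log shows 1460,
# which is typical for GKE VPC networks).
kubectl -n ais exec aistore1-target-2 -- cat /sys/class/net/eth0/mtu

# Probe with the don't-fragment bit set: at MTU 1460, a 1432-byte ICMP
# payload (1460 minus 28 bytes of IP+ICMP headers) should still pass.
kubectl -n ais exec aistore1-target-2 -- \
  ping -c1 -M do -s 1432 aistore1-target-1.aistore1-target.ais.svc.cluster.local
```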

saiprashanth173 commented 1 year ago

@yingca1 do you mind elaborating a bit on your setup and how we can reproduce this error?

I would be interested in knowing:

  1. Where are you deploying your aistore cluster, e.g. minikube, a baremetal K8s cluster, or a public cloud K8s offering?
  2. Do you see these logs in all your targets?
  3. Does it happen when you apply/create the AIStore cluster for the first time? Or do you see it when you upgrade/patch an existing cluster?

yingca1 commented 1 year ago

@saiprashanth173

> 1. Where are you deploying your aistore cluster, e.g. minikube, a baremetal K8s cluster, or a public cloud K8s offering?

I launched and tested it on GKE using Terraform.

> 2. Do you see these logs in all your targets?

> 3. Does it happen when you apply/create the AIStore cluster for the first time? Or do you see it when you upgrade/patch an existing cluster?

This error occurs often; for example, when initializing with 3 proxies and 3 targets, manually scaling the targets to 5 is very likely to trigger it. Once, when I deleted all Kubernetes resources and redeployed, all targets reported similar errors.

Question: Is there a way to manually add a target to the aistore cluster in a Kubernetes cluster? For example, if I increase the liveness timeout, can I manually fix a node's failure to join the cluster within that window?

alex-aizman commented 1 year ago

we are adding and removing targets all the time, it's part of the story since day one

alex-aizman commented 1 year ago

by way of background info:

saiprashanth173 commented 1 year ago

I guess that in your case the smap, i.e. cluster map, from your previous deployment is causing your target to fail.

One option would be to clean up the metadata directory /etc/ais (see) on each node before you deploy and test a new aistore cluster. It may also make sense to delete your old PV(C)s. Unfortunately, we don't have any helpers to do that on a GKE setup, so you might have to do it manually. Here is the ansible role we use for a baremetal setup, if you would like a reference.
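
A rough sketch of that cleanup, assuming the cluster/namespace names from this thread, an `aistores` custom resource, and hostPath metadata under the operator's HostpathPrefix - adjust all of these to your setup:

```bash
# Tear down the old cluster through the operator (CR name/namespace assumed).
kubectl -n ais delete aistores aistore1

# Delete leftover PVCs so new targets don't inherit stale metadata
# (select them by name or by whatever labels your deployment uses).
kubectl -n ais get pvc
kubectl -n ais delete pvc <pvc-names-from-the-list-above>

# On every K8s node, wipe the hostPath metadata directory, e.g. over SSH:
#   sudo rm -rf <HostpathPrefix>/<namespace>/<cluster-name>
```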

Currently, we mostly use GKE in our CI, where we set up the GKE cluster using Terraform, deploy aistore, and run tests, which also include scale-up (adding targets), scale-down (removing targets), etc. After we run all the tests, we destroy the entire cluster, so we never needed such utilities.

For reference, some test cases that run similar scenarios:
https://github.com/NVIDIA/ais-k8s/blob/master/operator/tests/integration/cluster_test.go#L164
https://github.com/NVIDIA/ais-k8s/blob/master/operator/tests/integration/cluster_test.go#L240

> Question: Is there a way to manually add a target to the aistore cluster in a Kubernetes cluster? For example, if I increase the liveness timeout, can I manually fix a node's failure to join the cluster within that window?

Please let us know if cleaning the metadata directory fixes your issue.

Could you elaborate on how you plan to manually fix the issue? In general, if your aistore cluster is being managed by the ais operator, I would prefer using the scale-up feature to add a new target instead of trying to add it manually.
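
For reference, scaling through the operator is just a change to the AIStore custom resource - a sketch assuming the CR exposes the target count at .spec.targetSpec.size (the field name may differ across operator versions):

```bash
# Grow the cluster from 3 to 5 targets by patching the CR; the operator
# reconciles the target statefulset and the new nodes join via the proxies.
kubectl -n ais patch aistores aistore1 --type merge \
  -p '{"spec":{"targetSpec":{"size":5}}}'
```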

yingca1 commented 1 year ago

Here's what I did and it seems like the problem hasn't occurred again:

  1. Upgraded GKE from version 1.25.8-gke.1000 (default) to 1.26.3-gke.1000. (I was originally planning to change the MTU, but it was mentioned that GKE 1.26.1 and later can inherit the MTU from the primary interface of the node.)

     I only upgraded the Kubernetes cluster version here; I did not change the MTU.

  2. Added a sysctls configuration in Terraform, following this guide; the commented-out entries are not supported by Terraform. (A verification sketch follows this list.)

    ```hcl
    ...
    node_config {
      linux_node_config {
        sysctls = {
          "net.core.somaxconn"                 = "100000"
          "net.ipv4.tcp_tw_reuse"              = "1"
          # "net.ipv4.ip_local_port_range"       = "2048 65535"
          # "net.ipv4.tcp_max_tw_buckets"        = "1440000"
          # "net.ipv4.ip_forward"                = "1"
          "net.core.rmem_max"                  = "268435456"
          "net.core.wmem_max"                  = "268435456"
          # "net.core.rmem_default"              = "25165824"
          "net.core.optmem_max"                = "25165824"
          "net.core.netdev_max_backlog"        = "250000"
          "net.ipv4.tcp_wmem"                  = "4096    12582912  268435456"
          "net.ipv4.tcp_rmem"                  = "4096    12582912 268435456"
          # "net.ipv4.tcp_adv_win_scale"         = "1"
          # "net.ipv4.tcp_mtu_probing"           = "2"
          # "net.ipv4.tcp_slow_start_after_idle" = "0"
          # "net.ipv4.tcp_low_latency"           = "1"
          # "net.ipv4.tcp_timestamps"            = "0"
          # "vm.vfs_cache_pressure"              = "50"
          # "net.ipv4.tcp_max_syn_backlog"       = "100000"
          # "net.ipv4.tcp_rfc1337"               = "1"
          # "vm.swappiness"                      = "10"
          # "vm.min_free_kbytes"                 = "262144"
        }
      }
      ...
    }
    ```
  3. Destroyed the old Kubernetes cluster and reconfigured it.
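
To confirm the node pool actually picked up these sysctls, one option is a quick check from a node debug pod - a sketch assuming kubectl debug node access and the node name from the log above:

```bash
# Spot-check a few of the values set in the Terraform block above;
# kubectl debug node/... runs in the host's namespaces, so these read
# the node's effective settings.
kubectl debug node/gke-ais1-ais1-5aa6ad88-p393 -it --image=busybox -- \
  sysctl net.core.somaxconn net.core.rmem_max net.core.wmem_max
```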

yingca1 commented 1 year ago

> Could you elaborate on how you plan to manually fix the issue? In general, if your aistore cluster is being managed by the ais operator, I would prefer using the scale-up feature to add a new target instead of trying to add it manually.

@saiprashanth173 To be honest, after creating a complete aistore cluster using the operator, I dumped all the resources of the ais-operator-system in Kubernetes into a YAML file. This allowed me to debug quickly by directly modifying the YAML resource file.
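
Roughly what that dump amounts to - a sketch using the namespace mentioned above; the exact resource kinds to include depend on what the operator creates in your setup:

```bash
# Snapshot the operator-managed resources for offline editing/debugging.
kubectl -n ais-operator-system get aistores,statefulsets,services,configmaps,pvc -o yaml > ais-dump.yaml
```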

yingca1 commented 1 year ago

This issue has come up again, but I think I've found the root cause.

TODO: Please wait for me to update here later


**Update**

To resolve the error in this issue, deleting the metadata in /etc/ais is clearly an effective method.

The root cause in my case was performing the following actions without deleting the Kubernetes cluster:

  1. Dynamically adding and removing Kubernetes nodes
  2. Deleting and redeploying AIStore
  3. Manually deleting the PV associated with a target pod

Any of these situations can easily cause this problem:

  1. A target node is necessary to form an AIStore cluster; proxy nodes alone cannot complete the cluster.
  2. When the proxy-1 pod is deleted and rebuilt, it may read metadata generated by other proxy nodes, such as proxy-2.
  3. When the target statefulset is deleted and redeployed, the PV is deleted as well - and the node_id of the target is stored in the PV.
node_id and smap info in my case:

```bash
root@aistore1-proxy-0:/etc/ais# ls -al
total 36
drwxr-xr-x 2 root root 4096 Jun 22 13:56 .
drwxr-xr-x 1 root root 4096 Jun 22 13:36 ..
-rw-r----- 1 root root  706 Jun 22 13:41 .ais.bmd
-rw-r----- 1 root root 1880 Jun 22 13:36 .ais.conf
-rw-r----- 1 root root    8 Jun 22 13:36 .ais.proxy_id
-rw-r----- 1 root root   59 Jun 22 13:56 .ais.rmd
-rw-r----- 1 root root 1034 Jun 22 13:36 .ais.smap
-rw-r--r-- 1 root root 2515 Jun 22 13:36 ais.json
-rw-r--r-- 1 root root  368 Jun 22 13:36 ais_local.json

root@aistore1-proxy-0:/var/ais_env# ls -al
total 16
drwxr-xr-x 2 root root 4096 Jun 22 13:36 .
drwxr-xr-x 1 root root 4096 Jun 22 13:36 ..
-rw-r--r-- 1 root root   27 Jun 22 13:36 env

root@aistore1-target-1:/etc/ais# ls -al
total 24
drwxr-xr-x 2 root root 4096 Jun 22 13:36 .
drwxr-xr-x 1 root root 4096 Jun 22 13:36 ..
-rw-r----- 1 root root 1880 Jun 22 13:36 .ais.conf
-rw-r----- 1 root root 1034 Jun 22 13:36 .ais.smap
-rw-r--r-- 1 root root    0 Jun 22 13:36 ais.db
-rw-r--r-- 1 root root 2515 Jun 22 13:36 ais.json
-rw-r--r-- 1 root root  393 Jun 22 13:36 ais_local.json

root@aistore1-target-1:/etc/ais# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1      259:0    0  100G  0 disk
|-nvme0n1p1  259:1    0 99.9G  0 part /var/ais_config/ais.json
|                                     /var/ais_config/ais_liveness.sh
|                                     /var/ais_config/ais_readiness.sh
|                                     /var/ais_config
|                                     /dev/termination-log
|                                     /var/statsd_config
|                                     /etc/ais
|                                     /var/ais_env
|                                     /etc/hostname
|                                     /etc/resolv.conf
|                                     /etc/hosts
|-nvme0n1p14 259:2    0    4M  0 part
`-nvme0n1p15 259:3    0  106M  0 part
nvme0n2      259:4    0   10G  0 disk /ais1

root@aistore1-target-1:/ais1# ls -al
total 44
drwxr-xr-x 5 root root  4096 Jun 22 13:41 .
drwxr-xr-x 1 root root  4096 Jun 22 13:36 ..
-rw-r----- 1 root root   706 Jun 22 13:41 .ais.bmd
-rw-r----- 1 root root   109 Jun 22 13:36 .ais.bmd.prev
drwxr-x--- 2 root root  4096 Jun 22 13:56 .ais.markers
-rw-r----- 1 root root   204 Jun 22 13:31 .ais.vmd
drwxr-x--- 3 root root  4096 Jun 22 13:41 @gcp
drwx------ 2 root root 16384 Jun 22 13:30 lost+found

root@aistore1-target-1:/ais1# cat .ais.vmd
aistore�J�<�tt"Md@���{"version":"1","mountpaths":{"/ais1":{#":�a,"fs":"/dev/nvme0n2","fs_type":"ext4","fs_id":"-210104127,-1438059594","enabled":true}},"daemon_id":"iLXaCcQh"}
```

The operator deploys both proxies and targets with a state mount that caches this metadata: each pod's /etc/ais directory is mounted on the Kubernetes node at path.Join(ais.Spec.HostpathPrefix, ais.Namespace, ais.Name, daeType) - see the proxy and target statefulset definitions in the operator.
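
In other words, on a given K8s node the cached state for this thread's cluster would land in directories like these - assuming a HostpathPrefix of /etc/ais, with namespace ais and cluster name aistore1 substituted into the expression above:

```bash
# Per-daemon-type state directories produced by
# path.Join(ais.Spec.HostpathPrefix, ais.Namespace, ais.Name, daeType):
ls -al /etc/ais/ais/aistore1/proxy    # proxy state (.ais.smap, .ais.proxy_id, ...)
ls -al /etc/ais/ais/aistore1/target   # target state (.ais.smap, ...)
```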

saiprashanth173 commented 1 year ago

> @saiprashanth173 To be honest, after creating a complete aistore cluster using the operator, I dumped all the resources of the ais-operator-system in Kubernetes into a YAML file. This allowed me to debug quickly by directly modifying the YAML resource file.

I am a bit curious how fixing the YAML resources fixed this issue. From the error log you shared, this seems to be a storage target issue and has less to do with the Kubernetes resources themselves.

```
E 19:59:43.963996 htrun.go:429 FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
```

It would be helpful if you could also share what you had to change in the YAMLs to be able to debug and fix this issue. We can use that to decide whether we need to fix/extend the K8s operator.

> This issue has come up again, but I think I've found the root cause.
>
> TODO: Please wait for me to update here later

That is amazing! Will wait for your update :slightly_smiling_face:. Thanks!

alex-aizman commented 1 year ago

Maybe a little bit more info on the original error, though I'm not sure you need it at this point:

```
I 19:59:42.787654 daemon.go:175 Version 3.17.66420ae, build time 2023-06-08T21:39:50+0000, debug false
```

First, there's the version: it's the not-yet-released v3.18, which is in progress, changing fast, and a bit risky to use. We typically release aistore and the ais operator more or less at the same time, as a package.

Secondly, there's the question of persistence:

```
FATAL ERROR: [bad cluster map]: t[fHUWONyl] is not present in the loaded Smap v109[klneDAlWs, p[oJDFAKuH], t=4, p=3]
```

On the one hand, the cluster map (aka Smap) is stored on the host under /etc/ais. On the other hand, the target ID is stored at the root of each mountpath on the data drives that Kubernetes calls "persistent volumes". One would hope they are persistent enough and have the right affinity, etc. Something to maybe check.
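
One quick way to see those two pieces of state side by side - a sketch assuming GNU grep in the container and the /ais1 mountpath shown earlier in this thread; the packed metadata isn't meant to be read this way, it just happens to contain the JSON:

```bash
# The target ID recorded in the mountpath's VMD; compare it against the
# t[...] ID in the FATAL ERROR line and the targets in the loaded Smap.
kubectl -n ais exec aistore1-target-2 -- grep -ao '"daemon_id":"[^"]*"' /ais1/.ais.vmd
```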