NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Terraform Kubernetes deployment on GCP unable to start aistorage/aisnode:latest image #151

Closed: Hastyrush closed this issue 11 months ago

Hastyrush commented 1 year ago

Hello,

I was experimenting with the ais-k8s repository and encountered an issue when trying to deploy AIS as pods in Kubernetes on GCP. Since the ais-k8s repository does not have issue tracking, I am posting it here.

After successfully deploying k8s with the script (deploy.sh k8s), I proceeded to deploy the AIS pods with the same script:

    deploy.sh ais --expose-external --aisnode-image=aistorage/aisnode:latest --admin-image=aistorage/admin:latest

The admin pod deployed successfully, but the aisnode pod shows a status of CrashLoopBackOff.

By inspecting the logs of the container using kubectl logs demo-ais-proxy-0, I get the following errors:

    Defaulted container "ais" out of: ais, populate-env (init)
    aisnode proxy container startup at Fri Aug 25 09:32:36 UTC 2023
    '/var/ais_config/ais.json' -> '/etc/ais/ais.json'
    '/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
    '/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
    /ais_docker_start.sh: line 13: node: command not found
    No cached .ais.smap
    aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -allow_shared_no_disks=false -ntargets=2
    E 09:32:38.777314 daemon:151 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: , error found in #10 byte of ...|rol": "", "p|..., bigger context ...| "51080", "port_intra_control": "", "port_intra_data": "" } }|...
    FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: , error found in #10 byte of ...|rol": "", "p|..., bigger context ...| "51080", "port_intra_control": "", "port_intra_data": "" } }|...
    cat: /var/log/ais/aisnode.INFO: No such file or directory
    cat: /var/log/ais/aisnode.ERROR: No such file or directory
    cat: /var/log/ais/aisnode.WARNING: No such file or directory

With the default admin and aisnode images (3.4), the pods run successfully. Is there a mistake in my deployment, or are the new images not yet compatible with k8s? Thanks in advance!

aaronnw commented 1 year ago

Thanks for posting the issue!

The problem you're seeing is due to the empty values in the config for:

"port_intra_control": "",
"port_intra_data": ""

In a typical deployment these are set here via environment variables and have defaults if those are not set:

https://github.com/NVIDIA/aistore/blob/cad5b960d40336490a21f447c0a51216379164a5/deploy/dev/local/aisnode_config.sh#L188C10-L188C10

    "host_net": {
        "hostname":                 "${HOSTNAME_LIST}",
        "hostname_intra_control":   "${HOSTNAME_LIST_INTRA_CONTROL}",
        "hostname_intra_data":      "${HOSTNAME_LIST_INTRA_DATA}",
        "port":               "${PORT:-8080}",
        "port_intra_control": "${PORT_INTRA_CONTROL:-9080}",
        "port_intra_data":    "${PORT_INTRA_DATA:-10080}"
    },
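
For illustration, the fallback behaviour of that template can be reproduced in a few lines of shell. This is only a minimal sketch, not the actual aisnode_config.sh: render_host_net is a hypothetical helper, and it renders just the three port keys quoted above. The point is that ${VAR:-default} substitutes the default both when the variable is unset and when it is set to the empty string.

```shell
#!/bin/sh
# render_host_net is a hypothetical helper (not part of aistore).
# It prints the port keys using the same ${VAR:-default} expansions
# as the template quoted above.
render_host_net() {
    cat <<EOF
"host_net": {
    "port":               "${PORT:-8080}",
    "port_intra_control": "${PORT_INTRA_CONTROL:-9080}",
    "port_intra_data":    "${PORT_INTRA_DATA:-10080}"
}
EOF
}

# Unset variables fall back to the defaults.
( unset PORT PORT_INTRA_CONTROL PORT_INTRA_DATA; render_host_net )

# An empty value also falls back, because ":-" treats empty like unset.
( PORT_INTRA_DATA=""; render_host_net )

# An explicit value overrides the default.
( PORT=51080; render_host_net )
```

A generator that writes the variables into the JSON without such a fallback would emit "" for any port that is not set, which is the shape seen in the failing ais_local.json.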

However, our Terraform scripts still use the somewhat outdated helm scripts at https://github.com/NVIDIA/ais-k8s/tree/master/helm/ais, which don't set those values when generating the config, hence the parsing error.
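
A quick way to confirm this on a rendered config is to search it for port keys with empty values. The snippet below is a rough sketch under stated assumptions: find_empty_ports is a hypothetical helper, the sample file is modeled on the fragment quoted in the error message (the "port" key name for the 51080 value is assumed), and it relies on grep -o, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# find_empty_ports is a hypothetical helper: it prints every "port*" key
# whose value is the empty string, and returns non-zero if there is none.
find_empty_ports() {
    grep -o '"port[a-z_]*"[[:space:]]*:[[:space:]]*""' "$1"
}

# Sample modeled on the fragment quoted in the error message.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{ "host_net": { "port": "51080", "port_intra_control": "", "port_intra_data": "" } }
EOF

if find_empty_ports "$sample"; then
    echo "empty port values found; aisnode would fail to parse this file"
fi
rm -f "$sample"
```

Against a real pod, the same function could be pointed at a copy of /etc/ais/ais_local.json taken from the container.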

Since you've already got the k8s cluster running, I would suggest deploying directly with the k8s operator (https://github.com/NVIDIA/ais-k8s/blob/master/operator/README.md), as that is fully compatible with more recent versions.

Hastyrush commented 12 months ago

Hello,

Thanks for the clarification.

I went ahead and tried to deploy the Kubernetes operator after using the Terraform script to deploy Kubernetes (./deploy.sh k8s).

I used the command IMG=aistore/ais-operator:latest make deploy from the documentation and managed to deploy the cert-manager as well as the ais-operator-system namespace. However, when running kubectl apply -f config/samples/ais_v1beta1aistore.yaml -n ais-operator-system, the container images are unable to start. I will paste the errors encountered below.

    kubectl logs -n ais-operator-system aistore-sample-proxy-0
    Defaulted container "ais-node" out of: ais-node, populate-env (init)
    aisnode proxy container startup at Wed Aug 30 08:02:49 UTC 2023
    '/var/ais_config/ais.json' -> '/etc/ais/ais.json'
    '/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
    '/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
    /ais_docker_start.sh: line 13: node: command not found
    No cached .ais.smap
    aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -allow_shared_no_disks=false -ntargets=1
    E 08:02:51.405993 daemon:151 FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.Mirror: cmn.MirrorConf.ReadObject: found unknown field: util_thresh, error found in #10 byte of ...|il_thresh":0,"burst|..., bigger context ...|{"backend":null,"mirror":{"copies":2,"util_thresh":0,"burst_buffer":512,"optimize_put":false,"enable|...
    FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.Mirror: cmn.MirrorConf.ReadObject: found unknown field: util_thresh, error found in #10 byte of ...|il_thresh":0,"burst|..., bigger context ...|{"backend":null,"mirror":{"copies":2,"util_thresh":0,"burst_buffer":512,"optimize_put":false,"enable|...
    cat: /var/log/ais/aisnode.INFO: No such file or directory
    cat: /var/log/ais/aisnode.ERROR: No such file or directory
    cat: /var/log/ais/aisnode.WARNING: No such file or directory

I get the same error for the ais-sample-target-0 pod as well. It seems the 'util_thresh' field in the JSON config file is not recognized. Am I using the wrong image? The ais_v1beta1aistore.yaml I used is attached as a screenshot.

Thanks a lot for the help!

alex-aizman commented 12 months ago

We just released aistore v3.19 and operator v0.9.5. Maybe try again and let us know.

Hastyrush commented 11 months ago

Hello, thanks for the reply.

I tried pulling the latest aistore node and operator versions, but I'm still getting the same error, 'unknown field: util_thresh' (screenshot attached).

This is the config file that I used (screenshot attached).

alex-aizman commented 11 months ago

This util_thresh knob was removed almost 18 months ago:

By that time, we had already stopped using helm and transitioned to the Kubernetes operator. I've just (belatedly!) added a note and a warning at https://github.com/NVIDIA/ais-k8s.
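
For anyone hitting the same error, a stale generated config can be spotted before the pod starts by searching it for the removed key. This is a rough sketch only: has_key is a hypothetical helper, the sample JSON is modeled on the fragment quoted in the error message and closed off by hand, and the check is a plain-text search rather than a real JSON parse.

```shell
#!/bin/sh
# has_key is a hypothetical helper: succeeds if the file contains the
# given name followed by a colon, i.e. used as a JSON key.
# It is a text search, so it does not understand nesting.
has_key() {
    grep -q "\"$2\"[[:space:]]*:" "$1"
}

# Sample modeled on the "bigger context" fragment in the error message.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{"backend":null,"mirror":{"copies":2,"util_thresh":0,"burst_buffer":512,"optimize_put":false}}
EOF

if has_key "$cfg" util_thresh; then
    echo "stale config: util_thresh is still present"
fi
rm -f "$cfg"
```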

Hastyrush commented 11 months ago

Got it, thanks!