NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License
1.23k stars 164 forks source link

Encountered an issue while upgrading the aisnode version on k8s cluster(GCP) #97

Closed KavyaPuranik closed 2 years ago

KavyaPuranik commented 2 years ago

Hi,

I was following https://github.com/NVIDIA/ais-k8s terraform style of deploying the ais application on to k8s cluster. I initially deployed the cluster with 3.4 version, which was successful & then when I tried to upgrade it to 3.8 version(using the same script), the pods went to crashloopbackoff status. I had to manually clean the state & config files to make it work for 3.8 version. Any inputs on how to resolve this issue without having to clean the files manually?

saiprashanth173 commented 2 years ago

Hi @KavyaPuranik,

May I get some details on your use-case?

I am assuming you are using https://github.com/NVIDIA/ais-k8s/blob/master/terraform/deploy.sh script to bring up a K8s cluster on GCP, followed by AIS deployment. This script is a bit outdated and uses deprecated helm charts to deploy AIStore. Preferred way to deploy AIStore is to use the operator approach https://github.com/NVIDIA/ais-k8s/tree/master/operator. AIStore v3.4 and v3.8 are both deprecated and we don't support these versions anymore. We suggest you use release v3.10.

If you would like to deploy AIStore v3.10 version on GCP, you can try running:

$ ./deploy.sh k8s YOUR-ARGS
$ kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole=cluster-admin \
    --user=$(gcloud config get-value core/account)
$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.8.2/cert-manager.yaml
# Verify cert-manager pods are in Running state. Ref: https://cert-manager.io/docs/installation/kubectl/#prerequisites for more information.  
$ kubectl get pods --namespace cert-manager
# Deploy operator
$ kubectl apply -f https://github.com/NVIDIA/ais-k8s/releases/download/v0.92/ais-operator.yaml
# Ensure ais operator is running
$ kubectl get pods --namespace ais-operator-system

Once you have your operator pod running, you can apply the AIStore cluster manifest. You may use sample manifest https://github.com/NVIDIA/ais-k8s/blob/master/operator/config/samples/ais_v1beta1_aistore.yaml as reference and update the cluster name, size and any other specs as applicable.

apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: <NAME> 
spec:
  # Add fields here
  size: <NUMBER-OF-TARGETS>
...
  targetSpec:
            …
    mounts:
      - path: "/ais1"
         size: <SIZE OF VOLUME>
           <LIST OF MORE VOLUMES> 

Please let me know if the operator approach and deploying v3.10 works for you. We will update the docs and scripts to use the operator approach.

KavyaPuranik commented 2 years ago

Hi @saiprashanth173,

I was able to successfully deploy the ais-node 3.10v using AIS-operators. Initially I wasn't aware that helm charts were outdated. So deployed the default version(3.4) and then tried to upgrade it, which caused the above mentioned issue.

saiprashanth173 commented 2 years ago

Thanks for checking @KavyaPuranik.

Closing this ticket. The docs in https://github.com/NVIDIA/ais-k8s repo will be updated to reflect latest releases.