NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Can't deploy cluster in k8s #187

Open · liuzhiyuan562 opened this issue 1 week ago

liuzhiyuan562 commented 1 week ago

I ran the following command from ~/kubernetes/kubespray/ais-k8s (inside the kubespray-venv virtualenv):

ansible-playbook -i playbooks/hosts.ini playbooks/ais-deployment/ais_deploy_cluster.yml -K -e cluster=ais

It fails with an error, and I can't find a solution. [screenshot]

My /tmp/ais.yaml is attached below (I renamed it to ais.log because .yaml attachments aren't allowed): ais.log

aaronnw commented 4 days ago

Please check that you are using the latest operator and aisnode versions. stateStorageClass was added in operator v1.2.0 (https://github.com/NVIDIA/ais-k8s/releases/tag/v1.2.0). It looks like your current operator does not have the latest AIS custom resource definition (CRD).
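A quick way to tell whether the CRD installed in your cluster already knows about stateStorageClass is to grep its schema. This is a minimal sketch, assuming the AIS CRD is named aistores.ais.nvidia.com (override AIS_CRD if yours differs); it is a plain text search, not a schema parse, and the function name check_ais_crd is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the installed AIStore CRD includes stateStorageClass.
# Assumption: the CRD is named aistores.ais.nvidia.com; set AIS_CRD otherwise.
AIS_CRD=${AIS_CRD:-aistores.ais.nvidia.com}

# check_ais_crd: exit status 0 if the CRD schema mentions stateStorageClass.
check_ais_crd() {
  if kubectl get crd "$AIS_CRD" -o yaml | grep -q 'stateStorageClass'; then
    echo "CRD $AIS_CRD supports stateStorageClass"
  else
    # Either the CRD is missing or it predates operator v1.2.0.
    echo "CRD $AIS_CRD lacks stateStorageClass; upgrade the operator (>= v1.2.0)" >&2
    return 1
  fi
}
```

If the check fails, upgrading the operator (which ships the newer CRD) should be the first step before re-running the deploy playbook.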

aaronnw commented 4 days ago

Alternatively, hostPathPrefix is still supported (though deprecated) for AIS node state storage. With this option, node state is stored on a hostPath you specify rather than on a dynamically provisioned local volume. Note that stateStorageClass takes precedence if both are set.

liuzhiyuan562 commented 3 days ago

Thanks! I ran into another problem when setting up a debugging pod: [screenshots]

liuzhiyuan562 commented 3 days ago

I then changed the command in aisnode_debug.yaml: [screenshot]

It still fails: [screenshot]

aaronnw commented 2 days ago

We recently switched to distroless images for aisnode, so they no longer include built-in utilities like tail, or even a shell. Assuming your host has connectivity to the cluster, you can install the ais CLI or use the Python SDK directly, without a debug pod, and point AIS_ENDPOINT at your k8s node.
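Setting AIS_ENDPOINT from the host might look like the sketch below. It assumes the AIS proxy is reachable on the first node's InternalIP at port 51080 (a common AIS proxy port, but check your own cluster config and override AIS_PROXY_PORT if needed); the helper name ais_endpoint is hypothetical.

```shell
#!/bin/sh
# Sketch: build an AIS_ENDPOINT URL for the ais CLI from a k8s node address.
# Assumptions: the proxy listens on the first node's InternalIP, and the port
# defaults to 51080 unless AIS_PROXY_PORT is set.
ais_endpoint() {
  node_ip=$(kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
  if [ -z "$node_ip" ]; then
    echo "no InternalIP found for the first node" >&2
    return 1
  fi
  echo "http://${node_ip}:${AIS_PROXY_PORT:-51080}"
}

# Usage:
#   export AIS_ENDPOINT="$(ais_endpoint)"
#   ais show cluster
```

Using the node address directly avoids needing a shell inside the distroless aisnode image at all.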

We'll need to update the docs regarding aisnode-debug, so thanks for bringing that up!

liuzhiyuan562 commented 2 days ago

Thank you very much for your answer!

liuzhiyuan562 commented 2 days ago

I hit another problem when deploying the AuthN service. From ~/kubernetes/kubespray/ais-k8s I ran:

ansible-playbook -i playbooks/hosts.ini playbooks/ais-deployment/ais_deploy_authn.yml -K -e cluster=ais

I don't know why the signature is invalid. [screenshot]

gaikwadabhishek commented 18 hours ago

Hey @liuzhiyuan562

I think the signing keys for AIStore and the AuthN server are different.

Check your signing key:

$ kubectl get secret jwt-signing-key -n ais -o jsonpath="{.data.SIGNING-KEY}" | base64 --decode ; echo
aBitLongSecretKey

Paste your token into https://jwt.io/ and verify the signature with your signing key (the decoded content/payload should match exactly).
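If you'd rather not paste a token into a website, the same signature check can be done locally. This is a minimal sketch, assuming the token is signed with HMAC-SHA256 ("HS256") and that the decoded jwt-signing-key value is used as-is as the HMAC key; it needs openssl and base64 on the host, and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: verify a JWT's HS256 signature offline, as jwt.io would.
# Assumption: tokens are signed with HMAC-SHA256 using the raw signing key.

# b64url: base64url-encode stdin without padding (the encoding JWTs use).
b64url() {
  base64 | tr -d '\n=' | tr '/+' '_-'
}

# verify_hs256_jwt TOKEN KEY: exit status 0 if TOKEN's signature matches KEY.
verify_hs256_jwt() {
  token=$1
  key=$2
  signing_input=${token%.*}  # "<header>.<payload>"
  signature=${token##*.}     # "<signature>"
  expected=$(printf '%s' "$signing_input" \
    | openssl dgst -sha256 -hmac "$key" -binary | b64url)
  [ "$signature" = "$expected" ]
}

# Example (AIS_TOKEN is a hypothetical variable holding your token):
#   KEY=$(kubectl get secret jwt-signing-key -n ais -o jsonpath="{.data.SIGNING-KEY}" | base64 --decode)
#   verify_hs256_jwt "$AIS_TOKEN" "$KEY" && echo "signature OK" || echo "signature mismatch"
```

A mismatch here with the key from the ais namespace secret suggests AuthN is signing with a different key than the one AIStore is configured to trust.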

When deploying, did you specify in the AIStore custom resource that your JWT signing key is in this secret? To verify:

kubectl describe pod ais-proxy-0 -n ais | grep jwt

You should see the secret mounted as an environment variable.

gaikwadabhishek commented 16 hours ago

I have added a fix for the aisnode-debug pod as well