NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com

AIStore Prod/K8s Clustering Implementation and Issues #139

Closed · bboychev closed this 1 year ago

bboychev commented 1 year ago

Dear AIStore Team,

Thank you very much for your work on developing the AIStore product. It looks great! I have been interested in it for some time and am investigating its capabilities. Following the documentation and ais-k8s, I have successfully deployed two separate "Large-scale production deployment (K8s)" AIStore clusters, version 3.18.7081e29, on Ubuntu 22.04 virtual machines with K8s v1.25.2. I used aistorage/ais-operator:0.94 for this, as I think I read somewhere there that this is the method to use; the other Docker images are aistorage/ais-init:latest and aistorage/aisnode:3.18. I am using the latest flannel as the K8s network plugin, if that matters.

The deployment went fine, except that I needed to patch the livenessProbe and readinessProbe of the aistore-proxy StatefulSet as follows:

```shell
kubectl -n ais patch statefulset/aistore-proxy -p '{"spec": {"template": {"spec": {"containers": [{"name": "ais-node","livenessProbe": {"timeoutSeconds": 10},"readinessProbe": {"timeoutSeconds": 10}}]}}}}'
```

I have deployed just one AIS proxy and one AIS target per K8s cluster, along with the default ingress-nginx ingress controller and the following Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
  labels:
    app: aistore
    component: proxy
    function: gateway
  name: aistore-proxy
  namespace: ais
spec:
  rules:
  - host:
    http:
      paths:
      - backend:
          service:
            name: aistore-proxy
            port:
              number: 51080
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    secretName: example-tls-cert
```
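As a sanity check (the hostname below is a placeholder for my actual ingress host), a request like the following should confirm that the ingress reaches the proxy; as far as I can tell, /v1/health is the same endpoint the probes use:

```shell
# Placeholder hostname; substitute the real host from the TLS certificate.
# An HTTP 200 response means the request made it through ingress-nginx
# to the AIS proxy's health endpoint.
curl -k https://ais.example.com/v1/health
```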

So, I am trying to build a production-ready deployment, and I am investigating AIStore's scaling and clustering capabilities following the available documentation.

Please correct me if I am wrong, but I see the following options to scale up the deployment (see also the sketch after this list):

1. Increase storage size: grow the configured disks and extend the filesystems. Are there other ways to achieve this by adding additional disk(s), e.g. using Linux LVM?
2. Remote-attach another cluster using `ais cluster remote-attach ...`.
3. Join a proxy/target node using `ais cluster add-remove-nodes ...`.
4. Join a new K8s worker (with additional disks) into the existing K8s cluster.
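For scaling out nodes specifically, my working assumption (unverified; the CR name and field below are guesses and may differ between ais-operator versions) is that the node count should be changed on the AIStore custom resource and left to the operator, rather than by editing the StatefulSets directly:

```shell
# Assumption: the AIStore CR is named "aistore" in namespace "ais" and exposes
# a spec.size field; inspect the actual schema first:
kubectl -n ais get aistores -o yaml

# Ask the operator to scale the cluster from 1 node to 2.
kubectl -n ais patch aistore/aistore --type merge -p '{"spec": {"size": 2}}'
```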

The two virtual machines, aisbox-3 and aisbox-4, are in the same network segment (192.168.121.0/24), with IP address 192.168.121.185 for the aisbox-3 VM and 192.168.121.154 for the aisbox-4 VM. The logical networks, I suppose, are:

My tests showed the following, and hence my follow-up questions regarding scalability (a sketch of the command shapes for items 2 and 3 follows this list):

1. Increase storage size: works fine. I am using LVM, so that part is clear, and it would probably also work with standard disk devices. I suppose I need to increase the PV and PVC capacity as well. Are there any other options to increase the storage size for new objects in an AIStore K8s deployment (I do not need new replica disks)?
2. Remote attach using `ais cluster remote-attach ...`: the command works in general, but I was not able to GET/PUT an object on a remote bucket, so from a functionality point of view it failed; see the attached markdown file. Do we need some special K8s overlay network configuration? ais_prod_k8s_playground_remote-attach_tests.md
3. Join a proxy/target node using `ais cluster add-remove-nodes ...`: this failed for me. I tried to join a proxy node on the aisbox-3 VM to the one available under aisbox-4 (through the ingress controller), and it failed; see the attached markdown file. Do we need some special K8s overlay network configuration, or special AIStore intra-cluster control / intra-cluster data configuration? ais_prod_k8s_playground_join_tests.md
4. I have not actually tested joining a new K8s worker (with additional disks), as I am not sure what I need to run on it: an AIS proxy, a target, or both. I do not want new replicas of existing objects on that worker, only the possibility to store new objects, so I am a little confused about how to implement this.
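For reference, the shapes of the commands I used for items 2 and 3 were roughly as follows (the alias, hostnames, ports, and bucket/object names are placeholders, not my exact values; the full transcripts are in the attached markdown files):

```shell
# Item 2: attach the cluster running under aisbox-4 under an alias.
ais cluster remote-attach remais=https://aisbox-4.example.com:51080

# Access remote buckets through the @alias syntax.
ais ls ais://@remais
ais object get ais://@remais/mybucket/myobject /tmp/myobject

# Item 3: try to join a node from one cluster to the other (the step that failed).
ais cluster add-remove-nodes join --role=proxy aisbox-3.example.com:51080
```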

So, could you please help me understand where the problems in this Prod/K8s setup are? (The results raised major concerns for me regarding AIStore's Prod/K8s clustering and scalability capabilities.)

Best Regards, bboychev

alex-aizman commented 1 year ago
```
Error: p[eRWCTOWd]: duplicate IPs: p[eRWCTOWd] and p[KHWaDVhn] share the same "aistore-proxy-0.aistore-proxy.ais.svc.cluster.local:51081"
```

Each node in a cluster must have its own IP:port. Deployment error. Closing.
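In other words, the second proxy joined with a DNS name:port that already belongs to the first cluster's proxy. One way to spot this (a sketch; exact output formatting varies by version) is to check that every node in the cluster map advertises a distinct endpoint:

```shell
# Every proxy and target in the Smap must have a unique public (and
# intra-cluster) IP:port; two entries sharing
# aistore-proxy-0.aistore-proxy.ais.svc.cluster.local:51081 is the
# deployment error reported above.
ais show cluster smap
```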