NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

AIStore Dev Clustering Implementation and Issues #138

Closed bboychev closed 1 year ago

bboychev commented 1 year ago

Dear AIStore Team,

Thank you very much for your efforts in developing AIStore. It looks great! I have been interested in it for some time and am investigating its capabilities. Following the documentation, I have successfully deployed two separate Local Playground (Dev) AIStore clusters, version 3.18.7081e29, on Ubuntu 22.04 virtual machines. I am now investigating AIStore's scaling and clustering capabilities, following the available documentation:

Please correct me if I am wrong, but I see these options for scaling up the deployment:

1) Increase storage size: grow the configured disks and extend the filesystems. Are there other ways to achieve this by adding disks (e.g. using Linux LVM)?
2) Remote attach, using ais cluster remote-attach ...
3) Join a proxy/target node, using ais cluster add-remove-nodes ...

The two virtual machines, aisbox-1 and aisbox-2, are in the same network segment. The logical networks are:

My tests showed the following, which leads to my follow-up questions regarding scalability:

1) Increase storage size: works fine. I am using LVM, so that is pretty clear, and it will probably work with standard disk devices as well. Are there any other options to increase the storage available for new objects (I do not need new replica disks)?
2) Remote attach using ais cluster remote-attach ... works fine for me, as these VMs are in the same network segment. No questions here.
3) Join proxy/target node using ais cluster add-remove-nodes ... failed for me. I tried to attach proxy and target nodes on the aisbox-1 VM using the ones available on aisbox-2, and it failed; see the attached markdown file: ais_dev_local_playground_join_tests.md

So, could you please help me understand where the problems are in this Dev setup? (The results raised major concerns for me about AIStore's Dev clustering implementation and scalability capabilities.)

Best Regards, bboychev

alex-aizman commented 1 year ago
FATAL ERROR: [cluster integrity error cie#40, for troubleshooting see https://github.com/NVIDIA/aistore/blob/master/docs/troubleshooting.md]: BMDs have different UUIDs: (t[ZKtt8089], BMD v2[WLwxcIZQN, buckets: ais(1), cloud(0)]) vs (p[MEUp8080], BMD v2[A0iUOQKn8x, buckets: ais(1), cloud(0)])

In other words, you are trying to join a node that is (or was) a member of the cluster with UUID=WLwxcIZQN to another cluster that has a different unique ID, A0iUOQKn8x. That's not allowed for obvious reasons.
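
To make the mismatch easier to see, here is a minimal Python sketch that pulls the node IDs and BMD UUIDs out of the error text quoted above and compares them. It is an illustration only, not AIStore code: the regular expression and the helper names bmd_uuids and same_cluster are hypothetical, and the pattern assumes the message layout shown in this one error.

```python
import re

# Error text copied from the FATAL ERROR above (prefix and URL omitted).
ERROR = (
    "BMDs have different UUIDs: "
    "(t[ZKtt8089], BMD v2[WLwxcIZQN, buckets: ais(1), cloud(0)]) vs "
    "(p[MEUp8080], BMD v2[A0iUOQKn8x, buckets: ais(1), cloud(0)])"
)

# Hypothetical pattern for "(<node>, BMD v<version>[<uuid>, ..."
# It is an assumption based on this single message, not a documented format.
PAIR = re.compile(r"\(([tp]\[\w+\]), BMD v(\d+)\[(\w+),")


def bmd_uuids(message):
    """Return (node, bmd_version, bmd_uuid) tuples found in the message."""
    return [(node, int(ver), uuid) for node, ver, uuid in PAIR.findall(message)]


def same_cluster(message):
    """True only when all BMDs named in the message share one UUID."""
    return len({uuid for _, _, uuid in bmd_uuids(message)}) == 1


found = bmd_uuids(ERROR)
for node, ver, uuid in found:
    print(f"{node}: BMD v{ver}, UUID {uuid}")
print("same cluster:", same_cluster(ERROR))
```

For the message above, the target t[ZKtt8089] carries UUID WLwxcIZQN while the proxy p[MEUp8080] carries A0iUOQKn8x, so the comparison reports that the two do not belong to the same cluster.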

bboychev commented 1 year ago

@alex-aizman: Does that mean that we cannot join a proxy/target node from one cluster to another cluster? In that case, what is the reason for providing 192.168.121.154:8089 in a command such as ais cluster add-remove-nodes join --role=target 192.168.121.154:8089? Should both clusters have the same UUID (and how would I achieve that)? I am pretty confused about how it is supposed to work.

alex-aizman commented 1 year ago

Each cluster has its own UUID and its own set of nodes. Closing.