NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

DNS name in DirectURL in smap #70

Closed. and-1 closed this issue 3 years ago.

and-1 commented 3 years ago

Hi everyone. What about changing the electable proxy manifest to a StatefulSet and using the DNS name of the pod instead of the IP in the DirectURL in the Smap? An IP is an ephemeral resource in Kubernetes.

VirrageS commented 3 years ago

Hey!

Thanks for the proposal, but we would need more information about the problem it would solve. Could you expand your idea a little bit more? Do you have any problems with the current way of deployment?

Here are some of our thoughts about this:

Thanks! Janusz

and-1 commented 3 years ago

I experienced a few problems:

  1. Simultaneous shutdown/restart of all electable proxies: all proxy IPs changed and the target nodes couldn't connect to them.
  2. I tried using proxy ID = pod name (StatefulSet). I restarted one proxy and, after it came back up, ran ais show cluster smap; it took a long time to return a result. In the proxy log I saw timeouts connecting to the old IP of the restarted proxy. The old IP is not present in the Smap, so I suspect some kind of cache.

If we change the proxy manifest to a StatefulSet, we will get predictable pod names and direct DNS names for the pods. Also, we can determine the primary node in the bootstrap phase based on the pod index instead of using a node label. It's more convenient.
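A minimal sketch of what "primary from pod index" could look like, assuming a StatefulSet governed by a headless Service named `ais-proxy` and the pod name/namespace injected via the downward API (these names, the port, and the environment variables are assumptions for illustration, not AIStore's actual configuration):

```go
// Derive a stable per-pod DNS name and the primary role from the StatefulSet
// pod ordinal. Pod names in a StatefulSet end with a stable ordinal:
// ais-proxy-0, ais-proxy-1, ...
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	podName := os.Getenv("POD_NAME")        // e.g. "ais-proxy-0", via the downward API (assumed)
	namespace := os.Getenv("POD_NAMESPACE") // e.g. "ais" (assumed)
	headlessSvc := "ais-proxy"              // assumed headless Service governing the StatefulSet

	// Extract the ordinal from the pod name suffix.
	idx := podName[strings.LastIndex(podName, "-")+1:]
	ordinal, err := strconv.Atoi(idx)
	if err != nil {
		fmt.Fprintf(os.Stderr, "unexpected pod name %q: %v\n", podName, err)
		os.Exit(1)
	}

	// Stable per-pod DNS name provided by the headless Service.
	directURL := fmt.Sprintf("http://%s.%s.%s.svc.cluster.local:8080", podName, headlessSvc, namespace)

	// The pod with ordinal 0 bootstraps as the primary; no node label needed.
	isPrimary := ordinal == 0

	fmt.Printf("directURL=%s primary=%v\n", directURL, isPrimary)
}
```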

VirrageS commented 3 years ago

Simultaneous shutdown/restart of all electable proxies: all proxy IPs changed and the target nodes couldn't connect to them.

This is a limitation that we currently have: targets will not be able to function correctly if all the proxies die. For now, we don't have any logic on the target that would try to reconnect to or rediscover the new/restarted proxies.

I tried using proxy ID = pod name (StatefulSet). I restarted one proxy and, after it came back up, ran ais show cluster smap; it took a long time to return a result. In the proxy log I saw timeouts connecting to the old IP of the restarted proxy. The old IP is not present in the Smap, so I suspect some kind of cache.

This sounds like a bug. The node information should be updated on restart/rejoin of the proxy. We will look into it. Are you using any particular commit/tag?

Also, we can determine the primary node in the bootstrap phase based on the pod index instead of using a node label.

I think the node label was introduced because we specifically wanted to have the primary on some particular node, so it's kind of a "feature". But I agree: the less we require, the easier it is to set up.

and-1 commented 3 years ago

This is a limitation that we currently have: targets will not be able to function correctly if all the proxies die. For now, we don't have any logic on the target that would try to reconnect to or rediscover the new/restarted proxies.

Wouldn't moving to DNS names in DirectURL resolve this problem?

This sounds like a bug. The node information should be updated on restart/rejoin of the proxy. We will look into it. Are you using any particular commit/tag?

3.2.1

VirrageS commented 3 years ago

Wouldn't moving to DNS names in DirectURL resolve this problem?

It potentially could, but it's not obvious that this would work every time. The target tries to connect to the proxy that is next "in line" to become primary. But I'm not really sure what would happen if we killed all the proxies and only started some of them. Would the targets rejoin correctly? Probably not.

In general, the case when all proxies die is abnormal, and for now we don't have a clear solution for the targets. We need to think about it.

3.2.1

Oh, this would explain it. We recommend trying 3.3. We have tremendously improved the handling of various target and proxy restart cases, so this issue should be gone once you upgrade to that version. Let us know if you stumble upon any issues!

and-1 commented 3 years ago

Oh, this would explain it. We recommend trying 3.3. We have tremendously improved the handling of various target and proxy restart cases, so this issue should be gone once you upgrade to that version. Let us know if you stumble upon any issues!

I can't reproduce this behavior on 3.3. Thx!

It potentially could, but it's not obvious that this would work every time. The target tries to connect to the proxy that is next "in line" to become primary. But I'm not really sure what would happen if we killed all the proxies and only started some of them. Would the targets rejoin correctly? Probably not.

Suppose we use a StatefulSet and promote the proxy with index 0 (e.g. ais-proxy-0) to primary at bootstrap time. When all proxies go down and come back up, targets will try to connect to the next "in line" proxy and fall back to ais-proxy-0.

VirrageS commented 3 years ago

Suppose we use a StatefulSet and promote the proxy with index 0 (e.g. ais-proxy-0) to primary at bootstrap time. When all proxies go down and come back up, targets will try to connect to the next "in line" proxy and fall back to ais-proxy-0.

That would be exactly the idea. We just need to evaluate the whole implementation and work out the bits and pieces - it may take some time though :(. Thanks so much for the ideas and feedback!
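A rough sketch of the target-side failback being discussed, under assumed names: try the proxies known from the Smap and, if none answer, fall back to the well-known ais-proxy-0 DNS name, whose address survives restarts. The URLs, the health path, and the retry counts are illustrative assumptions, not the actual AIStore implementation.

```go
// Target-side rejoin sketch: probe known proxies, then fall back to the
// bootstrap primary (ais-proxy-0), whose DNS name is stable across restarts.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// tryJoin probes a proxy URL; any non-5xx HTTP response counts as "alive" here.
func tryJoin(client *http.Client, proxyURL string) bool {
	resp, err := client.Get(proxyURL + "/v1/health") // hypothetical health endpoint
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode < 500
}

func rejoin(proxyURLs []string, fallbackURL string) (string, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	for i := 0; i < 5; i++ { // a few rounds before giving up
		for _, u := range proxyURLs {
			if tryJoin(client, u) {
				return u, nil
			}
		}
		// None of the known proxies answered; try the bootstrap primary.
		if tryJoin(client, fallbackURL) {
			return fallbackURL, nil
		}
		time.Sleep(2 * time.Second)
	}
	return "", fmt.Errorf("no proxy reachable")
}

func main() {
	known := []string{
		"http://ais-proxy-1.ais-proxy.ais.svc.cluster.local:8080",
		"http://ais-proxy-2.ais-proxy.ais.svc.cluster.local:8080",
	}
	fallback := "http://ais-proxy-0.ais-proxy.ais.svc.cluster.local:8080"
	if u, err := rejoin(known, fallback); err == nil {
		fmt.Println("joined via", u)
	}
}
```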

grmaltby commented 3 years ago

In the direct/discovery URL dance, I am reasonably sure the config I left preferred to probe the proxy clusterIP Service first and foremost: if any proxy is serving as an endpoint behind that clusterIP Service, it responds with an Smap, allowing target registration to proceed. So if we stop and restart all proxies (power-outage scenario), the expectation was that they would find a cached Smap, work out that this is not an initial deployment, and cluster to form endpoints behind the clusterIP, at which time targets could begin to retrieve an Smap.

The weakness in all that was that the responding endpoint provides an Smap that identifies the primary by IP, and if that IP is no longer valid (the primary was killed again), the target fails registration. But the fallback there was that the target container would exit and restart, hoping to have better luck with registration timing on the restart.

I'm not convinced that returning a DNS name really helps - if the pod dies then it still won't respond.
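A minimal sketch of the registration dance described in the previous comment, under assumed names: the target first asks the proxy clusterIP Service for an Smap, then registers with the primary listed in it; if the primary's address is stale, the container exits so Kubernetes restarts it and it retries from scratch. The endpoints, JSON shape, and port are illustrative, not AIStore's actual API.

```go
// Target registration sketch: clusterIP probe -> Smap -> register with primary,
// exiting on failure so the container restarts and retries.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type smap struct {
	PrimaryURL string `json:"primary_url"` // assumed field name
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	// 1. Probe the clusterIP Service: any live proxy endpoint behind it can answer.
	resp, err := client.Get("http://ais-proxy.ais.svc.cluster.local:8080/v1/smap")
	if err != nil {
		fmt.Fprintln(os.Stderr, "no proxy endpoint behind the Service yet:", err)
		os.Exit(1) // container restarts and retries registration later
	}
	var sm smap
	err = json.NewDecoder(resp.Body).Decode(&sm)
	resp.Body.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad smap:", err)
		os.Exit(1)
	}

	// 2. Register with the primary from the Smap. If the Smap still carries a
	//    stale IP (the primary was killed again), this fails and we fall back
	//    to the exit-and-restart loop.
	if _, err := client.Post(sm.PrimaryURL+"/v1/register", "application/json", nil); err != nil {
		fmt.Fprintln(os.Stderr, "primary unreachable:", err)
		os.Exit(1)
	}
	fmt.Println("registered with", sm.PrimaryURL)
}
```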

saiprashanth173 commented 3 years ago

Added support for DNS name in DirectURL (in master commit f2112c1).