devsisters / shardcake

Sharding and location transparency for Scala
https://devsisters.github.io/shardcake/
Apache License 2.0

recycled pod IP address causes rebalance failure #107

Closed dpennell closed 3 months ago

dpennell commented 5 months ago

In one of our small-ish dev clusters, the IP address of one of the sharding pods was recycled and assigned to another pod that was not part of the shardcake cluster.

We were able to resolve this by restarting the pod with the recycled ip address.

We have decreased rebalanceInterval and rebalanceRetryInterval in an attempt to reduce the likelihood of this happening again.

I think it would be good to use the pod UID as the pod identity.

ghostdogpr commented 5 months ago

When you configure selfHost in com.devsisters.shardcake.Config, you don't have to use the IP address, you can use the hostname instead.

dpennell commented 5 months ago

Doesn't this cache lookup require the use of an IP address? Some(FieldSelector.FieldEquals(Chunk("status", "podIP"), podAddress.host))

ghostdogpr commented 5 months ago

Indeed, you're right. I wonder if there is a way to query k8s by host name instead 🤔

ghostdogpr commented 5 months ago

I haven't tested but this might work if host is the pod name?

pods
  .getAll(config.namespace)
  .filter(_.metadata.flatMap(_.name).contains(podAddress.host))
  .runHead
  .map(_.isDefined)

dpennell commented 5 months ago

I think that should work. It will pull info for all the pods in a namespace. I don't know what the impact of this is relative to using a fieldSelector. There aren't many choices for field selectors: https://hoelz.ro/blog/which-fields-can-you-use-with-kubernetes-field-selectors

Another alternative is to configure both hostname and IP address. Use a podIP field selector and then filter by hostname.
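
A minimal sketch of that alternative, using plain Scala collections rather than the real Kubernetes client. `PodInfo`, `PodLookup`, `byPodIP` and `isAlive` are hypothetical names introduced only for illustration; the actual lookup would go through the pods API with a `status.podIP` field selector.

```scala
// Simplified stand-in for a Kubernetes pod (assumption: not the zio-k8s model).
final case class PodInfo(name: String, podIP: String)

object PodLookup {
  // Emulates the server-side `status.podIP` field selector: keep pods with this IP.
  def byPodIP(pods: List[PodInfo], ip: String): List[PodInfo] =
    pods.filter(_.podIP == ip)

  // Client-side check: the pod holding the IP must also have the expected name.
  // A recycled IP that now belongs to an unrelated pod is reported as not alive.
  def isAlive(pods: List[PodInfo], ip: String, expectedName: String): Boolean =
    byPodIP(pods, ip).exists(_.name == expectedName)
}
```

With this shape, the field selector keeps the query cheap, and the name check rejects an unrelated pod that has inherited the address.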

dpennell commented 5 months ago

I tested your suggestion and it does not work. This is because k8s pods don't get DNS entries, so you end up with "unresolved host" exceptions.

ghostdogpr commented 3 months ago

I had the same issue in a test environment today 😄 I think a good way to resolve it is to use a labelSelector. You can use labels to filter out IPs that have been reused by other services. I will expose this field.
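
A rough sketch of the idea behind label filtering, modeled with plain Scala collections. `LabeledPod`, `LabelFilter`, `matches` and `isAlive` are hypothetical names, and the label key and value in the usage note are made up; the real health check would pass a label selector to the Kubernetes API instead of filtering in memory.

```scala
// Simplified stand-in for a pod with labels (assumption: not the zio-k8s model).
final case class LabeledPod(name: String, podIP: String, labels: Map[String, String])

object LabelFilter {
  // Equality-based selector: every selector key must be present with the same value.
  def matches(pod: LabeledPod, selector: Map[String, String]): Boolean =
    selector.forall { case (key, value) => pod.labels.get(key).contains(value) }

  // A pod address counts as alive only if some pod has that IP AND carries the labels.
  def isAlive(pods: List[LabeledPod], ip: String, selector: Map[String, String]): Boolean =
    pods.exists(pod => pod.podIP == ip && matches(pod, selector))
}
```

For example, with a selector of `Map("app" -> "shardcake-pod")`, a pod from another service that reuses the same IP but lacks that label would be treated as not alive.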