[Question]: Unexpected pod disruptions of custom redis module

Panfactum / stack

The Panfactum Stack

https://panfactum.com

Other

8 stars 3 forks source link

[Question]: Unexpected pod disruptions of custom redis module #22

Closed jlevydev closed 1 week ago

jlevydev commented 1 week ago

Prior Search

[X] I have already searched this project's issues to determine if a bug report has already been made.

What happened?

We deploy a first party module that involves a similar Redis deployment to the Sentinal deployment Panfactum has and utilizes the same underlying chart. On our Master instance we have a Pod Disruption Budget with maxUnavailable = 0. However we consistently see this Pod rescheduled leading to an outage in our application's ability to communicate with Redis.

We added the PDB to our Redis deployment at 11:00AM EST Thursday and observed four disruptions to that deployment throughout Thursday and Friday:

11:24AM Thursday for 55 seconds
6:01PM Thursday for 5 seconds
6:39PM Thursday for 5 seconds
6:51PM Friday for 25 seconds

In 3 out of the four scenarios the failure was not long enough to trigger a fail-over in a HA Sentinal deployment and each time when inspecting what had happened in the cluster it seemed the pod was rescheduled to a new node managed by Karpenter. Currently it seems the pod has made a "stable" home on one of the dedicated instances not managed by Karpenter, but I'd like to understand whether this behavior is the expected behavior since Karpenter does not seem to be respecting the PDB.

See also the Karpenter docs discussing Disruption Budgets. It has language about Pods that "fail to shut down" which makes me think Karpenter will attempt to evict Pods and back off/retry if the Pods themselves are not successfully evicted.

TLDR: Karpenter is taking down my Redis cluster :(

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

fullykubed commented 1 week ago

Hi @jlevydev! Thanks for the outreach and sorry to hear you are having issues with Redis.

A few observations:

(1) It sounds like you are using your own custom Redis implementation and not the one provided by Panfactum. We'd encourage you to use the one provided by Panfactum as it is known to work. Can you try that and report back?

(2) Karpenter simply uses the Kubernetes eviction API when de-scheduling pods. The eviction API will never delete a pod if a PDB doesn't allow it. That behavior is built-in to core Kubernetes. You linked to the docs, and I just want to be sure you didn't miss this:

Karpenter respects Pod Disruption Budgets (PDBs) by using a backoff retry eviction strategy. Pods will never be forcibly deleted, so pods that fail to shut down will prevent a node from deprovisioning.

I think you might be jumping to conclusions about the PDBs not working with Karpenter. Checking the PDBs is a good intuition, but I can 100% confirm that the version of Karpenter we use will always respect your PDBs if you set them.

(3) This statement is a bit confusing to me: "In 3 out of the four scenarios the failure was not long enough to trigger a fail-over in a HA Sentinal deployment." You also state that you are using a module that is similar to Panfactum's... but this is not how Sentinel works in Panfactum. When a Redis master is gracefully terminated in the Panfactum module (such as during a normal eviction), sentinel will proactively switchover the master replica before the old master pod shuts down. There is no delay. Moreover, there is no "failure" that needs to be "long enough" unless the redis master is unexpectedly terminated or separated from the cluster (e.g., OOM, underlying hardware issue, network partition, etc.).

Ultimately, it sounds like there is something else happening here. Are you sure your pods aren't failing for some other reason? How did you know this is Karpenter related? Alternatively, can you provide a way to reproduce the issue you are seeing using a Panfactum redis module?

(4) You say ,"Karpenter is taking down my Redis cluster," but this is a bit vague. Are you saying there is a temporarily unavailable master, are you saying the entire cluster including the replicas are unavailable, are you saying the cluster is permanently downed and requires manual intervention?

(5) Can you share how you are connecting with redis? Perhaps post the connection configuration? You must use a sentinel-aware client if want 0 downtime failovers. It doesn't sound like you are doing that since you are experiencing temporary network disruptions.

fullykubed commented 1 week ago

I am also wondering if you are running your redis master on on-demand or spot instances?

If you are running on spot instances AND you have a PDB that prevents disrupting the master, then your master redis pods cannot be gracefully terminated. Ironically, this is because Karpenter is respecting your PDB, not ignoring it.

In this case, a spot disruption would essentially crash your redis master, resulting in a short amount of downtime. This is something that is prevented in the Panfactum module, but again, it sounds like you are not using it?

jlevydev commented 1 week ago

Hi @fullykubed, thanks for all the great feedback!

For some context, we are currently in the process of porting our previous non-Panfactum Redis set up into our Panfactum cluster. That set up was a single non-HA Redis Pod and you're correct that our application code used a non-Sentinal aware Client to connect to Redis.

Our current goal is to port our Redis set up into our Panfactum clusters and we're having a number of conversations internally about the best way to do that. We didn't use the Panfactum module since there are some diffs (specifically around AOF fsync timing) that we're not sure we'd like to remove. We're also still just learning about Sentinal and all the bells and whistles involved, so still in the learning/experimentation phase right now.

An outage happened again at 3:00AM last night (see some relevant logs below). From inspecting the logs I think that you're correct that neither Karpenter nor the Descheduler are responsible for the Master being rescheduled. I'm admittedly still a little confused why it's getting rescheduled then since in our old cluster we'd have a Redis instance persist for months. To be clear the Redis Pod is currently on the cluster's dedicated compute that is not managed by Karpenter. redis.csv descheduler.csv karpenter.csv monolith.csv

From the Redis logs it does seem that Redis is gracefully shutting down, so I think switching to a Sentinal aware Client may alleviate the outages we're seeing. We were trying to avoid changes to the application code in this migration but seems necessary. Going to implement that so I think this "issue" could be considered resolved once I validate, but generally curious if you have any ideas why this Pod would be rescheduled despite the PDB?

fullykubed commented 1 week ago

Thanks for the additional info. Let me address a few of the items:

(1) If you are only using a single redis instance with no replication, you do not need sentinel as sentinel is only for HA deployments.

(2) If you have the PDB set up properly and the underlying node was not disrupted, I cannot think of any reason why a healthy pod would be disrupted by something in the Panfactum stack. Every single component of the Panfactum stack uses the eviction API to remove Pods as that is the standard. As a result, a PDB would prevent terminations. We have pods in our systems running for multiple weeks without disruption, so I am quite confident it is possible to accomplish what you are looking for.

If you are 100% sure there was not an issue with the underlying nodes, you can search the kubernetes API server logs to see if anything issued a deletion request (instead of an eviction request). That would pinpoint definitively if something is misbehaving.

(3) One other thing to consider is that if you are experiencing resource pressure on your nodes, the kubelet is empowered to remove pods when necessary. Perhaps that is something to look into?

Additionally, pods can be gracefully terminated if they remain unhealthy or unready for too long. Have you verified that your redis pods were actually healthy at the time of the shutdown?

(4) While I understand that this was running without disruption previously, I think 99.9% uptime is about the best you can expect in a non-HA setup. I'd recommend using an HA deployment which would make arbitrary disruptions a non-issue.

(5) You mention not being able to run the Panfactum-provided module, but I think that is likely your best bet. If you share your redis.conf, we can add support for the settings you find important.

The Panfactum module resolves dozens of sharp edges when running the default helm charts, and it seems like you must just be running up against something we have already solved (even though the exact root cause still remains unclear).

jlevydev commented 1 week ago

To clarify we have shifted to a HA Sentinel set up but our workloads are still using the non-Sentinel aware Client which I believe is the root cause of our observed downtime.

The specific configuration we would want to add is appendfsync always since it's a priority in our organization that acknowledged writes are not dropped. We're willing to make the performance trade off associated with that. Maybe just accepting a list of extra args would be good practice so that people can add configuration as needed?

We're currently in the testing/learning phase working with Sentinel and understanding the ways in which we trade off risks of Data Loss for HA. A question we've had internally is "Does replicating our data in a HA manner mitigate the risk and reduce the need for the always AOF?".

That said happy to fold the Panfactum Sentinel set up into our testing tomorrow as we change our workloads to the Sentinel aware clients and report back!

On another note, I am still confused why we are seeing evictions. We currently don't have API Server logs, would the correct logs to turn on in the aws_eks module be api and scheduler? I still think we're observing what could be a bug and would like to get a definitive root cause. I am certain that the Redis Pod that was evicted was healthy ahead of its eviction as was its Node.

Thanks for all the help so far, very informative!

jlevydev commented 1 week ago

Also if I'm not mistaken we could use Vault to issue dynamic credentials for these workloads, would you have any docs you could point us to in order to use this mechanism rather than static credentials? Any guidance would be helpful!

fullykubed commented 1 week ago

We can add the ability to add extra redis configuration options to the Panfactum module today. I will post here once that has been tested and shipped.

You mention that data durability is a priority. Be aware that redis is not designed for durability. Critically, replication only occurs on an asynchronous basis (no way to change this). appendfsync always will only ensure that acknowledged writes are synced to a single node. In AWS, a gp3 EBS volume is only 99.8% durable (we recommend at least 99.999% for business critical data). While it is highly unlikely that you would lose your entire dataset if you are replicating it, your system design should account for losing arbitrary records. Moreover, reading from a replica means you are always running the chance of receiving stale data. Depending on your use case, we'd recommend either PostgreSQL or Kafka if you need a permanent system of record.

If you do proceed with our Redis module, you will want to ensure the following to achieve maximum uptime:

We'd strongly recommend against installing a permanent PDB preventing disruptions of the master. It is better to allow graceful terminations than not allow terminations in the case of common operations like node shutdowns. Moreover, you need to be able to occasionally evict the master in order to apply security updates. In our module, failover is immediate; there is no delay.
Your clients must be sentinel-aware if you want them to be able to immediately detect a change in masters.
You clients should have client-side retries enabled. We'd recommend retrying for at least 30 seconds before triggering a failure warning.
Currently, we only support dynamic credentialing for human users. Dynamic credentialing for services is still in alpha state. That said, we do include periodic root credential rotation so the security delta is not that large. The credentials are supplied in the output of the Panfactum redis module. (Example)

As for identifying the cause of the issue you were seeing, it is clear there is something unexpected occurring. Here is what I know so far:

You didn't experience disruptions previously, but two things are changing at once: you are moving onto the Panfactum stack and you are also in the process of experimenting with different settings in your module.
Setting aside redis, we do not have a simple example that replicates the issue. We are able to run long-lived trivial pods on the Panfactum reference architecture.
We don't have any logs or metrics pointing to anything that would implicate the stack vs something more benign like spot instances or resource contention in a node.

While it is absolutely possible there is a problem in the stack that is triggered by your setup, I think we'd want something more definitive before jumping to that conclusion. Here is the information I would examine to try to pinpoint a root cause:

API server audit logs (api in aws_eks)
Kubelet logs from the node where the pod was disrupted
Kubernetes events
Redis pod metrics (healthchecks, resource usage, restarts)
Node metrics from the node where the pod was disrupted (healthchecks, resource usage, restarts)

Analyzing these data in the time leading up to the disruption will provide more information about what specifically occurred and why. From there, I think we should be able to make a conclusion about the root cause.

jlevydev commented 1 week ago

Appreciate the guidance. We'd like to minimize data loss in our architecture but we understand the risks inherent with Redis as a store of data and are comfortable with that. We primarily use it as the basis for our Queuing system so we only use commands that write to the DB. As a result the stale data from replicas risk is not an issue. If we lost our entire Redis cluster it would be inconvenient but not catastrophic.

We are also aware that you can't set Sentinel to synchronously replicate data which was a consideration that lead to our original non-HA set up using fsync always. I am not sure we'll use the fsync always option in a HA set up since there's no guarantee that the write to disk is replicated and if the master is restarted for any reason it's likely that Sentinel will promote a replica (if I'm understanding the fail over set up correctly).

I think over the next week we'll be doing our own testing to get a sense of how quickly replication occurs and to get an understanding of data loss in different scenarios like the graceful shutdown of a master or a fail over.

I looked back again at the windows in time and for each one the Node's were under a large amount of Memory pressure. I didn't realize until some reading this morning that evictions for Node pressure don't respect PDBs. Going to still look out for any other issues, but it seems likely that this was as a result of Node pressure. Thanks again for the continued support!

fullykubed commented 1 week ago

The redis module has been updated to include the ability to pass arbitrary configuration options.

Yes, there will be no way to eliminate the possibility of data loss with redis while also making it 100% available. appendfsync everysec is my personal recommendation as it will not introduce any performance bottleneck (esp when using networked drives) while also giving very reasonable durability.

As far as replication timings go, there is no data loss during a graceful shutdown. There is a synchronization event that occurs when the master role is handed off. Data loss would occur during non-graceful shutdowns which should be incredibly rare if NOT using a PDB for the master. We have not experienced a single one despite running redis instances on spot instances without any disk storage for months at a time.

As far as the issue you were seeing goes, I am glad you were able to pinpoint the root cause. As you saw, nodes only have finite resources and if pods exhaust them they will need to be rescheduled in order to avoid a total node crash. This preemption is done in order of priority class. Note that resource contention on another node might even cause pods on other nodes to be removed to make room for rescheduling higher priority pods.

This is one of many ways that a pod might be disrupted even when using PDBs. Almost always this is in order to prevent an even greater disaster from occurring (i.e., all the pods on the nodes OOMing).

This is why it is critical to (a) appropriately set resource requests and allocate appropriate overhead and (b) run HA services. On the Panfactum stack, when using Panfactum-provided modules, this is all taken care of for you, but when you are developing your custom logic, you will need to account for it. Importantly, again, do not set PDBs that prevent a pod from ever being evicted as otherwise the Panfactum stack cannot dynamically update resource requests based on changing requirements.

jlevydev commented 1 week ago

That's all super helpful. Does Redis make guarantees about replication in a graceful shutdown scenario? From my reading it seemed like there was a 10 second grace window for replicas to catch up but they didn't make explicit guarantees for replication on shutdown.

I've made the changes in the service to use a Sentinel aware client however when I go to deploy the module I seem to get an error. See the error below as well as my existing module configuration:

Screenshot from 2024-05-06 17-24-52

`source = "github.com/Panfactum/stack.git?ref=c2b59cd27d71b35d5cad83e2a23901cd3fbedd5a/packages/infrastructure//kube_redis_sentinel" pull_through_cache_enabled = true namespace = local.namespace redis_flags = ["--appendfsync", "everysec"] persistence_enabled = true

environment = var.environment pf_root_module = var.pf_root_module region = var.region is_local = var.is_local pf_stack_version = var.pf_stack_version pf_stack_commit = var.pf_stack_commit`

fullykubed commented 1 week ago

There are absolutely no explicit guarantees about data durability when using redis. You should plan on losing records if you are using redis, regardless of how it is deployed. Even when using WAIT, redis will absolutely still lose data in some scenarios (docs).

However, in our setup we pause writes temporarily when gracefully failing over to allow the system a few moments to synchronize and prevent data loss (see redis docs). In the normal course of operations, data loss is unlikely (but again, always possible).

As for your current issue, you need to ensure you are running the latest version of the stack for all your modules. It looks like you have an out-of-date service mesh. Please check the discord for announcements on changes in the edge release channel.

fullykubed commented 1 week ago

@jlevydev You mentioned on our last call that this issue was resolved by updating the stack to the latest version. Closing this issue.