fission / fission-workflows

Workflows for Fission: Fast, reliable and lightweight function composition for serverless functions
Apache License 2.0

Deleting old completed invocations #74

Open soamvasani opened 7 years ago

soamvasani commented 7 years ago

We need a way to either manually or automatically clean up old workflow invocation data.

erwinvaneyk commented 7 years ago

Related to #4

erwinvaneyk commented 6 years ago

Partially fixed in #103 - the workflow engine no longer stays subscribed to the channels of completed invocations.

TODO
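
For context, a minimal stan.go sketch of what "no longer staying subscribed" could look like on the client side; the watchInvocation helper, the subject naming, and the completion check are illustrative assumptions, not the actual engine code from #103.

    package example

    import (
        stan "github.com/nats-io/stan.go"
    )

    // Illustrative sketch only (not the actual engine code): drop the channel
    // subscription once an invocation reaches a terminal state, so NATS Streaming
    // is left without active subscribers on that channel and --max_inactivity can
    // eventually garbage-collect it. The subject naming and the isCompleted check
    // are assumptions.
    func watchInvocation(sc stan.Conn, invocationID string, isCompleted func([]byte) bool) error {
        subject := "invocation." + invocationID // assumed subject naming
        done := make(chan struct{}, 1)
        sub, err := sc.Subscribe(subject, func(m *stan.Msg) {
            if isCompleted(m.Data) {
                select {
                case done <- struct{}{}:
                default:
                }
            }
        }, stan.DeliverAllAvailable())
        if err != nil {
            return err
        }
        <-done
        // Unsubscribe (rather than Close) also removes any durable state for this
        // subscription, leaving the channel fully inactive.
        return sub.Unsubscribe()
    }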

soamvasani commented 6 years ago

If it's not affecting users and not breaking workflows, let's move this to a later milestone.

ghost commented 6 years ago

This is one of the problems we hit frequently, and we have worked around it in ways that don't feel nice. Here is our feedback:

After NATS has been up and fission-workflows has been under constant use for a while, NATS starts to use more and more RAM (8-16 GB).

To solve this problem, we first switched to the NATS file store type, but then realized it still keeps everything in memory. So we decided to use the memory store for the time being, because the file store is impossible to clean up when there is an error and NATS restarts.

Our workaround for keeping memory usage bounded is to set Kubernetes memory limits, which restart NATS when it exceeds them. However, this way we sometimes hit the 16 GB memory limit too fast.

Then we realized the default values in the NATS config are not that good. We first made all limits unlimited, because we don't want to hit them (as explained in #4):

          # Unlimited number of channels
          "--max_channels", "0",
          # Unlimited number of subscriptions per channel
          "--max_subs", "0",
          # Unlimited number of messages per channel
          "--max_msgs", "0",
          # Unlimited messages total size that can be stored per channel
          "--max_bytes", "0",

Then we also enabled deletion of inactive channels. Sometimes it worked and cleared the memory, and sometimes it did not (even after waiting more than 10 minutes, completed invocation channels were not cleared):

          # Max inactivity (no new message, no subscription) after which a channel can be garbage collected
          # We delete channels that have been inactive for more than 10 minutes
          "--max_inactivity", "10m",

So there may be some subscriber somewhere sending NATS a heartbeat or something that keeps completed invocation channels active and prevents them from being deleted.
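
One way to check that suspicion is to ask the NATS Streaming monitoring endpoint (assuming monitoring is enabled on the server, e.g. on port 8222) which channels still report active subscriptions. A minimal sketch; the service URL and JSON field names are assumptions to verify against your NATS Streaming version:

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Sketch: list channels that still report active subscriptions, via the
    // NATS Streaming monitoring endpoint. The URL, the ?subs=1 parameter, and
    // the JSON field names below are assumptions based on the monitoring docs.

    type channelz struct {
        Name          string `json:"name"`
        Msgs          int    `json:"msgs"`
        Bytes         int    `json:"bytes"`
        Subscriptions []struct {
            ClientID string `json:"client_id"`
        } `json:"subscriptions"`
    }

    type channelsz struct {
        Channels []channelz `json:"channels"`
    }

    func main() {
        resp, err := http.Get("http://nats-streaming:8222/streaming/channelsz?subs=1")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var cz channelsz
        if err := json.NewDecoder(resp.Body).Decode(&cz); err != nil {
            panic(err)
        }
        for _, ch := range cz.Channels {
            if len(ch.Subscriptions) > 0 {
                fmt.Printf("channel %q: %d msgs, %d bytes, %d active subscription(s)\n",
                    ch.Name, ch.Msgs, ch.Bytes, len(ch.Subscriptions))
            }
        }
    }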

Then we also enforced deletion of messages older than 10 minutes, because in our use case an invocation should fail or succeed within 10 minutes anyway.

          # Max duration a message can be stored in a channel (the same as max inactivity)
          "--max_age", "10m",

I wish setting max inactivity were enough to clear up old invocations, but somehow it is not. Maybe this is a bug @erwinvaneyk, but I am not sure.

At the moment we have 2 node pools. One has big-memory nodes and runs NATS, fission core, and everything else outside of the fission-function namespace, to make sure critical pods have enough resources. The other uses smaller nodes to run the fission functions themselves. BTW, the fission-workflows engine pod also runs on the big nodes, because it cannot scale horizontally right now either.

Also, as mentioned in #4, we need a more production-ready NATS. We probably need the clustering (I am not sure if this is just replication of messages or distribution of them), partitioning (for distributing RAM usage), and persistence (for recovery from failure) features of NATS to make sure all these memory and failure problems are solved.

In that case, it means maintaining NATS instances and making sure they run on bigger nodes with enough resources. I am sure not everyone wants to maintain a running message queue at scale; maybe adding cloud message queue (Google Pub/Sub) support is less trouble overall.

Thanks for reading our long feedback 👍

erwinvaneyk commented 6 years ago

Hi @thenamly - this is a great analysis. I am writing this reply as I read through it.

After NATS has been up and fission-workflows has been under constant use for a while, NATS starts to use more and more RAM (8-16 GB).

To solve this problem, we first switched to the NATS file store type, but then realized it still keeps everything in memory.

1) This might be an issue to raise with the NATS project. I would expect that using a persistence measure, whether FILE or SQL, would decrease the pressure on memory.

2) As I understand it, NATS by design fills up whatever memory it can get. Once at max memory, it should start deleting the oldest messages. I think the trick is to set the limits in NATS a bit below the pod resource limits, to avoid OOMs and give NATS the opportunity to start garbage collection. Did you observe this behavior?

I wish setting max inactivity were enough to clear up old invocations, but somehow it is not. Maybe this is a bug @erwinvaneyk, but I am not sure.

This sounds like a bug in NATS Streaming; you might want to file it there. As I mentioned above, you might want to try using --max_bytes too. Set it a bit under the resource limits assigned to the NATS pods.

we need a more production-ready NATS

It is an interesting issue, which you can attack in a couple of ways.

  1. Dive deeper into NATS performance tweaking. You would probably need to talk to the NATS guys too.

  2. We write a Google Pub/Sub implementation of the event store. However, I am not sure how low-latency their Pub/Sub service is, especially since in our case the publisher is oftentimes the interested subscriber as well.

  3. Another option we could investigate, if fault-tolerance is not of utmost importance, is to improve the purely in-memory event store with a slower database implementation behind it that persists the events asynchronously. In this case you gain a lot of performance and the backing event store matters less performance-wise, but you do risk losing active flows if the workflow engine crashes. It might be an idea to offer this as an option: if faster: true, the engine does not wait for events to persist; if faster: false, it requires full persistence of events before continuing (cc @xiekeyang and I have discussed adding this too). A rough sketch of this idea follows below.
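
To make option 3 concrete, here is a rough write-behind sketch of the faster: true/false toggle; the Event type and EventStore interface are hypothetical stand-ins, not the actual fission-workflows types:

    package eventstore

    import "context"

    // Hypothetical sketch of the "faster: true/false" idea above. The Event type
    // and EventStore interface are illustrative stand-ins, not the actual
    // fission-workflows API.

    type Event struct {
        Key  string
        Data []byte
    }

    type EventStore interface {
        Append(ctx context.Context, ev Event) error
    }

    // AsyncStore persists events to a slower backing store. With Sync=true
    // ("faster: false") it waits for the backing store; with Sync=false
    // ("faster: true") it persists in the background.
    type AsyncStore struct {
        Backing EventStore
        Sync    bool
        queue   chan Event
    }

    func NewAsyncStore(backing EventStore, sync bool) *AsyncStore {
        s := &AsyncStore{Backing: backing, Sync: sync, queue: make(chan Event, 1024)}
        go s.drain()
        return s
    }

    func (s *AsyncStore) Append(ctx context.Context, ev Event) error {
        if s.Sync {
            // Full persistence before continuing: active flows survive engine crashes.
            return s.Backing.Append(ctx, ev)
        }
        // Write-behind: lower latency, but events still in the queue are lost
        // if the engine crashes.
        select {
        case s.queue <- ev:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    func (s *AsyncStore) drain() {
        for ev := range s.queue {
            // A real implementation would need retries/backoff and shutdown handling.
            _ = s.Backing.Append(context.Background(), ev)
        }
    }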

What are your thoughts on these options?

ghost commented 6 years ago
  1. This might be an issue to raise with the NATS project. I would expect that using a persistence measure, whether FILE or SQL, would decrease the pressure on memory.

What I understand from their docs is: "memory/file/sql are simple interfaces; if you don't like them, you can add your own".

As I understand it, NATS by design fills up whatever memory it can get. Once at max memory, it should start deleting the oldest messages. I think the trick is to set the limits in NATS a bit below the pod resource limits, to avoid OOMs and give NATS the opportunity to start garbage collection. Did you observe this behavior? ... As I mentioned above, you might want to try using --max_bytes too. Set it a bit under the resource limits assigned to the NATS pods.

--max_bytes is a per-channel limit, not a limit for the whole NATS store. I thought it was for the whole store too; I tried setting it to 14 GB while keeping the k8s limit at 16 GB, and then realized from the docs that it is per channel. So you can only cap NATS memory as max_channels x max_bytes (for example, 100 channels at 150 MB each already allows ~15 GB in the worst case), which gets tricky and feels like an inflexible hack.

  1. Dive deeper into NATS performance tweaking. You would probably need to talk to the NATS guys too.

Performance tweaking in our case is mostly about garbage collection and distributing RAM usage. I think they did a fair job by providing all these options. What we could do further is use the partitioning etc. options of NATS, but I see diminishing returns as I invest more time in NATS. Also, maintaining and running NATS instances will probably cost something in total.

  3. ... faster: true/false ...

I think this is a complex solution to a simple problem that should be solved by more specialized services (message queues) outside of the workflow engine. It may be redundant to add this and maintain it over time, with all the bugs it brings.

The only benefit is probably decreased latency, but in our use case the latency that NATS brings to the table is negligible. We could spend the same effort on "doing one thing well" instead of trying to implement a communication/message storage layer.

  2. We write a Google Pub/Sub implementation of the event store. However, I am not sure how low-latency their Pub/Sub service is, especially since in our case the publisher is oftentimes the interested subscriber as well.

Google Pub/Sub and other cloud message queue support is probably the option that gives the most benefit for the least effort of the three. I am not sure about their channel/subscription creation latency either, but message delivery should be fast enough.

What we could do before starting an implementation is benchmark specific operations (channel/subscription creation, message delivery from publisher to subscriber, ...) and compare them with NATS, for example.
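
A rough starting point for such a benchmark, assuming a local NATS Streaming server with cluster ID test-cluster and the stan.go client, could time subscribe + publish + receive on a fresh channel:

    package main

    import (
        "fmt"
        "time"

        stan "github.com/nats-io/stan.go"
    )

    // Rough benchmark sketch. Assumptions: a reachable NATS Streaming server at
    // nats://localhost:4222 with cluster ID "test-cluster", and the stan.go client.
    // A real comparison with e.g. Google Pub/Sub would repeat this across many
    // channels and message sizes.

    func main() {
        sc, err := stan.Connect("test-cluster", "bench-client", stan.NatsURL("nats://localhost:4222"))
        if err != nil {
            panic(err)
        }
        defer sc.Close()

        received := make(chan struct{}, 1)
        channel := fmt.Sprintf("bench-%d", time.Now().UnixNano()) // fresh channel per run

        start := time.Now()
        sub, err := sc.Subscribe(channel, func(m *stan.Msg) {
            select {
            case received <- struct{}{}:
            default:
            }
        })
        if err != nil {
            panic(err)
        }
        defer sub.Unsubscribe()

        if err := sc.Publish(channel, []byte("ping")); err != nil {
            panic(err)
        }
        <-received
        fmt.Printf("subscribe + publish + receive on a fresh channel took %v\n", time.Since(start))
    }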

Overall, I don't expect it to be unacceptably bad, and we would get rid of the headache of maintaining it, running it on big nodes to scale to a higher number of invocations, etc. We could invest the same time in improving other bits of fission-workflows. We are probably OK with short-term workarounds until it's solved.

However, I know there are also people out there running fission on bare metal, so we still need to consider improving the default NATS deployment for them.

ghost commented 6 years ago

I checked some benchmarks of Google Pub/Sub from Spotify and others. Maybe it's not that good :(

erwinvaneyk commented 6 years ago

What I understand from their docs is: "memory/file/sql are simple interfaces; if you don't like them, you can add your own".

--max_bytes is a per-channel limit, not a limit for the whole NATS store. I thought it was for the whole store too; I tried setting it to 14 GB while keeping the k8s limit at 16 GB, and then realized from the docs that it is per channel. So you can only cap NATS memory as max_channels x max_bytes, which gets tricky and feels like an inflexible hack.

Mmmh, I did not know that it was a per-channel limit. Weird that they don't provide a max_bytes for the overall limit. It seems to me like a common configuration you would want to set for any NATS cluster, or do they have other features to protect against OOMs?

Performance tweaking in our case is mostly about garbage collection and distributing RAM usage. I think they did a fair job by providing all these options. What we could do further is use the partitioning etc. options of NATS, but I see diminishing returns as I invest more time in NATS. Also, maintaining and running NATS instances will probably cost something in total.

Agreed!

What we could do before starting an implementation is benchmark specific operations (channel/subscription creation, message delivery from publisher to subscriber, ...) and compare them with NATS, for example.

There might be reports or research papers that have done this benchmarking before.

I checked some benchmarks of Google Pub/Sub from Spotify and others. Maybe it's not that good :(

Interesting! Then I think we should start thinking about option 3, to create a hybrid solution that does not depend solely on the message queue. There are a lot of options in that space.