apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

Allow limiting DB bloat by excluding response from Activation record in some cases #4626

Open tysonnorris opened 5 years ago

tysonnorris commented 5 years ago

Using a single db for many tenants that use web/blocking actions, we end up with:

We plan to introduce a PR to:

We need to investigate potential problems with sequence/composition usage (the scope of actions where the response would be disabled may need to include these as well).

style95 commented 5 years ago

Regarding this issue, I think the main cause is the limited scalability of CouchDB. If CouchDB receives too many requests, it cannot handle them properly and sometimes even crashes.

Also, as more and more data is stored, CouchDB becomes increasingly likely to be unavailable. It lacks the functionality to manage existing old (unused) data. And since OW depends on CouchDB "views", we need to periodically trigger view indexing to keep CouchDB from crashing while it indexes too much data at once.
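A rough sketch of that periodic view warming could run on a timer like the one below; the endpoint, database, and view names here are hypothetical placeholders, not OpenWhisk's actual design documents:

```python
# Sketch: querying a CouchDB view forces it to index documents written
# since the last build, so the indexing work stays incremental.
# Endpoint, database, and view names are hypothetical placeholders.
import time
import requests

COUCH = "http://admin:secret@couchdb:5984"               # assumed endpoint
VIEW = COUCH + "/activations/_design/acts/_view/byDate"  # hypothetical view

while True:
    # limit=0 returns no rows; the request only triggers the index build.
    requests.get(VIEW, params={"limit": 0}, timeout=60)
    time.sleep(300)  # warm every 5 minutes
```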

If the backend store scaled well enough to handle all the data and made it easy to delete unnecessary (old) data, that would solve many parts of this issue.

For example, if we introduce a scalable datastore such as ElasticSearch, it would greatly alleviate the situation.

I have observed many issues with CouchDB; here at Naver we replaced it with ElasticSearch and have observed no issues so far.

It outperforms CouchDB in many aspects, such as:

  1. Full-text search of activations (including logs)
  2. Better scalability
  3. Easy manipulation of data, especially deleting old (unused) data via index aliases (see the sketch below).
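
A minimal sketch of point 3, assuming daily time-based indices behind a single alias (index names and the 90-day window are illustrative):

```python
# Sketch: write activations into daily indices that all sit behind one
# "activations" alias; retention is then a cheap whole-index delete
# instead of a per-document purge. All names here are illustrative.
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

today = f"activations-{date.today():%Y.%m.%d}"
if not es.indices.exists(index=today):
    es.indices.create(index=today)
es.indices.put_alias(index=today, name="activations")  # queries hit the alias

# Drop the index that just aged past the assumed 90-day retention window.
expired = f"activations-{date.today() - timedelta(days=90):%Y.%m.%d}"
if es.indices.exists(index=expired):
    es.indices.delete(index=expired)
```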

AFAIK, @dubee had worked on an ElasticSearchActivationStore some time back, but it was not completed for some reason. I suspect there may be some historical reason to stick with CouchDB?

If there is no other historical reason to keep CouchDB, I would like to bring up the ElasticSearchActivationStore again, and we are willing to contribute it as well.

rabbah commented 5 years ago

👍

sven-lange-last commented 5 years ago

Thanks for the proposal - it makes a lot of sense.

We also see house-keeping problems with activations in CouchDB - we actually use Cloudant. A concept that fully removes activation documents from CouchDB after a retention period is complex and consumes a lot of CPU on the database cluster. In addition, a large number of activation documents consumes a lot of space. As a consequence, storing activations can get pretty expensive if you run a large-scale production environment.

When going forward with the proposal of not storing activations for blocking + successful + "not timed out on the front-end" invocations, we need to check with clients whether their existing workloads would break with such a change. Migration and evolution of the system are always a challenge when running a public production environment.

We had in mind to introduce a kind of activation store throttling - for example, only storing a limited number of activations per minute. This gives you a nice developer experience when starting with OpenWhisk and trying things out. But when running large production workloads, the workload owner would have to take care of storing action results - it is no longer done by the platform.
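A minimal sketch of that throttling idea (not actual OpenWhisk code; the per-namespace keying and the 100/minute limit are assumptions):

```python
# Sketch: store at most N activations per namespace per minute; anything
# over the limit is simply not persisted. Limit and keying are assumed.
import time
from collections import defaultdict

LIMIT_PER_MINUTE = 100  # assumed limit

_window_start = defaultdict(float)
_stored_count = defaultdict(int)

def should_store(namespace: str) -> bool:
    now = time.time()
    # Start a fresh one-minute window when the old one has elapsed.
    if now - _window_start[namespace] >= 60:
        _window_start[namespace] = now
        _stored_count[namespace] = 0
    if _stored_count[namespace] < LIMIT_PER_MINUTE:
        _stored_count[namespace] += 1
        return True
    return False  # over the limit: the platform skips this activation
```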

Your proposal of not storing blocking + successful + "not timed out" invocations has the big advantage that you will always receive activation results, either in the response of your blocking activation or by explicitly asking for them. With activation store throttling, there are situations where the platform does not give you the activation result at all.

Regarding the ElasticSearchActivationStore - we recently moved from an ELK-based logging stack to LogDNA, which is based on a different technology. That's why the ElasticSearchActivationStore is no longer in focus for us. At the moment, we are storing activations in our databases as well as forwarding them to the client's LogDNA logging space.

I think there are two valid strategies in this area: reducing the number of activations as well as using better options for storing them.

style95 commented 5 years ago

@sven-lange-last All makes sense to me.

We also introduced a flag (volatile) to let users decide whether to store their activations when we were still using CouchDB as the activation store. At that time it was an optional flag, so if a user did not explicitly set it, all activations were stored. Since it was disabled by default, no one tried to use it, and we had to urge heavy users to adopt it. Since the main problem is the scalability of the system, users should not have to think about system issues; they always wanted to see their results regardless of the environment (dev/production). Even if we enabled the feature by default, we could not stop users from disabling the flag (i.e. storing their activations) all the time.

We stored failed activations even when the flag was enabled, for better debugging, but users still sometimes wanted to see their successful activation results. And not only the activation results: they also wanted to see metadata such as initTime, waitTime, etc. (If we want to reduce the size of activations, one option could be to store the metadata and just skip the results.)

So we dropped the volatile flag, stopped using the CouchDbActivationStore, and introduced the ElasticSearchActivationStore.

Regarding reducing the number of activations, I think there are also two aspects.

  1. Reducing the number of activations/second
  2. Reducing the number of stored activations.

With CouchDB, we observed issues in both cases. (Even with 10GB of data, CouchDB sometimes started to drag.)

Regarding number 1, even if we reduce the volume, there will be a limit as the cluster grows. We still need to handle (store) some portion of the activations, and as the cluster scales, that portion will eventually be more than CouchDB can handle. I still agree this would be a good option no matter which datastore is used, but we need to secure some level of scalability at some point.

Regarding number 2, most users query relatively recent activations, typically from within the last 1~3 months. It would not be cost-effective to store all activation data for 1~2 years. So we decided to keep all activations, but only relatively recent ones. (I think many other OW operators have already taken a similar approach.)

So I think our datastore should be scalable enough to handle a certain level of requests/s, and it should be able to take care of "cold" data that is rarely accessed. ElasticSearch is a great option for this.

With regard to log collection, I have been curious whether there really is a case where one function generates 10MB of logs. I am not sure it is realistic. As invocations are fine-grained in the serverless world, I think log sizes should be small as well.

Currently, logs are collected asynchronously, separately from activation storing, even though logs are also included in the activation data. I think this is because the log size can be up to 10MB. If we limited the maximum log size to something relatively small such as 1MB or 512KB (which I think is still quite big enough), we could store the logs together with the activation in one request.
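A sketch of that single-request path, assuming an illustrative 512KB cap and an ElasticSearch-backed store (the field names are not a fixed schema):

```python
# Sketch: truncate logs to a small cap and embed them in the activation
# document, so metadata, result, and logs land in one indexing request.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint
MAX_LOG_BYTES = 512 * 1024                   # assumed cap from the comment

def store_activation(activation: dict, log_lines: list) -> None:
    kept, used = [], 0
    for line in log_lines:
        used += len(line.encode("utf-8"))
        if used > MAX_LOG_BYTES:
            kept.append("... logs truncated ...")
            break
        kept.append(line)
    activation["logs"] = kept
    # One request stores the whole record, logs included.
    es.index(index="activations", id=activation["activationId"],
             document=activation)
```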

Storing logs with ELK is great, but storing them along with activations in ElasticSearch would give users additional value: they could query data using logs in conjunction with metadata. (e.g. they might want to see logs for activations whose waitTime is bigger than 1s.)
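For instance, that combined query could look roughly like this sketch (the waitTime and logs field names are assumptions about the document mapping):

```python
# Sketch: full-text search over logs, restricted to activations whose
# waitTime exceeded 1s (stored in milliseconds here, by assumption).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

resp = es.search(
    index="activations",
    query={
        "bool": {
            "must": [
                {"match": {"logs": "error"}},           # full-text on logs
                {"range": {"waitTime": {"gt": 1000}}},  # waitTime > 1s
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["activationId"], hit["_source"]["waitTime"])
```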

tysonnorris commented 4 years ago

So far my plan has been to disable storing the response in the Activation, as opposed to not storing the Activation at all. That is, for every activation there would still be an Activation document created in the DB; however, the response.result field would be empty in some cases. This is different, I think, from what is described by @style95 and @sven-lange-last above, but I'm open to discussion if there is a preference for removing the Activation completely in the cases where I'm planning to only remove the response.result field.
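To make the shape concrete, an activation document under this plan might look like the following illustrative sketch; every invocation still produces a record, only response.result is dropped when the removal conditions apply:

```python
# Illustrative record only; field values and layout are examples, not
# the authoritative OpenWhisk schema.
activation_record = {
    "activationId": "0123456789abcdef",  # example id
    "namespace": "guest",
    "name": "myAction",
    "start": 1574000000000,
    "end": 1574000000250,
    "response": {
        "status": "success",
        "statusCode": 0,
        # "result" intentionally omitted for blocking + successful
        # invocations that completed within the timeout
    },
    "annotations": [
        {"key": "waitTime", "value": 12},  # metadata is still kept
        {"key": "initTime", "value": 80},
    ],
}
```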

FYI, we are using Cosmos, and the data has become unwieldy partly because of size, partly because of the number of documents, and partly because of query load. This change would address the size part, but not the others.

After some experimentation, the set of conditions where result would not be stored is:

This last one is a bit complicated due to the mixing of blocking/non-blocking behavior - I wish these were explicit annotations at the action level instead of activation-time parameters. With the current system, we need to coordinate the max blocking wait time (currently 60s) with the result-removal timeout (I test with 45s) so that we have some buffer time to be sure that if we get a result, it makes it back to the user before the user gets the non-blocking treatment for their blocking activation. (I.e. in my config, I assume that the path from activation completion to controller response to user takes less than 15s; see the sketch below.)
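A sketch of that coordination with the numbers from this comment (the helper and its signature are hypothetical):

```python
# Timing budget from the comment: 60s max blocking wait, 45s
# result-removal timeout, leaving a 15s buffer for the completion ->
# controller -> user path. The helper below is hypothetical.
MAX_BLOCKING_WAIT_S = 60
RESULT_REMOVAL_TIMEOUT_S = 45
BUFFER_S = MAX_BLOCKING_WAIT_S - RESULT_REMOVAL_TIMEOUT_S  # 15s

def may_drop_result(blocking: bool, succeeded: bool, duration_s: float) -> bool:
    # The result is dropped only when the caller is guaranteed to have
    # already received it inline: a blocking, successful invocation that
    # finished inside the removal timeout (and thus well inside the 60s
    # blocking window, thanks to the buffer).
    return blocking and succeeded and duration_s < RESULT_REMOVAL_TIMEOUT_S
```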

PR coming soon, once the tests stop timing out.