Lambda use-case - Githubissues

moolen commented 5 years ago

We use this SDK in a AWS ApiGateway/Lambda context.

AWS recommends that all callbacks/background processes should complete before the function returns [1].

So we should implement it like in this example:

exports.handler = (event, _, callback) => {
    const { initialize, isEnabled } = require('unleash-client');
    const instance = initialize({
        url: process.env.TOGGLE_ENDPOINT,
        appName: 'my-app',
        instanceId: 'my-lambda',
        refreshInterval: 15,
    });

    instance.on('ready', () => {
        const enabled = isEnabled(event.toggle, event.context)
        // do something with $enabled
        console.log(`flag ${event.toggle} enabled: ${enabled}`)

        // destroy
        instance.destroy();
        callback(null);
    });
}

This means, that each invocation has to call initialize() in a lambda handler and also call instance.destroy() when it completes to properly clean up callbacks and resources. This inevitably issues unnecessary requests to the api server [2] since the user has no access to the internal storage implementation and the user has to wait until unleash as a whole is ready.

We could call initialize() outside the lambda handler - but we then might lose metrics from that instance which is not really what we want. For now, it's a OKish trade-off for us.

Example implementation:

const { initialize, isEnabled } = require('unleash-client');
const instance = initialize({
    url: process.env.TOGGLE_ENDPOINT,
    appName: 'my-app',
    instanceId: 'my-lambda',
    refreshInterval: 1,
});

instance.on('error', (err) => {
    console.log("error ", err)
});

exports.handler = (event, _, callback) => {
    const enabled = isEnabled(event.toggle, event.context)
    console.log(`flag ${event.toggle} enabled: ${enabled}`)
    callback(null);
}

This has the following drawbacks:

during the first invocation there are no feature toggles yet
once AWS wants the lambda to die the metrics from the last $metricsInterval interval are not sent

What we want is to have control over the caching layer and/or the toggle update cycle in order to address this issue.

We'll discuss this internally first and i'll come back with a proposal. For now i think it's valueable to have this "limitation" documented here.

Please tell me if that works for you or when you have input on that topic.

[1] https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-code.html [2] https://github.com/Unleash/unleash-client-node/blob/master/src/repository.ts#L51

ivarconr commented 5 years ago

It's awesome that you are looking in to this. I have been thinking that the serverless use case might challenge the design, but have not had the time to think out a solution for this yet.

We could call initialize() outside the lambda handler - but we then might lose metrics from that.

yes, but I thought part of the idea of a lambda-function was that it will be around for some time as long as the function was "hot"? Even for function there will be an initialisation cost I assume. But yes I know the docs states that you should not rely on it.

What we want is to have control over the caching layer and/or the toggle update cycle in order to address this issue.

I think this is part of the solution. I had some idea at some point that all toggles would be stored in a cache, redis, ElastiCache or similar. The the metrics could also be stored in the same cache. Then you could have one scheduled function to updated the cache with new state and push metrics to the unleash-server. (And it's even kind of stupid that the SDK store a local backup to a "file on disk" in a world of immutable docker containers.)

We'll discuss this internally first and i'll come back with a proposal. For now i think it's valueable to have this "limitation" documented here.

Thanks. Always appreciated experience people invest time in finding ways to improve and adapt.

Please tell me if that works for you or when you have input on that topic.

Thanks, this has also triggered my curiosity and I will for sure start discussing this issue with my colleagues as well. They tend to give me good ideas.

moolen commented 5 years ago

it will be around for some time as long as the function was "hot"?

On a cold start, the nodejs process boots and loads some AWS wrapper for messaging, metrics etc. which then in turn loads your consumer-code.

Once the consumer application has done its work the nodejs process is frozen. Literally no CPU instructions are being executed. On the next invocation the process gets woken up and the nodejs runtime uses CPU cycles as usual until the consumer app is done. Here is no overhead of loading either the AWS wrapper nor your application code.

[...] the metrics from the last $metricsInterval interval are not sent

We understand metrics as volatile telemetry data. They have a supporting character and no business decisions should ever be based on metrics. However, having a robust telemetry delivery, processing and alerting pipeline is important. Here in this case there is a chance to lose all metrics from the last $metricsInterval before a lambda is being recycled.

For now, we'll set a low $metricsInterval to minimize the potential loss. A sustainable and scalable solution would IMO involve (both) a intermediary aggregating relay proxy (1) similar to launchdarkly's ld-relay and (2) a programmable metrics and repository object.

Using (1) we could:

pre-aggregate and forward outgoing metrics to the api server
cache reads against the api server - especially usful when using a multi-process shared-nothing architecture like PHP or python/uwsgi

Having (2) would:

enable us to re-use our existing metrics infrastructure for lambdas by injecting a custom metrics object into unleash. It would also give others the flexibility to use different metric protocols (statsd/prometheus pushgateway etc.)
enable us to re-use our existing caching infrastructure to minimize the number of api server calls

TBH, the invest into (1) a relay is pretty darn high for both initial development and maintenance. It really helps at large scale. But i think our scale is not that big yet to really make use of that. However, going for (2) is a low-hanging fruit that solves our immediate needs. Let's keep it LEAN.

I made a PoC to get a feeling of how the repository/metrics option could look like.

What do you think?

ivarconr commented 5 years ago

However, going for (2) is a low-hanging fruit that solves our immediate needs. Let's keep it LEAN.

Yes please!

I made a PoC to get a feeling of how the repository/metrics option could look like.

Cool, I think it makes sense to be able to swap out the repository.

Side note: The Java SDK has always allowed you to inject your own toggle repository.

ivarconr commented 5 years ago

Once the consumer application has done its work the nodejs process is frozen

ah, thanks for this detail. This means it would be impossible to keep some kind of state up to date outside the client requests.

moolen commented 5 years ago

Thanks for your Feedback! PR #117 is out. I'll close this issue then.

ivarconr commented 5 years ago

@moolen How is your hands on experience with this approach? Have you learned something useful?

I have been thinking a bit about the caching strategy. It might be better to add in-memory caching in the unleash-service in /api/client/features than to introduce a dedicated cache service, such as redis, in between unleash and the client.

This will however not solve the metrics needs.

moolen commented 5 years ago

so far, everything is running smoothly! I've noticed that every single request against /api/client/features results in a DB query. Sure, adding a cache here is absolutely a viable option to increase the throughput of the api server. But you know..

There are only two hard things in Computer Science: cache invalidation and naming things. - Phil Karlton

:smiley:

ivarconr commented 5 years ago

Thanks for sharing. Out of curiosity, what kind workloads (req/s, number of active function) are you running?

moolen commented 5 years ago

It depends on how you categorize and measure it. On the feature-toggle service itself i saw ~45 req/s at peak times without having issues. But usually there are around ~2 req/s.

We have about 70+ lambdas, 40 ECS services per environment (prod/staging etc.), but not all do implement feature toggles. We do have a feature-toggle per environment

upputa1 commented 4 years ago

@ivarconr I have read the documentation and this PR again and again, but not able to comprehend how injecting own repository can be used for caching especially for serverless use case. Can you please elaborate? Thanks in advance!

moolen commented 4 years ago

Hey @upputa1, the Repository contains the state of available feature toggles. We implemented the cache like this: Constructor: read cache file in /tmp/cache.json. If it exists populate this.data. If it doesn't make a call to the api server and save the response to the cache file.

Im on vacation right now, i think it makes sense to add a usable repository example to the docs. Ill add it once im back (or feel free to contribute :) )

upputa1 commented 4 years ago

Appreciate your quick response. I am still not getting the value of this cache as opposed to getting defaults for the flags from environment vars or just plain defaults.

ricardovf commented 4 years ago

Hi! Im exploring this great project. We use AWS Lambda and Redis. Instead of caching to /tmp/cache.json can i cache this to redis?

upputa1 commented 4 years ago

If you are using Redis already, you can query redis in your lambda, but have unleash webooks refresh your redis cache. That way you dont experience latency at all.

ricardovf commented 4 years ago

@upputa1 that seems great, can you point me to an example or documentation about the webooks? Did not found on a quick search on the docs. Thanks!

Unleash / unleash-client-node

Lambda use-case #116