aws-powertools / powertools-lambda-python

A developer toolkit to implement Serverless best practices and increase developer velocity.
https://docs.powertools.aws.dev/lambda/python/latest/
MIT No Attribution
2.8k stars 388 forks source link

Feature request: Buffer debug logs and only emit on exception #4432

Open hiselitelordship opened 3 months ago

hiselitelordship commented 3 months ago

Use case

CloudWatch Logs are vital, but expensive. To control costs you switch log level from DEBUG to INFO. When an exception happens you have very little context to debug what happened. You swtich back to DEBUG level and hope(!) the error happens again.

Solution/User Experience

By configuring a buffer, debug logs would not be written to CloudWatch Logs unless an exception is encountered. In this way, CWL is low cost and minimalist. But, like a stack trace, when an exception is encountered the buffer is flushed to CWL and all the context of the preceding events is captured.

The buffer should not create memory pressure and become the cause of exceptions. Using something like a ring buffer oldest entries will be lost in favour of the freshest. There is likely a threshold in which a flush to CWL could take too long as an execution approaches the 900s exection threshold. Therefore the memory allocation or number of buffer lines could be configurable.

Alternative solutions

No response

Acknowledgment

boring-cyborg[bot] commented 3 months ago

Thanks for opening your first issue here! We'll come back to you as soon as we can. In the meantime, check out the #python channel on our Powertools for AWS Lambda Discord: Invite link

heitorlessa commented 3 months ago

(haven't forgotten, adjusting to jet lag from ANZ as I'm back today) - this is a great idea and we were waiting from customer demand to consider it.

I'll share thoughts and questions tomorrow

heitorlessa commented 3 months ago

unless an exception is encountered

This is tricky part. As of now, I can think of these scenarios:

Did I miss anything?

I'd love to hear your thoughts on both.

hiselitelordship commented 3 months ago

I imagine the new behaviour would be Logger(debug_on_exception=True). Explicit and backward compatible. Existing code would therefore not buffer and every debug line is published as expected. No surprises for anyone already using Powertools.

If I set this new option, I expect the logger.exception to flush out the buffer like a stack trace – that is why I set the option. If I didn’t set the option above, then I don’t want this buffering feature and there will be no stack trace or buffering.

But your point remains. If I want to selectively stop publishing the “stack trace” I can’t.

heitorlessa commented 3 months ago

I expect the logger.exception to flush out the buffer like a stack trace Explicit and backward compatible

Yeah, that's the direction we'd go given our tenets. On Monday (17th), I'll summarize actionable points and tests we've made while we wait for customers +1.

NOTE: we'll implement this across all languages once we have sufficient demand.

xkortex commented 2 months ago

First, I love the idea of this feature, it's definitely a "have your observability cake and eat it too" moment.

Logger(debug_on_exception=True)

+1, that seems like the best way to implement to start. An env var which also sets it would be additionally welcome, as then users could click-ops turn on debug tracing in response to an unforeseen issue without a code deploy.

If customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.

This seems to me like the least surprising behavior, given a function wrapped with @logger.inject_lambda_context. That seems the most logical hardpoint to catch broad exceptions.

As for in-line vs separate thread, whew, that's a challenging one. I definitely see the appeal of the lambda returning as fast as possible on error, but you really don't want your error handling code to itself get swallowed by the void, so you'd want to especially make sure the thread kicked off has a very high degree of success at emitting the error. As long as extension threads are able to keep the lambda hot long enough to complete the handler, I'd be concerned about the lambda being scaled-in mid-process. But I would think the error packaging would be fast enough that it'd be unlikely to drop (I'm just not super knowledgeable on lambda innards to know what keeps one alive).

Perhaps you could define a handler interface with a default stderr impl and let users inject their own function to run such that they could override the error rollup behavior, use threads or not, customize the debug sink (maybe they want to dump to S3 or datadog or whatever).

heitorlessa commented 2 months ago

@sthulb when you can, would you mind dropping the memory test you did for verbose log records if kept in a circular buffer for this feature?

We're handling some critical operational bits this week hence my extended delay in responding it. We're still committed to do this feature and plan to deprecate the log sampling (as this is a much better alternative)