aws-powertools / powertools-lambda-python

A developer toolkit to implement Serverless best practices and increase developer velocity.
https://docs.powertools.aws.dev/lambda/python/latest/
MIT No Attribution
Add easy support for Datadog and possibly other observability solutions #1433

Open petarlishov opened 2 years ago

petarlishov commented 2 years ago

Use case

I have played around with the Datadog and the AWS Powertools Lambda layers and as one that needs to integrate with Datadog, the Datadog Lambda layer is a good choice for getting that integration set up flawlessly.

But I also really enjoy some of the features that AWS Powertools have incorporated into their Lambda layer and as a developer I find it to be a very useful tool. Last time I checked (a while ago), the AWS Powertools layer was more lightweight as well.

Sadly, because both layers use similar packages (boto3 for example among other things), I believe they are not exactly compatible with each other. And either way, adding both would make our Lambdas' start times much worse as both layers are not exactly light either.

I have one collection of Lambdas using the Datadog layer and another collection using Powertools. I have thus noticed some of the differences that make Powertools tricky to integrate with Datadog, despite it being the more useful tool purely from a developer's perspective.

Solution/User Experience

Provide a configuration or an option to define the format when setting up loggers, metrics and traces, which would allow for better integration with observability solutions other than AWS' CloudWatch and X-Ray.

Alternative solutions

It may be possible to do all these already. Maybe a separate package that adds this compatibility can be created that works well with Powertools, as long as Powertools already has easy methods to manipulate its behaviour in the required ways to allow for the requested integration.

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! We'll come back to you as soon as we can.

heitorlessa commented 2 years ago

It is so rare to receive such a high quality feature request like this that I want us to take time to reply to you accordingly - please bear with us for an answer next week.

Until then, we're laying the groundwork in our E2E and integration test framework to give us confidence to offer what you're asking -- either native support, or exposing the mechanisms we already have so customers can build it themselves.

For anyone else reading this, please please add your +1 to the author to help us prioritise it.

Thank you for taking the time to share such rich detail.

heitorlessa commented 2 years ago

Hey @petarlishov, our new (internal) E2E framework took a lot longer than I expected to refactor, so I'm replying tomorrow morning to address your questions and a few asks as we started v2 in parallel.

heitorlessa commented 2 years ago

I'll break down my answer in categories to make it easier to parse later.

Making Lambda Powertools more lightweight

Sadly, because both layers use similar packages (boto3 for example among other things) And either way, adding both would make our Lambdas' start times much worse as both layers are not exactly light either.

I bet you'll be excited to hear that we've started working on v2 (minor breaking changes) to cut down the final package size to ~464K (compressed) :tada:

In v2, we are making all dependencies optional (e.g. boto3, AWS X-Ray SDK and fastjsonschema), bringing a >90% reduction. Powertools' feature set today works with the older boto3 version that Lambda provides at runtime (~7-8 months old), and if a feature requires a newer boto3 our docs will suggest how to upgrade it accordingly.
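One common way to make a dependency optional is to import it lazily, only inside the feature that needs it. The sketch below is an illustration of that general pattern, not how Powertools v2 implements it; `require_optional` and the `my-toolkit` package name are hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    `my-toolkit` is a hypothetical package name used only for the hint.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature; "
            f"install it with: pip install 'my-toolkit[{extra}]'"
        ) from exc


def validate(payload: dict, schema: dict) -> None:
    # Imported lazily, so users who never validate do not need the package.
    fastjsonschema = require_optional("fastjsonschema", extra="validation")
    fastjsonschema.validate(schema, payload)
```

With this shape, importing the toolkit costs nothing extra at cold start, and the missing package only surfaces when `validate` is actually called.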

@rubenfonseca is leading V2. We'll create an RFC to discuss trade-offs of relying on Lambda runtime's packages, and how we're thinking of our Lambda Layer v2.

NOTE: Our stream of consciousness for Lambda Layer is to likely have boto3 removed while adding all optional dependencies to ease distribution for consumers - botocore alone is ~67M today (uncompressed).

Modularization is our medium-term game

This is an intermediate stage towards modularization in V3. This would need a major structural change but would allow customers to pick and choose what they need, going as far as a ~16K package size if one wants to. That however needs a ton of research and testing to make sure it's stable and maintainable - we plan to draft an RFC next year once we're comfortable with v2 outcomes.

Despite being a major version, we want to prevent disruptions as much as possible to our customers. We're working on our first upgrade guide, and for v3 we even have the ambition to create a linter plugin to help you upgrade faster.

Long-term, this will give us the structure we need to add support to non-Lambda runtimes like Fargate, Glue jobs, etc. We have some customers using it that way, but we haven't put a "it's supported" stamp on it yet. We could even expose our private integ/e2e testing utilities as a package for customers ;)

Datadog Log format

It would be nice to be able to specify a Datadog log format that we can easily use within Powertools without having to create a custom one ourselves. I believe compatibility can be easily achieved here and all we need is a special logger format that mimics the one used by Datadog

I'm not sure if you've tried, but Logger supports Bringing Your Own Logging Formatter without forgoing Logger features and UX. We recommend that option for customers looking to only change the final format without having to maintain a different Logger implementation altogether.
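To make the idea of "only changing the final format" concrete, here is a minimal sketch using the standard library's `logging.Formatter` rather than Powertools' own formatter hook. The class name and the key mapping (for example `level` to `status`) are assumptions for illustration, not a confirmed Datadog or Powertools contract.

```python
import json
import logging


class RenamingJsonFormatter(logging.Formatter):
    """Emit one JSON object per record, renaming keys for a target backend."""

    def __init__(self, key_map: dict[str, str]) -> None:
        super().__init__()
        self.key_map = key_map  # hypothetical mapping, e.g. {"level": "status"}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Keys without a mapping keep their original name.
        renamed = {self.key_map.get(key, key): value for key, value in entry.items()}
        return json.dumps(renamed)


def build_logger(name: str, formatter: logging.Formatter) -> logging.Logger:
    """Attach the formatter to a stream handler; call sites stay unchanged."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger("payments", RenamingJsonFormatter({"level": "status"})).info("charged")` would keep the usual `logger.info(...)` call sites while changing only the emitted keys.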

I'm not fully aware of what Datadog expects from structured logs, or why it has difficulty querying a JSON field. That said, we're more than happy to investigate any non-breaking change we could do on our side if this spans more than Datadog - feature request please!

In V3, we'll be able to create a providers package where we could use community help to get these quirks addressed, without forcing everyone to receive a copy of a provider (e.g., Datadog) they don't use.

Datadog Metrics format

I believe the easiest way to do a custom metric that gets interpreted by Datadog is to send logs in this format {"m": "Metric name", "v": "Metric value", "e": "Unix timestamp (seconds)", "t": "Array of tags"}
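The log-based metric format quoted above can be sketched as a small helper. The field names `m`, `v`, `e` and `t` come from the quote; the function name, the tag strings and the choice to print to stdout are assumptions.

```python
import json
import time
from typing import Optional, Sequence


def datadog_metric_line(
    name: str,
    value: float,
    tags: Sequence[str] = (),
    timestamp: Optional[int] = None,
) -> str:
    """Serialize one metric as a JSON log line in the m/v/e/t shape."""
    payload = {
        "m": name,
        "v": value,
        # Unix timestamp in seconds; defaults to "now" when not provided.
        "e": int(time.time()) if timestamp is None else timestamp,
        "t": list(tags),
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Printing to stdout is how a Lambda function would get this into its logs.
    print(datadog_metric_line("orders.created", 1, tags=["env:dev"], timestamp=1700000000))
    # → {"m": "orders.created", "v": 1, "e": 1700000000, "t": ["env:dev"]}
```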

We could make this more extensible quite easily - wanna create an RFC?

An RFC will help us agree on a contract for exposing a Metric Provider (a simple sink pattern), so that customers can use the same UX but have different outputs and validation mechanisms. As of now, we haven't invested much in this, and this part of the code could be easily rearranged until we have a proper Provider - https://github.com/awslabs/aws-lambda-powertools-python/blob/develop/aws_lambda_powertools/metrics/base.py#L139
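As a rough illustration of what a sink-style provider contract could look like, here is a minimal sketch. Every name in it (`MetricsProvider`, `JsonLinesProvider`, `Metrics`) is hypothetical and not taken from the Powertools codebase; the actual contract is what the RFC would decide.

```python
import json
from typing import Protocol


class MetricsProvider(Protocol):
    """Hypothetical contract: each provider owns validation and output."""

    def add_metric(self, name: str, value: float, **tags: str) -> None: ...

    def flush(self) -> list[str]: ...


class JsonLinesProvider:
    """Buffers metrics and serializes each one as a JSON line on flush."""

    def __init__(self) -> None:
        self._buffer: list[dict] = []

    def add_metric(self, name: str, value: float, **tags: str) -> None:
        if not name:
            raise ValueError("metric name must not be empty")
        self._buffer.append({"name": name, "value": value, "tags": tags})

    def flush(self) -> list[str]:
        lines = [json.dumps(metric, sort_keys=True) for metric in self._buffer]
        self._buffer.clear()
        return lines


class Metrics:
    """Facade: handler code talks to this, never to a concrete provider."""

    def __init__(self, provider: MetricsProvider) -> None:
        self._provider = provider

    def add_metric(self, name: str, value: float, **tags: str) -> None:
        self._provider.add_metric(name, value, **tags)

    def flush(self) -> list[str]:
        return self._provider.flush()
```

Swapping the output then means constructing `Metrics` with a different provider, while handler code keeps calling `add_metric` and `flush` unchanged.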

Datadog Trace compatibility

I have not tested it sufficiently, but if there is some incompatibility with the way Datadog handles traces, maybe someone else that has that knowledge can keep in mind that some adaptations would be useful there too?

Tracing has an undocumented BaseProvider for that intent, but we haven't been able to put more thought into it. Now that we're making X-Ray SDK optional in v2, this becomes a more interesting conversation to have; otherwise we'd be forcing customers to install the X-Ray SDK lib even when they use a Datadog provider.

The hardest part in Tracing is patching modules and nomenclatures (e.g., segment/span, patching only X but not Y lib) --- wanna create an RFC for what minimum feature set the BaseProvider should support?
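To give the "minimum feature set" question something tangible to react to, here is one possible sketch of a provider interface with an in-memory implementation. The method names (`span`, `put_annotation`, `patch`) are hypothetical and deliberately not copied from Powertools' undocumented BaseProvider.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Any, ContextManager, Iterator, Sequence


class TracingProvider(ABC):
    """Hypothetical minimum contract a tracing backend would implement."""

    @abstractmethod
    def span(self, name: str) -> ContextManager[dict]:
        """Open a unit of work (a 'subsegment' in X-Ray terms)."""

    @abstractmethod
    def put_annotation(self, key: str, value: Any) -> None:
        """Attach searchable key/value data to the active span."""

    @abstractmethod
    def patch(self, modules: Sequence[str]) -> None:
        """Instrument only the named libraries."""


class InMemoryProvider(TracingProvider):
    """Records spans in memory; useful for tests, sends nothing anywhere."""

    def __init__(self) -> None:
        self.finished: list[dict] = []
        self.patched: list[str] = []
        self._stack: list[dict] = []

    @contextmanager
    def span(self, name: str) -> Iterator[dict]:
        current = {"name": name, "annotations": {}}
        self._stack.append(current)
        try:
            yield current
        finally:
            # Spans are recorded in the order they finish (innermost first).
            self._stack.pop()
            self.finished.append(current)

    def put_annotation(self, key: str, value: Any) -> None:
        if not self._stack:
            raise RuntimeError("no active span")
        self._stack[-1]["annotations"][key] = value

    def patch(self, modules: Sequence[str]) -> None:
        # A real provider would wrap these libraries; here we only record them.
        self.patched.extend(modules)
```

Even this small surface shows where providers diverge: what a span is called, and what patching a library actually involves.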

I initially wanted to have a drop-in replacement Tracing Provider, but after digging into 3-4 tracing providers' libs I saw how much custom logic each provider has, and I became less sure of it - an RFC can help us get there ;)

We also looked into Open Telemetry but the cold start was too significant, and it was a moving target in terms of changes too. I think exposing our BaseProvider is a good first step.

Overall

We're going in that direction, but we'd love help from the community in defining a good contract for Providers (Tracing already has one).

Right now, our main focus is on operational excellence (E2E test) to ensure V2 can be smooth sailing, and pave the road for our future modularization story. We'll continue to respond to feature requests and greatly appreciate any help we can get - we can't wait to create new utilities and new extensibility mechanisms, but first we need confidence large changes can be made ;)

Once again, thank you for creating such a comprehensive issue. These make me personally happy that we have a lot to do, but also emphasize that we've got a community who cares ;)

Hopefully that answers your questions and remarks, please let us know otherwise!

PS: Join us on Discord, we'd love to have you if you aren't there already.

cc @rubenfonseca @leandrodamascena @mploski @am29d @saragerion @sliedig

heitorlessa commented 1 year ago

Update: We estimate one RFC per feature (Tracer, Metrics, Logger) starting in mid-April/early May. This will help focus the discussion on a standard interface to help customers bring their own provider.

Since we launched V2, the only difference for Tracer is that we'd stick with AWS X-Ray SDK provider as the default, while providing a built-in provider for OpenTelemetry - other 3rd party providers (e.g., Datadog, Lumigo, NewRelic, etc.) would be owned by them where we'd be happy to collaborate/coordinate.

Thank you all!!

leandrodamascena commented 1 year ago

I'm removing the help wanted and need-customer-feedback labels because we already have work in progress, so anyone can comment on these RFCs.

heitorlessa commented 1 year ago

UPDATE: We're adding support in Logger for Datadog as a start. We're working on a POC for Metrics, and adding last refinements for Tracer Providers.

For Logger, Datadog was the only provider that required a custom timestamp, so we've added a Formatter and documented our recommendation to use Lambda Extensions to avoid impacting performance.
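As an illustration of what a "custom timestamp" formatter involves, here is a minimal standard-library sketch. It assumes the target wants an RFC 3339 style UTC timestamp, which this thread does not state, and the class name is hypothetical. It is not the Formatter that was added to Powertools.

```python
import json
import logging
from datetime import datetime, timezone


class Rfc3339JsonFormatter(logging.Formatter):
    """JSON formatter whose only special behavior is the timestamp format."""

    def format(self, record: logging.LogRecord) -> str:
        # record.created is seconds since the epoch; render it as UTC.
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps(
            {
                "timestamp": created.isoformat(timespec="milliseconds"),
                "level": record.levelname,
                "message": record.getMessage(),
            }
        )
```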

The only reason we're not adding OTel Log output now is because it's not Final yet - please feel free to open a feature request when that happens (whoever is reading and needs that).

heitorlessa commented 1 year ago

cc @leandrodamascena for an update on Metrics Providers

leandrodamascena commented 1 year ago

Hey everyone! Adding some new information here to keep everyone up to date on our work.

We are working to merge the Metrics Providers as soon as possible. Our current plan is to merge this PR by the end of July. We're taking the opportunity to address some tech debt in the Metrics utility and provide a great experience for customers who want to use a provider other than Amazon CloudWatch EMF.

I'll update this issue by mid-next week if anything changes.

leandrodamascena commented 1 year ago

We are about to merge PR https://github.com/aws-powertools/powertools-lambda-python/pull/2194 which will allow us to add external providers in the metrics utility. Our initial plan was to add Datadog to this PR, but we are waiting for Datadog to release a new version of their extension and fix an issue with a dependency: https://github.com/DataDog/datadog-lambda-python/issues/350.

Support for Datadog will likely be delayed for a few days, but we are working hard to add this support as soon as possible.

github-actions[bot] commented 1 year ago

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

heitorlessa commented 1 year ago

Quick update: Datadog made the bugfix release we needed, we're tentatively looking to launch Datadog Metrics Provider by the end of next week ;)

Then we move on to adding Observability Provider to Tracer (rinse and repeat).

leandrodamascena commented 1 year ago

Hey everyone! We are very happy to announce that we have merged PR (#2906) to add support for the Datadog Metrics provider and create a way for adding more observability providers. We are planning a release for this week.

For now we are going to work on adding external observability providers to Tracer. As soon as we have news I'll update this issue.

Thank you to everyone who is working hard to make Powertools for AWS Lambda (Python) even better.

heitorlessa commented 8 months ago

Quick update -- @roger-zhangg is working on the last feature: Observability Provider for Tracer. Once that's done, we'll close this issue, and start investigating an alternative solution for OTel as cold starts haven't significantly improved.