elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0
384 stars 114 forks source link

Support Partial Transaction for AWS Lambda #753

Closed estolfo closed 1 year ago

estolfo commented 1 year ago

Description

With https://github.com/elastic/apm-aws-lambda/pull/344 the AWS Lambda extension supports registering transactions at the beginning of the transaction/lambda invocation. With that, the extension can finalize the transaction in case that there is an error and the agent fails to report the final transaction. Implement the ability to send(register) a partial transaction (in sync) to the AWS Lambda extension with the header x-elastic-aws-request-id containing the lambda invocation's request ID. The extension will cache the partial transaction and return a 200 OK response. If the extension returns 404 then the agent should stop sending the registration request for future invocations.

Spec: https://github.com/elastic/apm/pull/799

Agent Issues

basepi commented 1 year ago

@lahsivjar would you mind adding some more context? I know we have the draft PR from the go agent but I'm still having a hard time parsing the fine details here.

What is a "partial transaction"? Should I compose a barebones transaction dict, with only the id, timestamp, trace_id, and name? Or send everything I have at transaction start? We'll have most of the request context, but obviously will be missing span counts, outcome, response context, etc.

I assume we're sending this to the normal intake/v2/events endpoint? I know I need to add the x-elastic-aws-request-id header, is that the only change from a "normal" transaction json object?

I assume I send the metadata blob first as usual? Currently any time we flush our events buffer, the metadata gets sent first. Since the python agent is the only agent that doesn't stream to the APM server this might be an important point for only python -- I don't know how the APM Server would feel about two metadata blobs in a row, assuming you forward the metadata blob but not the partial transaction. It would certainly be easier for my implementation if the extension didn't complain about the second metadata blob when I report the full transaction at the end, but if needed I can modify the transport to exclude it in this specific case.

Thanks in advance.

lahsivjar commented 1 year ago

@basepi The partial transaction flow works like below (there is a pending PR that implements some part of the following based on the received feedback regarding handling metadata):

  1. At the start of the invocation, the agent will create a partial transaction. Partial transaction is nothing but a JSON encoding of all the transaction data that can be provided at this instance of time (containing transaction ID, FAAS fields, traceparent, tracestate, and other fields as usual). Along with the partial transaction, the agent is encouraged to send the metadata as the first line in the request body. Note that, if metadata is sent then it is used by the extension for all future calls made to APM-server and it cannot be updated.
  2. Next, the agent will send(register) the optional_metadata + partial transaction (in sync) to the extension with the header x-elastic-aws-request-id containing the lambda invocation's request ID. The endpoint for this is /register/transaction and it accepts Content-Type as application/vnd.elastic.apm.transaction+ndjson. The extension will cache the optional metadata + partial transaction and return a 200 OK response. If the extension returns 404 then the agent should stop sending the registration request for future invocations.
  3. After registration, agent will continue as normal. The extension will receive the transactions, spans, etc via intake v2 protocol and inspect the request payload to check if the request contains the transaction with the ID registered by the agent.
  4. If the extension finds a transaction then it releases its cache, otherwise, on shutdown or on receiving platform.runtimeDone event for the invocation the extension will create a proxy transaction on the behalf of the agent and send it to APM-Server.
Old (out of date) questions

Based on above, I will try to answer your questions. > What is a "partial transaction"? Should I compose a barebones transaction dict, with only the id, timestamp, trace_id, and name? Or send everything I have at transaction start? We'll have most of the request context, but obviously will be missing span counts, outcome, response context, etc. I think we should send everything that we can glean at the transaction start start of the invocation. > I assume we're sending this to the normal intake/v2/events endpoint? I know I need to add the x-elastic-aws-request-id header, is that the only change from a "normal" transaction json object? The partial transaction should be sent to the `/register/transaction` endpoint with the `x-elastic-aws-request-id` header. The body will be encoded in the same way we encode the transaction JSON object while sending it to the intake endpoint. > I assume I send the metadata blob first as usual? When calling the register endpoint only the encoded partial transaction is expected by the `/register/transaction` endpoint. This means it shouldn't have the metadata blob.

basepi commented 1 year ago

@lahsivjar Can you update the above with your changes in https://github.com/elastic/apm-aws-lambda/pull/384 ?

Also we should probably update the spec at some point with all of this stuff.

lahsivjar commented 1 year ago

@basepi I have updated the above message with the new details. Once the PR is finalized I will take a stab at updating the spec.

JonasKunz commented 1 year ago

If the extension returns 404 then the agent should stop sending the registration request for future invocations.

I think we should extend that to if the extension returns 4xx. I implemented the approach explained here and this yields a 415 from the current release of the extension due to the changed Content-Type header.

trentm commented 1 year ago

I think we should extend that to if the extension returns 4xx.

Sounds good to me. In my implementation if it returns anything other than the expected 200, then I log a message and stop using the endpoint. Do you think that is overkill?

basepi commented 1 year ago

In my implementation if it returns anything other than the expected 200, then I log a message and stop using the endpoint. Do you think that is overkill?

Probably fine. Technically an APM server can return 429 for rate-limited. But (1) I don't think the extension does this and (2) in a back-off scenario it probably makes sense to just stop registering partial transactions anyway.

I'll update my implementation to be more broad, right now I'm only doing the explicit 404.

lahsivjar commented 1 year ago

In my implementation if it returns anything other than the expected 200, then I log a message and stop using the endpoint. Do you think that is overkill?

The initial intention was to ensure backward compatibility of the extension with the agents but by looking at the bigger picture, taking a broader approach to stop registration on any error sounds good.

Thanks for raising this @JonasKunz. I will also release the new version of the extension with the latest changes this week.

lahsivjar commented 1 year ago

Lambda extension v1.4.0 has been released with the support for metadata in the register transaction call.

basepi commented 1 year ago

@lahsivjar do you have the bandwidth to do the spec PR for this? If not I can do it.

lahsivjar commented 1 year ago

@basepi Thanks for the offer, I am a bit swamped right now and will appreciate the help 🙇

basepi commented 1 year ago

Spec is in https://github.com/elastic/apm/pull/799