Lambda: Delay Metadata Fetching/Populating Until First Function Invocation

astorm commented 2 years ago

Some metadata fields are not available at startup (e.g. invokedFunctionArn, which is needed for service.id and cloud.region). Therefore, retrieval of metadata fields in a lambda context needs to be delayed until the first execution of the lambda function, so that information provided in the context object can be used to set metadata fields properly.

This metadata includes both cloud metadata fields and standard metadata values. Both elastic/apm-agent-nodejs and elastic/apm-nodejs-http-client presume this metadata is set once during agent startup and never again. We'll need to take steps to ensure that this metadata isn't set until the first function invocation, and set up some sort of system/code to get data from the context argument of the lambda handler into the encoded metadata in the client.
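To illustrate why the context object is needed: a Lambda function ARN has the form arn:aws:lambda:<region>:<account-id>:function:<name>, so the region and account ID can be read out of invokedFunctionArn. The following is only a sketch; the helper name cloudFieldsFromContext and the shape of its return value are made up for illustration and are not taken from the agent.

```javascript
// Hypothetical helper (not from the agent): derive metadata fields from the
// Lambda context object passed to the handler on invocation.
function cloudFieldsFromContext (context) {
  // ARN layout: arn:aws:lambda:<region>:<account-id>:function:<name>
  const parts = context.invokedFunctionArn.split(':')
  return {
    serviceId: context.invokedFunctionArn,
    region: parts[3],
    accountId: parts[4]
  }
}

const fields = cloudFieldsFromContext({
  invokedFunctionArn: 'arn:aws:lambda:us-west-2:612345678904:function:trentm-play-fn1'
})
console.log(fields)
```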

trentm commented 2 years ago

tl;dr I intend to implement "Option 2" described below.

How metadata works in the current Agent and http `Client`

On process start:
- The Client top-level code uses the container-info package to gather container info.
On agent.start():
- The Agent creates a Client instance and passes in some config.
- The Client gathers static metadata from this config and some process.* vars.
- If that config includes cloudMetadataFetcher, then the Client tells it to start gathering cloud metadata, with a callback. The Client starts "corked" (i.e. won't start an intake request) until that callback, in case transaction/span/metrics data comes in before cloud metadata is ready.
On cloud metadata fetch callback:
- The Client merges cloud metadata with the earlier static metadata, sets this._encodedMetadata to be sent for intake requests, and uncorks (to allow intake requests).
On agent.setFramework(), typically called on import of a web-framework package, i.e. typically before any data is sent to APM server:
- The Client updates this._encodedMetadata.
On agent.addMetadataFilter(), typically would be called by a user before any tracing data is sent:
- The Client updates this._encodedMetadata.

In a lambda environment there is special handling in the cloudMetadataFetcher that immediately returns a subset of the required Lambda metadata. It is only a subset because some fields are derived from data passed to the first Lambda function invocation.
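A rough sketch of the corking flow described above, for readers who have not looked at the Client code. This is a toy model, not the real Client: the class name CorkedClient, the in-memory queue, and the sent array (standing in for intake requests) are all assumptions, and the getCloudMetadata(callback) shape of the fetcher is assumed for the purpose of the example.

```javascript
// Toy model of "start corked until cloud metadata arrives".
class CorkedClient {
  constructor ({ metadata, cloudMetadataFetcher }) {
    this._metadata = { ...metadata } // static metadata from config
    this._encodedMetadata = null
    this._queue = [] // events held back while corked
    this.sent = [] // stands in for intake requests to APM server
    this._corked = Boolean(cloudMetadataFetcher)
    if (cloudMetadataFetcher) {
      cloudMetadataFetcher.getCloudMetadata((err, cloud) => {
        if (!err && cloud) this._metadata.cloud = cloud
        this._encodeMetadata()
        this._uncork()
      })
    } else {
      this._encodeMetadata()
    }
  }

  _encodeMetadata () {
    this._encodedMetadata = JSON.stringify({ metadata: this._metadata })
  }

  _uncork () {
    this._corked = false
    for (const event of this._queue.splice(0)) this._send(event)
  }

  _send (event) {
    this.sent.push({ metadata: this._encodedMetadata, event })
  }

  sendTransaction (transaction) {
    if (this._corked) this._queue.push(transaction)
    else this._send(transaction)
  }
}

// Usage: a fake fetcher that lets us decide when the callback fires.
let deliverCloudMetadata
const client = new CorkedClient({
  metadata: { service: { name: 'demo' } },
  cloudMetadataFetcher: { getCloudMetadata (cb) { deliverCloudMetadata = cb } }
})
client.sendTransaction({ name: 'early' })
const sentWhileCorked = client.sent.length
deliverCloudMetadata(null, { provider: 'aws' })
const sentAfterUncork = client.sent.length
```

The early transaction is held until the fetch callback runs, so it goes out with the merged metadata.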

Option 1: the simplest thing

The simplest addition I see is to add the following to the above "How metadata works":

On first apm.lambda() (this is the wrapper around the Lambda function handler):
- Derive the extra metadata fields from the context object and call a new <Client>.setExtraMetadata(fields) method.
- The Client updates this._encodedMetadata with these fields.
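The first-invocation step might look roughly like the following. This is a sketch under assumptions: wrapLambdaHandler stands in for apm.lambda(), the client is a stub that only records calls, and the exact field layout passed to setExtraMetadata() is illustrative.

```javascript
// Sketch of a handler wrapper that sets the extra metadata once, on the
// first invocation. `wrapLambdaHandler` is a hypothetical stand-in for
// apm.lambda(); `client.setExtraMetadata()` is the proposed new method.
function wrapLambdaHandler (client, handler) {
  let isFirstInvocation = true
  return function wrappedHandler (event, context) {
    if (isFirstInvocation) {
      isFirstInvocation = false
      const arnParts = context.invokedFunctionArn.split(':')
      client.setExtraMetadata({
        service: { id: context.invokedFunctionArn },
        cloud: { region: arnParts[3], account: { id: arnParts[4] } }
      })
    }
    return handler(event, context)
  }
}

// Usage with a stub client that records what it was given.
const calls = []
const stubClient = { setExtraMetadata (fields) { calls.push(fields) } }
const handler = wrapLambdaHandler(stubClient, (event) => ({ statusCode: 200, body: event.msg }))
const ctx = { invokedFunctionArn: 'arn:aws:lambda:us-west-2:612345678904:function:trentm-play-fn1' }
const first = handler({ msg: 'a' }, ctx)
const second = handler({ msg: 'b' }, ctx)
```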

Possible issues with this: In all expected usage of a Lambda function, no transactions/spans/metricsets will be sent to the Client before apm.lambda() starts and can call setExtraMetadata() first. However, it is possible for this assumption to be broken:
- If the top-level code in the JS file with the function manually starts/ends a transaction.
- If eventually we have agent-created metrics in Lambda and the initial metricset comes before the Lambda handler is called.

So if we want to handle those odd cases, then we want some similar kind of corking as with cloudMetadataFetcher above. One way:

- Add a boolean option to the Client to tell it to start corked, because it should expect a setExtraMetadata() call before sending data to APM server. (Call it expectExtraMetadata or whatever.)
- Client#setExtraMetadata() calls this._maybeUncork().

Currently the internal coordination doesn't know how to wait for both setExtraMetadata() and the callback from cloudMetadataFetcher. It could be made to know how, but instead ...

Option 2: no cloudMetadataFetcher for lambda

With option 1 we would be splitting the metadata gathering for lambda in two places: (a) the static Lambda metadata in CloudMetadataFetcher and (b) the metadata that needs the invocation context in apm.lambda(). Let's move it all to the latter.

- In a lambda env the Agent does not pass a cloudMetadataFetcher to the Client. (The lambda-related code in CloudMetadataFetcher can go away.)
- Instead the Agent sets the expectExtraMetadata=true option, so the Client starts corked.
- In apm.lambda() all the Lambda metadata is gathered on cold start and passed to client.setExtraMetadata().
- It is an error to create a Client with both expectExtraMetadata=true and cloudMetadataFetcher. This solves the issue above of the Client internal coordination not knowing how to wait for both. The Client now only starts "corked" if either of those two options is set.
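A minimal sketch of how the Client side of Option 2 could behave. The option names expectExtraMetadata and cloudMetadataFetcher and the methods setExtraMetadata() and _maybeUncork() come from the description above; everything else (the class name LambdaAwareClient, the queue, the sent array, the shallow merge, the error message) is an assumption made for illustration.

```javascript
// Toy model of the Option 2 rules: the two options are mutually exclusive,
// and the client starts corked if either one is set.
class LambdaAwareClient {
  constructor (opts = {}) {
    if (opts.expectExtraMetadata && opts.cloudMetadataFetcher) {
      throw new Error('expectExtraMetadata and cloudMetadataFetcher cannot both be set')
    }
    this._metadata = { ...opts.metadata }
    this._encodedMetadata = null
    this._queue = [] // events held back while corked
    this.sent = [] // stands in for intake requests
    this._corked = Boolean(opts.expectExtraMetadata || opts.cloudMetadataFetcher)
    if (opts.cloudMetadataFetcher) {
      opts.cloudMetadataFetcher.getCloudMetadata((err, cloud) => {
        if (!err && cloud) this._metadata.cloud = cloud
        this._encodeMetadata()
        this._maybeUncork()
      })
    } else if (!opts.expectExtraMetadata) {
      this._encodeMetadata()
    }
  }

  // Called by the Lambda wrapper on cold start.
  setExtraMetadata (fields) {
    Object.assign(this._metadata, fields) // shallow merge, for brevity
    this._encodeMetadata()
    this._maybeUncork()
  }

  _encodeMetadata () {
    this._encodedMetadata = JSON.stringify({ metadata: this._metadata })
  }

  _maybeUncork () {
    if (!this._corked) return
    this._corked = false
    for (const event of this._queue.splice(0)) this._send(event)
  }

  _send (event) {
    this.sent.push({ metadata: this._encodedMetadata, event })
  }

  sendTransaction (transaction) {
    if (this._corked) this._queue.push(transaction)
    else this._send(transaction)
  }
}

// Usage: data sent from top-level code is held until the first invocation.
const client = new LambdaAwareClient({
  metadata: { service: { name: 'fn1' } },
  expectExtraMetadata: true
})
client.sendTransaction({ name: 'top-level' })
const sentBeforeInvocation = client.sent.length
client.setExtraMetadata({ cloud: { provider: 'aws', region: 'us-west-2' } })

let conflictError = null
try {
  new LambdaAwareClient({ expectExtraMetadata: true, cloudMetadataFetcher: { getCloudMetadata () {} } })
} catch (err) {
  conflictError = err
}
```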

You can stop reading here, if you like. My intent is to implement Option 2.

Option 3: overhaul metadata handling between Agent and Client

This option is only described here as a possible longer term refactoring and to show why option 2 is like it is.

- All metadata gathering is moved from the Client to the Agent repo -- where IMHO it belongs.
- The Client always starts "corked". The only proper usage of it is to create the client and then call client.setMetadata() sometime soonish, before any provided transactions/spans/metricsets will be sent to APM server.
- Client#setMetadata() encodes the metadata and saves it for subsequent intake requests, then uncorks.
- The Agent may later call Client#updateMetadata() to handle agent.setFramework() and/or agent.addMetadataFilter(). In cases where those agent methods are called before the Agent has called setMetadata(), it can avoid the duplicate calls.
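For comparison with Option 2, the Client surface under Option 3 could be as small as the sketch below. Only setMetadata() and updateMetadata() are named in the description; the class name AgentDrivenClient, the queue, the sent array, and the shallow merge in updateMetadata() are assumptions for illustration.

```javascript
// Toy model of Option 3: the client gathers nothing itself and stays corked
// until the agent hands it the complete metadata.
class AgentDrivenClient {
  constructor () {
    this._metadata = null
    this._encodedMetadata = null // null means "still corked"
    this._queue = []
    this.sent = [] // stands in for intake requests
  }

  setMetadata (metadata) {
    this._metadata = { ...metadata }
    this._encodedMetadata = JSON.stringify({ metadata: this._metadata })
    for (const event of this._queue.splice(0)) this._send(event) // uncork
  }

  updateMetadata (fields) {
    if (this._metadata === null) throw new Error('setMetadata() must be called first')
    Object.assign(this._metadata, fields) // shallow merge, for brevity
    this._encodedMetadata = JSON.stringify({ metadata: this._metadata })
  }

  _send (event) {
    this.sent.push({ metadata: this._encodedMetadata, event })
  }

  sendTransaction (transaction) {
    if (this._encodedMetadata === null) this._queue.push(transaction)
    else this._send(transaction)
  }
}

// Usage
const client = new AgentDrivenClient()
client.sendTransaction({ name: 't1' })
const sentWhileCorked = client.sent.length
client.setMetadata({ service: { name: 'fn1' } })
client.updateMetadata({ framework: { name: 'AWS Lambda' } })
client.sendTransaction({ name: 't2' })
```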

This would be a lot more code churn right now, so I think it is best left to separate future work.

trentm commented 2 years ago

An example of the metadata being sent in a Lambda with my in-progress patches:

{
  "metadata": {
    "service": {
      "name": "trentm-play-fn1",
      "environment": "development",
      "runtime": {
        "name": "AWS_Lambda_nodejs14.x",
        "version": "14.17.4"
      },
      "language": {
        "name": "javascript"
      },
      "agent": {
        "name": "nodejs",
        "version": "3.23.0"
      },
      "version": "$LATEST",
      "id": "arn:aws:lambda:us-west-2:612345678904:function:trentm-play-fn1",
      "framework": {
        "name": "AWS Lambda"
      },
      "node": {
        "configured_name": "2021/11/01/[$LATEST]e7b05091b39b4aa2aef19efe4d262e79"
      }
    },
    "process": {
      "pid": 17,
      "ppid": 1,
      "title": "/var/lang/bin/node",
      "argv": [
        "/var/lang/bin/node",
        "/var/runtime/index.js"
      ]
    },
    "system": {
      "hostname": "169.254.154.197",
      "architecture": "x64",
      "platform": "linux"
    },
    "cloud": {
      "provider": "aws",
      "region": "us-west-2",
      "service": {
        "name": "lambda"
      },
      "account": {
        "id": "612345678904"
      }
    }
  }
}
