elastic / apm-agent-java

https://www.elastic.co/guide/en/apm/agent/java/current/index.html
Apache License 2.0
567 stars 319 forks source link

OpenJDK CRaC support #2982

Open ArtyomGabeev opened 1 year ago

ArtyomGabeev commented 1 year ago

Is your feature request related to a problem?

AWS announced support of snapstart in AWS Lambda based on Java runtime. It looks like this support based on OpenJDK CRaC, with ability to specify before/after context restore hooks, reference: https://aws.amazon.com/blogs/compute/reducing-java-cold-starts-on-aws-lambda-functions-with-snapstart/ https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html#snapstart-compatibility

However if java agent attached before snapshot created, I think it will lead to incorrect results (as least sending APM data with same instance metadata).

Describe the solution you'd like

Ideally, we would like to bytebuddy instrumentation happen before snapshot is took. In afterRestore method, java agent should be able re update only necessary instance specific information

Describe alternatives you've considered

Attaching agent in afterRestore hook, however it returns back initialisation latency

Additional context

Implementing and providing CRaC restore hook may be useful not only for AWS Lambda application, but for any java application using CRaC project.

ArtyomGabeev commented 1 year ago

I completed some testing with enabled SnapStart and want to provide some information, which should help with implementation.

Setup:

I see, after invoking function, trace data in Elasticsearch. However, service.node.name field was not reported and in Kibana instance was shown as "\<unknown>".

There is a difference in environment variables when function started from SnapStart, AWS_LAMBDA_LOG_STREAM_NAME is not populated, and ServiceFactory does not fill node name in method augmentServiceForAWSLambda.

I've opened an issue on AWS java lib, let see if its expected: https://github.com/aws/aws-lambda-java-libs/issues/402

Note: If AWS_LAMBDA_LOG_STREAM_NAME won't be available for SnapStart functions, we can provide a fallback implementation based on ephemeralId, e.g.:

private static String generateNodeName(String ephemeralId) {
    DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
    LocalDate localDate = LocalDate.now();

    String version = "[" + PrivilegedActionUtils.getEnv("AWS_LAMBDA_FUNCTION_VERSION") + "]";

    return dtf.format(localDate) + "/" + version + ephemeralId.replaceAll("-", "");
  }

Moving forward with implementation, I think we need to register a crac listener to refresh tracer MetaData and ephemeralId. One thing to mention, that CRaC resource listener is not called, if listener loaded from ShadedClassLoader, and it looks like this listener should be in system classloader.

Since we support several options of attaching java agent, I do not know exact place to put the listener. Additionally, may be it makes sense to not attempt to register the listener if CRaS is not in a classpath.

For investigation purposes, assuming we are in AgentMain class, next implementation works:

public class AgentMain {

  private static final ApmResource resource = new ApmResource();

  static {
    Core.getGlobalContext().register(resource);
  }

  .....

  private static void awaitMetaData() {
    try {
      Class.forName("co.elastic.apm.agent.bci.ElasticApmAgent", true, agentClassLoader())
          .getMethod("awaitMetaData")
          .invoke(null);
    } catch (Exception ex) {
      ...
    }
  }

  private static void reloadAgent() {
    try {
      Class.forName("co.elastic.apm.agent.bci.ElasticApmAgent", true, agentClassLoader())
          .getMethod("updateMetaData")
          .invoke(null);
    } catch (Exception ex) {
      ...
    }
  }

  static class ApmResource implements Resource {

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
      SystemStandardOutputLogger.stdOutInfo("Before checkpoint hook started");

      AgentMain.awaitMetaData();

      SystemStandardOutputLogger.stdOutInfo("Before checkpoint hook finished");
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
      SystemStandardOutputLogger.stdOutInfo("After snapshot hook started");

      AgentMain.reloadAgent();

      SystemStandardOutputLogger.stdOutInfo("After snapshot hook finished");
    }
  }

}

beforeCheckpoint

Whenever SnapSpart snapshot is created, lambda instance initialized, but without request invocation. Which leads to metadata future (when snapshot is taken) never complete, due to the fact that AbstractLambdaTransactionHelper.completeMetaData is never called.

Even if this metadata is not needed after restore, I think it make sense to complete it or cancel before snapshot is taken.

afterRestore

During snapshot restore, we need to generate new ephemeralId and recreate MetadataFuture, and propagate changes down to DslJsonSerializer.

DslJsonSerializer attempts to cache serialized version of metadata, so either serializer may be recreated or internal serializer cache needs to be invalidated.

Artyom.

eyalkoren commented 1 year ago

@ArtyomGabeev Thanks a lot for this wonderful input! I am sure it will be very useful once we get to it. We do intend to consider this as a priority for our next development cycle, although we cannot provide any estimation as to if and when.

ArtyomGabeev commented 1 year ago

@eyalkoren , I've opened a PR with initial CRaC support implementation. However, I've got a reply from AWS that neither AWS_LAMBDA_LOG_STREAM_NAME, neither context.getLogStreamName gonna be populated for SnapStart enabled functions - https://github.com/aws/aws-lambda-java-libs/issues/402 .

I do not know, if Kibana product have some special handling of AWS lambda APM data and if it relies on the format of configured_name from service.node field.

In initial proposal, I suggest to simulate and generate similar to log group node name, but right now I think we do not need to tie (if possible) instance name to log group name at all.

If we need just to distinguish different lambda instances, may be pair of lambda version + ephemeral id is enough, but this is a breaking change, if we are going to change it for all AWS Lambdas.

Another option I'm considering is to introduce a property, which will be a strategy for lambda instance naming. By default, we should still try to use AWS_LAMBDA_LOG_STREAM_NAME env variable, but we may allow another one, which will use auto generated name.

Appreciate any feedback, Thanks!

eyalkoren commented 1 year ago

@ArtyomGabeev thanks for the elaborated explanation, you can extend https://github.com/elastic/apm-agent-java/pull/3036#discussion_r1118115144 to include all this info as well. Unfortunately, we cannot allocate time for this right now, even if it is only about reviewing and making such decisions. However, since we are interested in that, we will review and try to prioritize it for our next dev cycle. The fact that you proposed an implementation and all this useful feedback definitely increases the chances for it to get in, even though we cannot guarantee that.

ArtyomGabeev commented 1 year ago

I've updated PR for 1.38.0 base and can do rebase if needed.

I have couple of questions:

  1. Looks like RequestStreamHandlerInstrumentation and RequestHandlerInstrumentation does not work when lambda handler extends any other class with overloading, where handleRequest is overriden in parent, e.g:

GenericHandler (has handleRequest with signature matching the byte buddy spec)

MyHandler extends GenericHandler (has handleRequest overloaded).

Is there a way to make byte buddy work with methods defined in parent classes as well?

Usecase: StreamHandler where der/ser logic are implemented in abstract class.

  1. That's one interesting, AWS Lambda snapstart does not invoke handler method during snapshot. Event if I try to warmup all classes, there is one instrumentation which does not - RequestHandlerInstrumentation/RequestStreamHandlerInstrumentation.

It adds additional latency on startup, current workaround I'm using is registering crac listener and invoking lambda handler with simulated AWS context. I do not think there is an easy solution for this problem, mb I can just document how warm-up before coordinated snapshot lambda handler.

Thanks, Artyom

NicklasWallgren commented 1 year ago

Has there been any progress into integrating support for CraC lately?

jackshirazi commented 1 year ago

hoping to get back to this next week, happy to have another review of the PR if anyone wants to add their input