elastic / apm-agent-dotnet

https://www.elastic.co/guide/en/apm/agent/dotnet/current/index.html
Apache License 2.0

PayloadSenderV2 does not send data - AWS Lambda #518

Open gregkalapos opened 5 years ago

gregkalapos commented 5 years ago

Follow-up from #491.

The nullref from #491 is fixed, but during testing I realized that the agent does not work in AWS. AWS offers a template for ASP.NET Core apps, and with that we probably would not have our flushing problem.

But the agent is unable to send data to the APM Server: no exception is visible in the logs, but something is probably wrong in PayloadSenderV2. Logs attached: awslogs.txt

I'm not sure this is AWS-specific, but it's definitely an environment where this happens.

vhatsura commented 5 years ago

Some time ago we had a similar issue with a long-running task. Our service has a very simple self-written background job:

public void Start(CancellationToken token)
{
    Task.Factory.StartNew(
        () => RunAsync(token), // note: returns Task<Task>, and nothing keeps a reference to it
        CancellationToken.None,
        TaskCreationOptions.LongRunning,
        TaskScheduler.Default);
}

private async Task RunAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        // do some job
    }
}

From a business perspective, the service is under heavy load from our users at the end of every month. At that time our background job stops unexpectedly: we don't see any work being done by the job, and in a process dump no thread is busy with it. Restarting the service fixes it. During the investigation we formed an assumption: under heavy load the CLR triggers a gen 2 garbage collection due to high memory consumption, and for this reason the task associated with the background job is collected by the GC. A performance monitor confirmed the assumption: just before the background job stops, we see a GC event for a gen 2 collection. The fix in our solution was simple: store a reference to the returned task in an internal field. That solved our issue.
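The fix described above could look roughly like this (a sketch only; `BackgroundJobHost`, `_runTask`, and `RunTask` are made-up names, not the commenter's actual code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class BackgroundJobHost
{
    // Holding the returned Task in a field keeps a strong root to it for the
    // lifetime of this object - the fix described above.
    private Task _runTask;

    // Exposed so callers can observe or await completion.
    public Task RunTask => _runTask;

    public void Start(CancellationToken token)
    {
        _runTask = Task.Factory.StartNew(
                () => RunAsync(token),
                CancellationToken.None,
                TaskCreationOptions.LongRunning,
                TaskScheduler.Default)
            .Unwrap(); // StartNew with an async delegate returns Task<Task>
    }

    private async Task RunAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            // do some job, then pause briefly
            await Task.Delay(50);
        }
    }
}
```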

In addition, during the investigation we found a very interesting article about async/await and LongRunning tasks: http://blog.i3arnon.com/2015/07/02/task-run-long-running/.
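The article's main point can be demonstrated in a few lines (a standalone sketch, not code from the agent):

```csharp
using System;
using System.Threading.Tasks;

class StartNewDemo
{
    static async Task Main()
    {
        // With an async delegate, Task.Factory.StartNew returns Task<Task>:
        // the outer task completes at the first await inside the lambda, so
        // TaskCreationOptions.LongRunning only covers that initial segment.
        Task<Task> outer = Task.Factory.StartNew(async () => await Task.Delay(100));

        await outer.Unwrap(); // Unwrap() yields a task for the whole async operation

        // Task.Run understands async delegates and unwraps automatically,
        // which is what the article linked above recommends.
        await Task.Run(async () => await Task.Delay(100));

        Console.WriteLine("both completed");
    }
}
```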

Returning to the issue registered for AWS, I found a similar piece of code:

https://github.com/elastic/apm-agent-dotnet/blob/045a07bfc1d77962b98cca78779ea4be063851c7/src/Elastic.Apm/BackendComm/BackendCommComponentBase.cs#L64

Also, if I'm reading the numbers right, your AWS instance has a low memory size (512 MB). That can cause more aggressive garbage collection, which could destroy the task. Unfortunately, I cannot reproduce the issue locally, so I cannot confirm that storing a reference to the started task in a field solves the problem, but I suggest trying it and checking again in AWS. @gregkalapos, WDYT? It doesn't seem like a big change.

SergeyKleyman commented 5 years ago

@vhatsura These are very interesting insights, and it is indeed really easy to try your proposal of holding a reference to the Task returned by Task.Factory.StartNew. I wouldn't put too much hope into it, but you never know - it might just work out, so we should definitely try it :)

The reason I think so is that your use case is quite different from ours: you use TaskScheduler.Default while we use a custom SingleThreadTaskScheduler. The TaskCreationOptions.LongRunning that we pass to Task.Factory.StartNew doesn't have any effect in our case, because unlike TaskScheduler.Default, which takes the passed TaskCreationOptions into account, our SingleThreadTaskScheduler doesn't read task.Options at all. We should remove that TaskCreationOptions.LongRunning argument from the Task.Factory.StartNew call.
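A minimal single-thread scheduler (a sketch, not the agent's actual SingleThreadTaskScheduler) shows why: the scheduler alone decides which thread runs queued tasks, and nothing forces it to look at the task's creation options:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// All queued tasks run on one dedicated thread; task.CreationOptions is
// never consulted, so TaskCreationOptions.LongRunning has no effect here.
public sealed class SingleThreadScheduler : TaskScheduler
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();

    public SingleThreadScheduler()
    {
        var thread = new Thread(() =>
        {
            foreach (var task in _queue.GetConsumingEnumerable())
                TryExecuteTask(task); // runs on this one thread regardless of options
        });
        thread.IsBackground = true;
        thread.Start();
    }

    protected override void QueueTask(Task task) => _queue.Add(task);

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        => false; // never run inline on the caller's thread

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();
}
```

With this scheduler, passing LongRunning or not produces identical behavior: both tasks land in the same queue and run on the same thread.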

After taking a look at the attached logs, I see that PayloadSenderV2's thread does run and gets to the point where it serializes events to send:

2019-09-30 17:22:44: [Trace] Elastic.Apm: {PayloadSenderV2} There's data to be sent. Batch size: 2. newEventQueueCount: 0. First event: Transaction{Id: 2de0daaec8e15b4d, TraceId: 5863f49a218b0dbf19b9732f96430c25, ParentId: null, Name: GET /Index, Type: request, IsSampled: True}. 
2019-09-30 17:28:52: [Trace] Elastic.Apm: {PayloadSenderV2} Serialized item to send: Transaction{Id: 2de0daaec8e15b4d, TraceId: 5863f49a218b0dbf19b9732f96430c25, ParentId: null, Name: GET /Index, Type: request, IsSampled: True} as {"context":{"request":{"body":"REDACTED","headers":{"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","accept-Encoding":"gzip, deflate, br","accept-Language":"en-US,en;q=0.9","cache-control":"max-age=0","cloudFront-Forwarded-Proto":"https","cloudFront-Is-Desktop-Viewer":"true","cloudFront-Is-Mobile-Viewer":"false","cloudFront-Is-SmartTV-Viewer":"false","cloudFront-Is-Tablet-Viewer":"false","cloudFront-Viewer-Country":"AT","host":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","referer":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod","sec-fetch-mode":"navigate","sec-fetch-site":"same-origin","sec-fetch-user":"?1","upgrade-insecure-requests":"1","user-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36","via":"2.0 a01b7aca64c6d4b437b814f64422d6c8.cloudfront.net (CloudFront)","x-Amz-Cf-Id":"IqlnFb8q390I1Hnv70t097As8LaYHBQdEfxbUdM5l9viEqRN_ysJqw==","x-Amzn-Trace-Id":"Root=1-5d921da3-67f0626409acd2eeca612b88","x-Forwarded-For":"90.146.112.229, 130.176.13.139","x-Forwarded-Port":"443","x-Forwarded-Proto":"https"},"http_version":"unknown","method":"GET","socket":{"encrypted":true,"remote_address":"90.146.112.229"},"url":{"full":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","hostname":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","pathname":"/","raw":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","search":""}},"response":{"finished":false,"headers":{"strict-Transport-Security":"max-age=2592000","content-Type":"text/html; charset=utf-8"},"status_code":0}},"duration":1786.386,"id":"2de0daaec8e15b4d","sampled":true,"name":"GET /Index","result":" 
0xx","span_count":{"dropped":0,"started":0},"timestamp":1569856933623677,"trace_id":"5863f49a218b0dbf19b9732f96430c25","type":"request"} 
2019-09-30 17:28:52: [Trace] Elastic.Apm: {PayloadSenderV2} Serialized item to send: Transaction{Id: 04e711fa35c60d5f, TraceId: 05ead4695331be1677fc618c17e8a404, ParentId: null, Name: GET /Index, Type: request, IsSampled: True} as {"context":{"request":{"body":"REDACTED","headers":{"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","accept-Encoding":"gzip, deflate, br","accept-Language":"en-US,en;q=0.9","cache-control":"max-age=0","cloudFront-Forwarded-Proto":"https","cloudFront-Is-Desktop-Viewer":"true","cloudFront-Is-Mobile-Viewer":"false","cloudFront-Is-SmartTV-Viewer":"false","cloudFront-Is-Tablet-Viewer":"false","cloudFront-Viewer-Country":"AT","host":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","referer":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod","sec-fetch-mode":"navigate","sec-fetch-site":"same-origin","sec-fetch-user":"?1","upgrade-insecure-requests":"1","user-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36","via":"2.0 a01b7aca64c6d4b437b814f64422d6c8.cloudfront.net (CloudFront)","x-Amz-Cf-Id":"bEIdSzHXEstlO_4a6jM0I4BXv7TXUeXSFLfGYBYabCvte0rM0v0nyw==","x-Amzn-Trace-Id":"Root=1-5d921da8-15a77a40921a11c05515e680","x-Forwarded-For":"90.146.112.229, 130.176.13.139","x-Forwarded-Port":"443","x-Forwarded-Proto":"https"},"http_version":"unknown","method":"GET","socket":{"encrypted":true,"remote_address":"90.146.112.229"},"url":{"full":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","hostname":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","pathname":"/","raw":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","search":""}},"response":{"finished":false,"headers":{"strict-Transport-Security":"max-age=2592000","content-Type":"text/html; charset=utf-8"},"status_code":0}},"duration":2.877,"id":"04e711fa35c60d5f","sampled":true,"name":"GET /Index","result":" 
0xx","span_count":{"dropped":0,"started":0},"timestamp":1569856936202970,"trace_id":"05ead4695331be1677fc618c17e8a404","type":"request"} 

It already looks very strange that it took the thread more than six minutes to get from the point where it detected events in the queue to the point where it serialized them. It's also not clear what happens with the following statement:

var result = await HttpClientInstance.PostAsync(BackendCommUtils.ApmServerEndpoints.IntakeV2EventsUrlPath, content, CtsInstance.Token);
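One way to diagnose this (a sketch, not the agent's code; `PostWithTimeoutAsync` is a made-up helper) would be to race the POST against a timeout, so a silent hang turns into a visible exception:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class TimeoutDiagnostics
{
    // Cancels the request if it does not complete within the given timeout,
    // surfacing a hang like the one in the logs above as a TimeoutException.
    public static async Task<HttpResponseMessage> PostWithTimeoutAsync(
        HttpClient client, string url, HttpContent content,
        TimeSpan timeout, CancellationToken token)
    {
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(token))
        {
            cts.CancelAfter(timeout);
            try
            {
                return await client.PostAsync(url, content, cts.Token);
            }
            catch (OperationCanceledException) when (!token.IsCancellationRequested)
            {
                // Only our timeout fired, not the caller's token.
                throw new TimeoutException($"POST to {url} did not complete within {timeout}.");
            }
        }
    }
}
```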

Our HttpDiagnosticListenerImplBase sees that the request is indeed sent out:

2019-09-30 17:28:52: [Trace] Elastic.Apm: {HttpDiagnosticListenerImplBase} Called with key: `System.Net.Http.HttpRequestOut.Start', value: `{ Request = Method: POST, RequestUri: 'https://3858873f28774c15aba46a9172718b18.apm.us-east-1.aws.cloud.es.io/intake/v2/events', Version: 2.0, Content: System.Net.Http.StringContent, Headers:
{
  User-Agent: elasticapm-dotnet/1.0.1
  User-Agent: System.Net.Http/4.6.27714.0
  User-Agent: .NET_Core/2.1.11
  Authorization: Bearer kWSB8xdhSf7l5u96XS
  Content-Type: application/x-ndjson; charset=utf-8
} }' 
2019-09-30 17:28:52: [Trace] Elastic.Apm: {HttpDiagnosticListenerImplBase} Request URL (https://3858873f28774c15aba46a9172718b18.apm.us-east-1.aws.cloud.es.io/intake/v2/events) is filtered out - exiting 

But there's no System.Net.Http.HttpRequestOut.Stop event. There is an event with the key System.Net.Http.Request, but according to the docs it's the same as System.Net.Http.HttpRequestOut.Start, emitted for compatibility purposes.
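For anyone who wants to confirm this independently of the agent, a small observer could print every outgoing-request event (a sketch; the "HttpHandlerDiagnosticListener" source is the one that emits System.Net.Http.HttpRequestOut.* on .NET Core):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class HttpOutEventPrinter :
    IObserver<DiagnosticListener>, IObserver<KeyValuePair<string, object>>
{
    public void OnNext(DiagnosticListener listener)
    {
        // Subscribe only to the source that carries outgoing HTTP events.
        if (listener.Name == "HttpHandlerDiagnosticListener")
            listener.Subscribe(this);
    }

    public void OnNext(KeyValuePair<string, object> evt)
    {
        if (evt.Key.StartsWith("System.Net.Http.HttpRequestOut"))
            Console.WriteLine($"{DateTime.UtcNow:O} {evt.Key}");
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Usage: DiagnosticListener.AllListeners.Subscribe(new HttpOutEventPrinter());
// A healthy request should print a .Start followed by a .Stop.
```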

So after trying Vadim's proposal, I would recommend investigating this further.

SergeyKleyman commented 5 years ago

@gregkalapos I see the following Error message in the logs:

2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id

Should we open a separate issue to investigate?

SergeyKleyman commented 5 years ago

Also, it's important to note that the agent's CentralConfigFetcher does connect to the APM Server successfully:

2019-09-30 17:22:14: [Trace] Elastic.Apm: {CentralConfigFetcher} Processing HTTP response...
+-> Response:
    StatusCode: 200, ReasonPhrase: 'OK', Version: 1.1, Content: System.Net.Http.HttpConnection+HttpConnectionResponseContent, Headers:
    {
      Date: Mon, 30 Sep 2019 15:22:14 GMT
      Connection: keep-alive
      Server: fp/f99c5a
      X-Found-Handling-Cluster: 3858873f28774c15aba46a9172718b18
      X-Found-Handling-Instance: instance-0000000000
      X-Found-Handling-Server: 172.25.50.85
      Cache-Control: must-revalidate, max-age=30
      ETag: "d41d8cd98f00b204e9800998ecf8427e"
      Content-Type: text/plain; charset=UTF-8
      Content-Length: 3
    }
+-> Response body [length: 3]:
    {}

Of course, there's still a difference between the central config and intake APIs (for example, HTTP GET vs. POST)...

gregkalapos commented 5 years ago

@gregkalapos I see the following Error message in the logs:

2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id

Should we open a separate issue to investigate?

No - if the process is not running in a Docker container but it is a Linux machine, I think it's normal that we can't parse the container id from /proc/self/cgroup. Maybe we could change this error to a Warning, or to Info. I'd leave it unless someone asks us to change it.
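For context, a rough sketch of the kind of parsing involved (not the agent's actual implementation): inside a container, /proc/self/cgroup lines typically end with a 64-hex-digit container id, while on a plain Linux host they end with a path like "/", so there is simply nothing to extract - the benign case described above.

```csharp
using System;
using System.Text.RegularExpressions;

static class CgroupParsing
{
    // Returns the 64-hex-digit container id at the end of a cgroup line,
    // or null when the line has no id (e.g. a non-containerized Linux host).
    public static string TryParseContainerId(string cgroupLine)
    {
        var match = Regex.Match(cgroupLine, "([0-9a-f]{64})(?:\\.scope)?$");
        return match.Success ? match.Groups[1].Value : null;
    }
}
```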

gregkalapos commented 4 years ago

@weedaj

I think you mentioned here that you use the .NET APM Agent on AWS. Isn't this an issue for you?

...just out of interest.

Thanks

weedaj commented 4 years ago

@weedaj

I think you mentioned here that you use the .NET APM Agent on AWS. Isn't this an issue for you?

...just out of interest.

Thanks

@gregkalapos, thanks for the heads-up. In our environment we do seem to see fewer requests in the dashboard than expected. How would this issue manifest itself?

gregkalapos commented 4 years ago

In our environment we do seem to see fewer requests in the dashboard than expected. How would this issue manifest itself?

@weedaj I think it could be related. My understanding was that the agent can't send any data at all, but maybe that's not the case. When I tested it, the agent was not able to send anything. It was a very simple setup with an ASP.NET Core application - 2.1, if I remember correctly.

hjmwijnen commented 4 years ago

2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id


Should we open a separate issue to investigate?

No - if the process is not running in a Docker container but it is a Linux machine, I think it's normal that we can't parse the container id from /proc/self/cgroup. Maybe we could change this error to a Warning, or to Info. I'd leave it unless someone asks us to change it.

Hi @gregkalapos, is there anything else you need in order to change the log level from Error to something lower? If not, I'm asking you here. :) Our logs are getting flooded with these errors, which hide more important ones.

gregkalapos commented 4 years ago

2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id

Should we open a separate issue to investigate?

No - if the process is not running in a Docker container but it is a Linux machine, I think it's normal that we can't parse the container id from /proc/self/cgroup. Maybe we could change this error to a Warning, or to Info. I'd leave it unless someone asks us to change it.

Hi @gregkalapos, is there anything else you need in order to change the log level from Error to something lower? If not, I'm asking you here. :) Our logs are getting flooded with these errors, which hide more important ones.

@hjmwijnen - it's ok to ask here :)

I opened https://github.com/elastic/apm-agent-dotnet/pull/823 - I think a Warning for this is totally reasonable; Info sounds a bit low to me. Would Warning instead of Error help you?

hjmwijnen commented 4 years ago

I opened #823 - I think a Warning for this is totally reasonable - Info sounds a bit low to me. Would Warning instead of Error help you?

Definitely! Thanks! :)

gregkalapos commented 4 years ago

I opened #823 - I think a Warning for this is totally reasonable - Info sounds a bit low to me. Would Warning instead of Error help you?

Definitely! Thanks! :)

Perfect, then we'll roll with Warning. This will be included in the next release.

Thiago-Dantas commented 4 years ago

Another problem I noticed with AWS Lambda, but only with their Mock Test Tool (https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool):

Elastic APM just does not seem to want to play with the AWS Mock Test Tool at all; as soon as I step into any code that calls APM, it throws:

---------------- Inner 1 Exception ------------
System.IO.FileLoadException: Could not load file or assembly 'Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. An operation is not legal in the current state. (0x80131509)
   at Elastic.Apm.Metrics.MetricsProvider.GcMetricsProvider..ctor(IApmLogger logger, Boolean collectGcCount, Boolean collectGcGen0Size, Boolean collectGcGen1Size, Boolean collectGcGen2Size, Boolean collectGcGen3Size)
   at Elastic.Apm.Metrics.MetricsCollector..ctor(IApmLogger logger, IPayloadSender payloadSender, IConfigurationReader configurationReader)
   at Elastic.Apm.AgentComponents..ctor(IApmLogger logger, IConfigurationReader configurationReader, IPayloadSender payloadSender, IMetricsCollector metricsCollector, ICurrentExecutionSegmentsContainer currentExecutionSegmentsContainer, ICentralConfigFetcher centralConfigFetcher)
   at Elastic.Apm.AgentComponents..ctor(IApmLogger logger, IConfigurationReader configurationReader, IPayloadSender payloadSender)
---------------- Inner 2 Exception ------------
System.InvalidOperationException: Resolved assembly's simple name should be the same as of the requested assembly.
   at System.Runtime.Loader.AssemblyLoadContext.ValidateAssemblyNameWithSimpleName(Assembly assembly, String requestedSimpleName)
   at System.Runtime.Loader.AssemblyLoadContext.ResolveUsingLoad(AssemblyName assemblyName)
   at System.Runtime.Loader.AssemblyLoadContext.Resolve(IntPtr gchManagedAssemblyLoadContext, AssemblyName assemblyName)

Removing the entire if (PlatformDetection.IsDotNetFullFramework) block in Elastic.Apm/Metrics/MetricsProvider/GcMetricsProvider.cs fixes the issue locally.
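A hedged alternative to deleting the check outright (all names below are hypothetical, not the agent's API): wrap the TraceEvent-dependent construction so a FileLoadException in an unusual host, like the Lambda test tool, only disables GC metrics instead of failing agent startup.

```csharp
using System;
using System.IO;

static class SafeMetrics
{
    // Runs the given factory; if the host cannot load the required assembly,
    // log and return null so the rest of the agent keeps working.
    public static T TryCreate<T>(Func<T> factory) where T : class
    {
        try
        {
            return factory();
        }
        catch (FileLoadException e)
        {
            Console.Error.WriteLine($"GC metrics disabled: {e.Message}");
            return null;
        }
    }
}
```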

This issue is not reproducible in the actual AWS Lambda environment, only locally using their test tool.

ffMathy commented 2 years ago

@AlexanderWert why does this keep getting moved? AWS Lambda is very common, and it should be a relatively easy fix, right?

AlexanderWert commented 2 years ago

@ffMathy

Thanks for reaching out and the feedback!

Supporting AWS Lambda involves more than just successfully sending data, and is thus a rather big task:

We already provide AWS Lambda support for Node.js, Python, and Java.

We plan to provide proper AWS Lambda support for .NET as well; it's just a matter of prioritization among the other things we have on the roadmap.

Feedback and indications of concrete need help us prioritize!

ffMathy commented 2 years ago

Oh, but I mean the null reference exception. It's still happening.

But thanks for the additional context.