Open gregkalapos opened 5 years ago
Some time ago we had a similar issue with a long-running task. Our service has a very simple, self-written background job:
public void Start(CancellationToken token)
{
    Task.Factory.StartNew(
        async () => RunAsync(token),
        CancellationToken.None,
        TaskCreationOptions.LongRunning,
        TaskScheduler.Default);
}

private async Task RunAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        // do some job
    }
}
From a business point of view, the service is under heavy load from our users at the end of every month. At that time our background job stops unexpectedly: we don't see any work being done by the job, and in a process dump no thread is busy with it. Restarting the service fixes it. During the investigation our assumption was that, under heavy load, the CLR triggers a gen 2 garbage collection due to high memory consumption, and for that reason the task associated with the background job is collected by the GC. Performance monitoring confirmed the assumption: just before the background job stops, we see a GC event for a gen 2 collection. The fix on our side was simple: store a reference to the returned task in an internal field. That solved our issue.
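A minimal sketch of that fix (simplified; the field name is illustrative, not our exact code):

```csharp
private Task _backgroundTask; // holding the reference keeps the task reachable

public void Start(CancellationToken token)
{
    _backgroundTask = Task.Factory.StartNew(
            async () => await RunAsync(token),
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default)
        .Unwrap(); // with an async delegate, StartNew returns Task<Task>; Unwrap exposes the inner task
}
```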
In addition, during the investigation we found a very interesting article about async/await and LongRunning tasks: http://blog.i3arnon.com/2015/07/02/task-run-long-running/.
Returning to the issue reported on AWS, I found a similar piece of code:
Also, as far as I can see (if I'm looking at the right numbers), your AWS instance has little memory (512 MB). That can cause more aggressive garbage collection, which could collect the task. Unfortunately, I cannot reproduce the issue locally and therefore cannot confirm that storing a reference to the started task in a field would solve the problem, but I suggest trying it and checking once more in AWS. @gregkalapos, WDYT? It doesn't seem like a big change.
@vhatsura These are very interesting insights, and it is indeed really easy to try out your proposal to hold on to the reference to the Task returned by Task.Factory.StartNew. I wouldn't put too much hope into it, but you never know, it might just work out, so we should definitely try it :)
The reason I think so is that your use case is quite different from ours: you use TaskScheduler.Default, while we use a custom SingleThreadTaskScheduler. The TaskCreationOptions.LongRunning that we pass to Task.Factory.StartNew doesn't have any effect in our case because, unlike TaskScheduler.Default, which takes the passed TaskCreationOptions into account, our SingleThreadTaskScheduler doesn't read task.Options at all. We should remove that TaskCreationOptions.LongRunning argument from Task.Factory.StartNew.
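Just for illustration (this is a simplified sketch, not our actual SingleThreadTaskScheduler): a single-threaded scheduler typically only overrides QueueTask/TryExecuteTaskInline and runs everything on its own dedicated thread, so it never consults the task's creation options and LongRunning can't change anything:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified sketch: every task runs on one dedicated thread and
// task.CreationOptions (e.g. LongRunning) is never consulted.
internal sealed class SimplifiedSingleThreadTaskScheduler : TaskScheduler
{
    private readonly BlockingCollection<Task> _tasks = new BlockingCollection<Task>();

    public SimplifiedSingleThreadTaskScheduler()
    {
        var thread = new Thread(() =>
        {
            foreach (var task in _tasks.GetConsumingEnumerable())
                TryExecuteTask(task);
        }) { IsBackground = true };
        thread.Start();
    }

    protected override void QueueTask(Task task) => _tasks.Add(task);

    // Never inline - work always goes through the dedicated thread.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _tasks.ToArray();
}
```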
After taking a look at the attached logs, I see that PayloadSenderV2's thread does run and gets to the point where it serializes the events to send:
2019-09-30 17:22:44: [Trace] Elastic.Apm: {PayloadSenderV2} There's data to be sent. Batch size: 2. newEventQueueCount: 0. First event: Transaction{Id: 2de0daaec8e15b4d, TraceId: 5863f49a218b0dbf19b9732f96430c25, ParentId: null, Name: GET /Index, Type: request, IsSampled: True}.
2019-09-30 17:28:52: [Trace] Elastic.Apm: {PayloadSenderV2} Serialized item to send: Transaction{Id: 2de0daaec8e15b4d, TraceId: 5863f49a218b0dbf19b9732f96430c25, ParentId: null, Name: GET /Index, Type: request, IsSampled: True} as {"context":{"request":{"body":"REDACTED","headers":{"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","accept-Encoding":"gzip, deflate, br","accept-Language":"en-US,en;q=0.9","cache-control":"max-age=0","cloudFront-Forwarded-Proto":"https","cloudFront-Is-Desktop-Viewer":"true","cloudFront-Is-Mobile-Viewer":"false","cloudFront-Is-SmartTV-Viewer":"false","cloudFront-Is-Tablet-Viewer":"false","cloudFront-Viewer-Country":"AT","host":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","referer":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod","sec-fetch-mode":"navigate","sec-fetch-site":"same-origin","sec-fetch-user":"?1","upgrade-insecure-requests":"1","user-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36","via":"2.0 a01b7aca64c6d4b437b814f64422d6c8.cloudfront.net (CloudFront)","x-Amz-Cf-Id":"IqlnFb8q390I1Hnv70t097As8LaYHBQdEfxbUdM5l9viEqRN_ysJqw==","x-Amzn-Trace-Id":"Root=1-5d921da3-67f0626409acd2eeca612b88","x-Forwarded-For":"90.146.112.229, 130.176.13.139","x-Forwarded-Port":"443","x-Forwarded-Proto":"https"},"http_version":"unknown","method":"GET","socket":{"encrypted":true,"remote_address":"90.146.112.229"},"url":{"full":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","hostname":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","pathname":"/","raw":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","search":""}},"response":{"finished":false,"headers":{"strict-Transport-Security":"max-age=2592000","content-Type":"text/html; charset=utf-8"},"status_code":0}},"duration":1786.386,"id":"2de0daaec8e15b4d","sampled":true,"name":"GET /Index","result":" 0xx","span_count":{"dropped":0,"started":0},"timestamp":1569856933623677,"trace_id":"5863f49a218b0dbf19b9732f96430c25","type":"request"}
2019-09-30 17:28:52: [Trace] Elastic.Apm: {PayloadSenderV2} Serialized item to send: Transaction{Id: 04e711fa35c60d5f, TraceId: 05ead4695331be1677fc618c17e8a404, ParentId: null, Name: GET /Index, Type: request, IsSampled: True} as {"context":{"request":{"body":"REDACTED","headers":{"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","accept-Encoding":"gzip, deflate, br","accept-Language":"en-US,en;q=0.9","cache-control":"max-age=0","cloudFront-Forwarded-Proto":"https","cloudFront-Is-Desktop-Viewer":"true","cloudFront-Is-Mobile-Viewer":"false","cloudFront-Is-SmartTV-Viewer":"false","cloudFront-Is-Tablet-Viewer":"false","cloudFront-Viewer-Country":"AT","host":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","referer":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod","sec-fetch-mode":"navigate","sec-fetch-site":"same-origin","sec-fetch-user":"?1","upgrade-insecure-requests":"1","user-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36","via":"2.0 a01b7aca64c6d4b437b814f64422d6c8.cloudfront.net (CloudFront)","x-Amz-Cf-Id":"bEIdSzHXEstlO_4a6jM0I4BXv7TXUeXSFLfGYBYabCvte0rM0v0nyw==","x-Amzn-Trace-Id":"Root=1-5d921da8-15a77a40921a11c05515e680","x-Forwarded-For":"90.146.112.229, 130.176.13.139","x-Forwarded-Port":"443","x-Forwarded-Proto":"https"},"http_version":"unknown","method":"GET","socket":{"encrypted":true,"remote_address":"90.146.112.229"},"url":{"full":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","hostname":"ijhkunomqk.execute-api.us-west-2.amazonaws.com","pathname":"/","raw":"https://ijhkunomqk.execute-api.us-west-2.amazonaws.com/Prod/","search":""}},"response":{"finished":false,"headers":{"strict-Transport-Security":"max-age=2592000","content-Type":"text/html; charset=utf-8"},"status_code":0}},"duration":2.877,"id":"04e711fa35c60d5f","sampled":true,"name":"GET /Index","result":" 0xx","span_count":{"dropped":0,"started":0},"timestamp":1569856936202970,"trace_id":"05ead4695331be1677fc618c17e8a404","type":"request"}
It already looks very strange that it took the thread more than 6 minutes to get from the point it detected events in the queue to the point it serialized them. It's not clear what happens with the following statement:
var result = await HttpClientInstance.PostAsync(BackendCommUtils.ApmServerEndpoints.IntakeV2EventsUrlPath, content, CtsInstance.Token);
Our HttpDiagnosticListenerImplBase sees that the request is indeed sent out:
2019-09-30 17:28:52: [Trace] Elastic.Apm: {HttpDiagnosticListenerImplBase} Called with key: `System.Net.Http.HttpRequestOut.Start', value: `{ Request = Method: POST, RequestUri: 'https://3858873f28774c15aba46a9172718b18.apm.us-east-1.aws.cloud.es.io/intake/v2/events', Version: 2.0, Content: System.Net.Http.StringContent, Headers:
{
User-Agent: elasticapm-dotnet/1.0.1
User-Agent: System.Net.Http/4.6.27714.0
User-Agent: .NET_Core/2.1.11
Authorization: Bearer kWSB8xdhSf7l5u96XS
Content-Type: application/x-ndjson; charset=utf-8
} }'
2019-09-30 17:28:52: [Trace] Elastic.Apm: {HttpDiagnosticListenerImplBase} Request URL (https://3858873f28774c15aba46a9172718b18.apm.us-east-1.aws.cloud.es.io/intake/v2/events) is filtered out - exiting
But there's no System.Net.Http.HttpRequestOut.Stop event. There's an event with the key System.Net.Http.Request, but according to the docs it's the same as System.Net.Http.HttpRequestOut.Start, emitted only for compatibility purposes.
So after trying Vadim's proposal, I would recommend investigating why there is no HttpRequestOut.Stop event - it suggests the PostAsync call above never completed.
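If it helps the investigation, a throwaway observer like this (illustrative only, not part of the agent) would dump every System.Net.Http diagnostic event key, so we could see whether HttpRequestOut.Stop is raised at all:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative debugging helper: print the key of every System.Net.Http diagnostic event.
internal static class HttpDiagnosticsDump
{
    public static void Enable() =>
        DiagnosticListener.AllListeners.Subscribe(new ListenerObserver());

    private sealed class ListenerObserver : IObserver<DiagnosticListener>
    {
        public void OnNext(DiagnosticListener listener)
        {
            // "HttpHandlerDiagnosticListener" is the source HttpClient uses on .NET Core.
            if (listener.Name == "HttpHandlerDiagnosticListener")
                listener.Subscribe(new EventObserver());
        }

        public void OnCompleted() { }
        public void OnError(Exception error) { }
    }

    private sealed class EventObserver : IObserver<KeyValuePair<string, object>>
    {
        public void OnNext(KeyValuePair<string, object> evt) =>
            Console.WriteLine($"HTTP diagnostic event: {evt.Key}");

        public void OnCompleted() { }
        public void OnError(Exception error) { }
    }
}
```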
@gregkalapos I see the following Error message in the logs:
2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id
Should we open a separate issue to investigate?
Also, it's important to note that the agent's CentralConfigFetcher does connect to the APM Server successfully:
2019-09-30 17:22:14: [Trace] Elastic.Apm: {CentralConfigFetcher} Processing HTTP response...
+-> Response:
StatusCode: 200, ReasonPhrase: 'OK', Version: 1.1, Content: System.Net.Http.HttpConnection+HttpConnectionResponseContent, Headers:
{
Date: Mon, 30 Sep 2019 15:22:14 GMT
Connection: keep-alive
Server: fp/f99c5a
X-Found-Handling-Cluster: 3858873f28774c15aba46a9172718b18
X-Found-Handling-Instance: instance-0000000000
X-Found-Handling-Server: 172.25.50.85
Cache-Control: must-revalidate, max-age=30
ETag: "d41d8cd98f00b204e9800998ecf8427e"
Content-Type: text/plain; charset=UTF-8
Content-Length: 3
}
+-> Response body [length: 3]:
{}
Of course there's still a difference between the central config and intake APIs (for example, HTTP GET vs. POST)...
@gregkalapos I see the following Error message in the logs:
2019-09-30 17:22:12: [Error] Elastic.Apm: {SystemInfoHelper} Failed parsing container id
Should we open a separate issue to investigate?
No, if the process is not running in a Docker container but it's a Linux machine, I think it's normal that we can't parse the container id from /proc/self/cgroup. Maybe we can change this Error to a Warning, or to Info. I'd leave it unless someone asks us to change it.
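For context, the parsing that fails here is roughly of this shape (an illustrative sketch, not the agent's actual SystemInfoHelper code): on a plain Linux machine the /proc/self/cgroup lines simply don't contain a container id, so there is nothing to match.

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Illustrative only: look for a 64-hex-character container id in /proc/self/cgroup.
// Outside a container the lines look like "12:cpu,cpuacct:/", so nothing matches.
internal static class ContainerIdSketch
{
    private static readonly Regex ContainerIdRegex = new Regex("([0-9a-f]{64})", RegexOptions.Compiled);

    public static string TryReadContainerId(string cgroupPath = "/proc/self/cgroup")
    {
        if (!File.Exists(cgroupPath)) return null;

        foreach (var line in File.ReadLines(cgroupPath))
        {
            var match = ContainerIdRegex.Match(line);
            if (match.Success) return match.Groups[1].Value;
        }

        return null; // not in a container, or an unrecognized cgroup layout
    }
}
```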
@weedaj I think you mentioned here that you use the .NET APM Agent on AWS. Isn't this an issue for you?
...just out of interest.
Thanks
@gregkalapos, thanks for the heads-up. In our environment we do seem to see fewer requests in the dashboard than expected. How would this issue manifest itself?
@weedaj I think it can be. My understanding was that the agent can't send any data at all - but maybe that's not the case. I tested it once, and the agent was not able to send anything in that case. It was a very simple setup with an ASP.NET Core application - 2.1, if I remember correctly.
Hi @gregkalapos, regarding the "Failed parsing container id" error above: is there anything else you'd need in order to change its log level from Error to something lower? Otherwise I'm asking you here.. :) Our logs are getting cluttered with these errors, which hide more important ones.
@hjmwijnen - it's ok to ask here :)
I opened https://github.com/elastic/apm-agent-dotnet/pull/823 - I think a Warning for this is totally reasonable - Info sounds a bit low to me. Would Warning instead of Error help you?
definitely! thanks! :)
Perfect, then we'll roll with warning. This will be included in the next release.
Another problem I noticed with AWS Lambda, but only with their Mock Test Tool (https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool):
Elastic APM just does not seem to want to play nicely with the Mock Test Tool at all; as soon as I step into any code that calls the APM agent, it throws:
---------------- Inner 1 Exception ------------
System.IO.FileLoadException: Could not load file or assembly 'Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. An operation is not legal in the current state. (0x80131509)
at Elastic.Apm.Metrics.MetricsProvider.GcMetricsProvider..ctor(IApmLogger logger, Boolean collectGcCount, Boolean collectGcGen0Size, Boolean collectGcGen1Size, Boolean collectGcGen2Size, Boolean collectGcGen3Size)
at Elastic.Apm.Metrics.MetricsCollector..ctor(IApmLogger logger, IPayloadSender payloadSender, IConfigurationReader configurationReader)
at Elastic.Apm.AgentComponents..ctor(IApmLogger logger, IConfigurationReader configurationReader, IPayloadSender payloadSender, IMetricsCollector metricsCollector, ICurrentExecutionSegmentsContainer currentExecutionSegmentsContainer, ICentralConfigFetcher centralConfigFetcher)
at Elastic.Apm.AgentComponents..ctor(IApmLogger logger, IConfigurationReader configurationReader, IPayloadSender payloadSender)
---------------- Inner 2 Exception ------------
System.InvalidOperationException: Resolved assembly's simple name should be the same as of the requested assembly.
at System.Runtime.Loader.AssemblyLoadContext.ValidateAssemblyNameWithSimpleName(Assembly assembly, String requestedSimpleName)
at System.Runtime.Loader.AssemblyLoadContext.ResolveUsingLoad(AssemblyName assemblyName)
at System.Runtime.Loader.AssemblyLoadContext.Resolve(IntPtr gchManagedAssemblyLoadContext, AssemblyName assemblyName)
Removing the entire if (PlatformDetection.IsDotNetFullFramework) block in Elastic.Apm/Metrics/MetricsProvider/GcMetricsProvider.cs fixes the issue locally.
The issue is not reproducible in the actual AWS Lambda environment, only locally using their test tool.
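A less invasive variant of that local workaround might be to make the TraceEvent-dependent setup non-fatal instead of deleting the whole branch; a rough sketch under that assumption (this is not the agent's actual GcMetricsProvider code, and setUpTraceEventSession is a stand-in for whatever the full-framework branch does):

```csharp
using System;
using System.IO;

// Illustrative sketch only: if the code path that needs
// Microsoft.Diagnostics.Tracing.TraceEvent cannot load, continue without
// GC metrics instead of letting the constructor throw.
internal sealed class GcMetricsSketch
{
    public bool GcMetricsAvailable { get; }

    public GcMetricsSketch(Action setUpTraceEventSession)
    {
        try
        {
            setUpTraceEventSession(); // stand-in for the full-framework-only branch
            GcMetricsAvailable = true;
        }
        catch (FileLoadException)
        {
            // Happens e.g. under the AWS Lambda Test Tool; degrade gracefully.
            GcMetricsAvailable = false;
        }
    }
}
```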
@AlexanderWert why does this keep getting moved? AWS Lambda is highly common, and it should be a relatively easy fix, right?
@ffMathy
Thanks for reaching out and for the feedback!
Supporting AWS Lambda involves more than just successfully sending data and is therefore a rather bigger task.
We already provide AWS Lambda support for Node.js, Python, and Java.
We plan to provide proper AWS Lambda support for .NET as well; it's just a matter of prioritization among the other things on our roadmap.
Feedback and indications of concrete need help us prioritize!
Oh, but I mean the null reference exception. It's still happening.
But thanks for the additional context.
Follow-up from #491.
The nullref from #491 is fixed, but during testing I realized that the agent does not work in AWS. AWS offers a template for ASP.NET Core apps, and with that we would probably not have our flushing problem.
But I realized that the agent is unable to send data to the APM Server - no exception is visible in the logs, but probably something is wrong in PayloadSenderV2. Logs attached: awslogs.txt
I'm not sure this is AWS related - it's definitely an environment where this happens.