dotnet / dotnet-monitor

This repository contains the source code for .NET Monitor - a tool that allows you to gather diagnostic data from running applications using HTTP endpoints.
MIT License

Unexpected timeout from process #1758

Open maciejsobocinski opened 2 years ago

maciejsobocinski commented 2 years ago

Hi,

We have dotnet monitor set up on ECS Fargate, running in listen mode and collecting metrics every X. Our setup is a single dotnet monitor sidecar inside each launched task, with many tasks being launched. On some tasks it stops working after a few hours with the following error:

{
    "Timestamp": "2022-04-15T06:50:03.0758604Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
        "processId": "6",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

Then all subsequent requests get this error:

{
    "Timestamp": "2022-04-15T06:55:01.6363199Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:5f73f4ec6a4c2a06, TraceId:6e3bec22534dca3eed9ae13c8150dc0c, ParentId:0d6726492bd0e999",
            "SpanId": "5f73f4ec6a4c2a06",
            "TraceId": "6e3bec22534dca3eed9ae13c8150dc0c",
            "ParentId": "0d6726492bd0e999"
        },
        {
            "Message": "ConnectionId:0HMGU731FOFDF",
            "ConnectionId": "0HMGU731FOFDF"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDF:00000002",
            "RequestId": "0HMGU731FOFDF:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "cc79e4d4-794e-481f-8083-fb3f3c7b5ca5",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics",
            "ArtifactType": "livemetrics"
        }
    ]
}

Note that the main container itself keeps working just fine and is processing requests without any issues. Per the metrics captured before the error, I do not see any abnormal memory/CPU/etc. usage compared to the other tasks where dotnet-monitor keeps working.

Here is our ECS task definition (the dotnet monitor config values are under 'Environment'):

  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Cpu: !Ref TaskCpu
      Memory: !Ref TaskMemory
      NetworkMode: awsvpc
      ExecutionRoleArn: !Sub "arn:aws:iam::${AWS::AccountId}:role/ecsTaskExecutionRole"
      TaskRoleArn: !ImportValue AppServicesEcsTaskRoleArn
      RequiresCompatibilities:
        - FARGATE
      Volumes:
        - Name: tmp
      ContainerDefinitions:
        - Essential: true
          Name: appservices
          Image:
            !Sub
              - "${repository}:${image}"
              - repository: !ImportValue AppServicesEcrRepository
                image: !Ref TaskEcrImageTag
          Ulimits:
            - Name: nofile
              HardLimit: 65535
              SoftLimit: 65535
          PortMappings:
            - ContainerPort: 44392
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !ImportValue AppServicesEcsLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref EnvironmentName
          LinuxParameters:
            InitProcessEnabled: true
            Capabilities:
              Add:
                - SYS_PTRACE
          StopTimeout: 120
          MountPoints:
            - ContainerPath: /tmp
              SourceVolume: tmp
          Environment:
            - Name: DOTNET_DiagnosticPorts
              Value: /tmp/port
          DependsOn:
            - ContainerName: dotnet-monitor
              Condition: START
        - Essential: true
          Name: dotnet-monitor
          Image:
            !Sub
            - "${repository}:${image}-dotnetmonitor"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
          MountPoints:
            - ContainerPath: /tmp
              SourceVolume: tmp
          Environment:
            - Name: Kestrel__Certificates__Default__Path
              Value: /tmp/cert.pfx
            - Name: DotnetMonitor_S3Bucket
              Value: !Sub '{{resolve:ssm:/appservices/${EnvironmentName}/integration.bulk.s3.bucket:1}}'
            - Name: DotnetMonitor_DefaultProcess__Filters__0__Key
              Value: ProcessName
            - Name: DotnetMonitor_DefaultProcess__Filters__0__Value
              Value: dotnet
            - Name: DotnetMonitor_DiagnosticPort__ConnectionMode
              Value: Listen
            - Name: DotnetMonitor_DiagnosticPort__EndpointName
              Value: /tmp/port
            - Name: DotnetMonitor_Storage__DumpTempFolder
              Value: /tmp
            - Name: DotnetMonitor_Egress__FileSystem__file__directoryPath
              Value: /tmp/gcdump
            - Name: DotnetMonitor_Egress__FileSystem__file__intermediateDirectoryPath
              Value: /tmp/gcdumptmp
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Type
              Value: EventCounter
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__ProviderName
              Value: System.Runtime
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__CounterName
              Value: working-set
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__GreaterThan
              Value: !Ref TaskMemoryAutoGCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__SlidingWindowDuration
              Value: 00:00:05
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Type
              Value: CollectGCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Name
              Value: GCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Settings__Egress
              Value: file
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Type
              Value: Execute
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Path
              Value: /bin/sh
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Arguments
              Value: /app/gcdump.sh $(Actions.GCDump.EgressPath)
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCount
              Value: 1
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCountSlidingWindowDuration
              Value: 03:00:00
          Secrets:
            - Name: DotnetMonitor_Authentication__MonitorApiKey__Subject
              ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.subject"
            - Name: DotnetMonitor_Authentication__MonitorApiKey__PublicKey
              ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.publickey"
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !ImportValue AppServicesEcsLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref EnvironmentName

And here is the Dockerfile used to customize the default dotnet monitor container:

FROM mcr.microsoft.com/dotnet/monitor:6

RUN apk add curl && \
    apk add jq && \
    apk add aws-cli && \
    apk add dos2unix

RUN adduser -s /bin/true -u 1000 -D -h /app app \
  && chown -R "app" "/app"

COPY --chown=app:app --chmod=500 gcdump.sh /app/gcdump.sh
RUN dos2unix /app/gcdump.sh

USER app
maciejsobocinski commented 2 years ago

Another update on this. We are also having trouble with the rule above that collects a gc dump when working-set goes above 70%. GC dumps work fine for us under low memory usage, but under high usage it seems the dump pauses the dotnet process in the main container for so long that our load balancer kills the task - health checks are set to 45 seconds of failures before relaunching. Looking at our perf logs during that time, memory usage is 6678/8192 MB (186 MB of that is dotnet monitor) and CPU usage is 1004/2048 (only 1 of that is used by dotnet monitor). The weird thing is that, per the dotnet-monitor logs below, the gcdump times out after about 3 seconds, but it looks like it is still running inside the main container, causing timeouts and the container restart.

Any tips/ideas/guidance on how to capture a gc dump in prod (we are trying to diagnose memory spikes)?
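
(For reference, triggering a gcdump on demand over the HTTP API, which is what later comments in this thread end up doing, can look roughly like the sketch below. This is a minimal, hypothetical probe, not part of the thread: it assumes the default https://localhost:52323 binding, a bearer token matching the MonitorApiKey settings above supplied in an environment variable named MONITOR_JWT, and the "file" egress provider from the task definition; adjust for your deployment.)

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class GcDumpProbe
{
    static async Task Main()
    {
        // For ad-hoc testing against the self-signed cert configured above; not for production use.
        using var handler = new HttpClientHandler
        {
            ServerCertificateCustomValidationCallback =
                HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
        };
        using var client = new HttpClient(handler) { BaseAddress = new Uri("https://localhost:52323") };

        // MONITOR_JWT is a placeholder for a token that matches the MonitorApiKey configuration.
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Bearer", Environment.GetEnvironmentVariable("MONITOR_JWT"));

        // Requesting an egress provider ("file" maps to /tmp/gcdump in the config above) should
        // return 202 Accepted with a Location header pointing at an operation that can be polled.
        HttpResponseMessage response = await client.GetAsync("/gcdump?egressProvider=file");
        Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
        Console.WriteLine(response.Headers.Location);
    }
}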

Here are the dotnet-monitor logs of the gc dump rule being triggered and timing out after about 3 seconds; for the next minute or so, main container requests time out/fail (presumably because a gc dump is still running?), and eventually the container(s) get killed by our load balancer:


{
    "Timestamp": "2022-04-19T03:04:46.3862671Z",
    "EventId": 36,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' completed.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' completed.",
        "ruleName": "HighMemoryRule",
        "triggerType": "EventCounter",
        "{OriginalFormat}": "Collection rule '{ruleName}' trigger '{triggerType}' completed."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:46.5268863Z",
    "EventId": 33,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' action 'CollectGCDump' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' action 'CollectGCDump' started.",
        "ruleName": "HighMemoryRule",
        "actionType": "CollectGCDump",
        "{OriginalFormat}": "Collection rule '{ruleName}' action '{actionType}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        },
        {
            "Message": "CollectionRuleActionType:CollectGCDump CollectionRuleActionIndex:0",
            "CollectionRuleActionType": "CollectGCDump",
            "CollectionRuleActionIndex": 0
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:46.5665901Z",
    "EventId": 33,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' action 'Execute' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' action 'Execute' started.",
        "ruleName": "HighMemoryRule",
        "actionType": "Execute",
        "{OriginalFormat}": "Collection rule '{ruleName}' action '{actionType}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        },
        {
            "Message": "CollectionRuleActionType:Execute CollectionRuleActionIndex:1",
            "CollectionRuleActionType": "Execute",
            "CollectionRuleActionIndex": 1
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:49.5671423Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 10. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 10. Process will no longer be monitored.",
        "processId": "10",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}
{
    "Timestamp": "2022-04-19T03:04:49.5714281Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:49.5779787Z",
    "EventId": 38,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' action 'CollectGCDump' failed.",
    "Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.    at Microsoft.Diagnostics.Tools.Monitor.TaskExtensions.WithCancellation(Task task, CancellationToken token) in /_/src/Tools/dotnet-monitor/TaskExtensions.cs:line 37    at Microsoft.Diagnostics.Tools.Monitor.TaskExtensions.WithCancellation[T](Task`1 task, CancellationToken token) in /_/src/Tools/dotnet-monitor/TaskExtensions.cs:line 48    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.CollectionRuleActionBase`1.WaitForCompletionAsync(CancellationToken token) in /_/src/Tools/dotnet-monitor/CollectionRules/Actions/CollectionRuleActionBase.cs:line 64    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.WaitForCompletion(CollectionRuleContext context, Action startCallback, IDictionary`2 allResults, ICollectionRuleAction action, CollectionRuleActionOptions actionOption, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 160    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.WaitForCompletion(CollectionRuleContext context, Action startCallback, IDictionary`2 allResults, ActionCompletionEntry entry, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 177",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' action 'CollectGCDump' failed.",
        "ruleName": "HighMemoryRule",
        "actionType": "CollectGCDump",
        "{OriginalFormat}": "Collection rule '{ruleName}' action '{actionType}' failed."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        },
        {
            "Message": "CollectionRuleActionType:Execute CollectionRuleActionIndex:1",
            "CollectionRuleActionType": "Execute",
            "CollectionRuleActionIndex": 1
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:49.5977629Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Request canceled.",
    "State": {
        "Message": "Request canceled.",
        "{OriginalFormat}": "Request canceled."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        },
        {
            "Message": "CollectionRuleActionType:CollectGCDump CollectionRuleActionIndex:0",
            "CollectionRuleActionType": "CollectGCDump",
            "CollectionRuleActionIndex": 0
        },
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:10 ArtifactSource_RuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "10",
            "ArtifactSource_RuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:49.6023431Z",
    "EventId": 38,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' action 'Execute' failed.",
    "Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.    at Microsoft.Diagnostics.Tools.Monitor.TaskExtensions.WithCancellation(Task task, CancellationToken token) in /_/src/Tools/dotnet-monitor/TaskExtensions.cs:line 37    at Microsoft.Diagnostics.Tools.Monitor.TaskExtensions.WithCancellation[T](Task`1 task, CancellationToken token) in /_/src/Tools/dotnet-monitor/TaskExtensions.cs:line 48    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.CollectionRuleActionBase`1.WaitForCompletionAsync(CancellationToken token) in /_/src/Tools/dotnet-monitor/CollectionRules/Actions/CollectionRuleActionBase.cs:line 64    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.WaitForCompletion(CollectionRuleContext context, Action startCallback, IDictionary`2 allResults, ICollectionRuleAction action, CollectionRuleActionOptions actionOption, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 160    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.WaitForCompletion(CollectionRuleContext context, Action startCallback, IDictionary`2 allResults, ActionCompletionEntry entry, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 177    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.WaitForCompletion(CollectionRuleContext context, Action startCallback, IDictionary`2 allResults, ActionCompletionEntry entry, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 185    at Microsoft.Diagnostics.Tools.Monitor.CollectionRules.Actions.ActionListExecutor.ExecuteActions(CollectionRuleContext context, Action startCallback, CancellationToken cancellationToken) in /_/src/Tools/dotnet-monitor/CollectionRules/ActionListExecutor.cs:line 77",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' action 'Execute' failed.",
        "ruleName": "HighMemoryRule",
        "actionType": "Execute",
        "{OriginalFormat}": "Collection rule '{ruleName}' action '{actionType}' failed."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        },
        {
            "Message": "CollectionRuleActionType:Execute CollectionRuleActionIndex:1",
            "CollectionRuleActionType": "Execute",
            "CollectionRuleActionIndex": 1
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:04:49.6056195Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:34.4132777Z",
    "EventId": 40,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Starting collection rules.",
    "State": {
        "Message": "Starting collection rules.",
        "{OriginalFormat}": "Starting collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:34.4137868Z",
    "EventId": 29,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' started.",
        "ruleName": "HighMemoryRule",
        "{OriginalFormat}": "Collection rule '{ruleName}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:34.4142123Z",
    "EventId": 35,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' started.",
        "ruleName": "HighMemoryRule",
        "triggerType": "EventCounter",
        "{OriginalFormat}": "Collection rule '{ruleName}' trigger '{triggerType}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1 CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:34.4164694Z",
    "EventId": 32,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules started.",
    "State": {
        "Message": "All collection rules started.",
        "{OriginalFormat}": "All collection rules started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:58.8419003Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 10. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 10. Process will no longer be monitored.",
        "processId": "10",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}
{
    "Timestamp": "2022-04-19T03:05:58.8422906Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:05:58.8516468Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:10 TargetRuntimeInstanceCookie:1af38edab5774d948ed55b3310dc80d1",
            "TargetProcessId": "10",
            "TargetRuntimeInstanceCookie": "1af38edab5774d948ed55b3310dc80d1"
        }
    ]
}
{
    "Timestamp": "2022-04-19T03:06:09.9610855Z",
    "EventId": 0,
    "LogLevel": "Information",
    "Category": "Microsoft.Hosting.Lifetime",
    "Message": "Application is shutting down...",
    "State": {
        "Message": "Application is shutting down...",
        "{OriginalFormat}": "Application is shutting down..."
    },
    "Scopes": []
}
maciejsobocinski commented 2 years ago

Glancing at the dotnet monitor code, it feels like "CheckEndpointAsync" below ends up cancelling the gc dump and rule collection if the process is unresponsive for more than 2.5 seconds (because a gc dump is taking place); then the rules are restarted and the same gc dump rule is invoked again (further pausing the runtime)? The first check against operationTrackerService only returns true if a trace or a dump is running...but not a gcdump:

private async Task<bool> CheckEndpointAsync(EndpointInfo info, CancellationToken token)
{
    // If a dump operation is in progress, the runtime is likely to not respond to
    // diagnostic requests. Do not check for responsiveness while the dump operation
    // is in progress.
    if (_operationTrackerService?.IsExecutingOperation(info) == true)
    {
        return true;
    }

    using var timeoutSource = new CancellationTokenSource();
    using var linkedSource = CancellationTokenSource.CreateLinkedTokenSource(token, timeoutSource.Token);

    try
    {
        timeoutSource.CancelAfter(PruneWaitForConnectionTimeout);

        await info.Endpoint.WaitForConnectionAsync(linkedSource.Token).ConfigureAwait(false);
    }
    catch (OperationCanceledException) when (timeoutSource.IsCancellationRequested)
    {
        _logger?.EndpointTimeout(info.ProcessId.ToString(System.Globalization.CultureInfo.InvariantCulture));
        return false;
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        return false;
    }

    return true;
}
jander-msft commented 2 years ago

@maciejsobocinski, thanks for reporting this and digging into it. Have a few questions for you:

Our setup is a single dotnet monitor sidecar inside each launched task, with many tasks being launched. On some tasks it stops working after a few hours with the following error:

Is this occurring randomly, or only after an impactful operation happens such as capturing a gcdump?

The first check against operationTrackerService only returns true if a trace or a dump is running...but not a gcdump

My understanding is that the diagnostic communication channel should still be viable while a gcdump is being collected. @josalem, do you know if there is some kind of lock that is held while GC data is collected that would prevent the diagnostic communication channel from responding? If this is expected, we can add gcdump operations to the tracking service in dotnet monitor to prevent pruning of the diagnostic connection. However, even with the pruning (which disposes the stream), I would still expect the runtime instance to try to reestablish the diagnostic connection again, which it does not seem to do (or maybe we don't see that because it is still collecting GC data).
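
(As a rough illustration of that idea, the hypothetical sketch below registers the gcdump operation with an operation tracker for its duration, so that a check like IsExecutingOperation in CheckEndpointAsync above would return true and the responsiveness probe, and therefore the pruning, would be skipped while GC data is collected. The IEndpointInfoStub/IOperationTrackerStub types and the Register method are stand-ins invented for this example; they are not the actual dotnet monitor API.)

using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in abstractions for illustration only; the real dotnet monitor types differ.
public interface IEndpointInfoStub { int ProcessId { get; } }

public interface IOperationTrackerStub
{
    // Hypothetical scope-based registration; disposing the scope ends the operation.
    IDisposable Register(IEndpointInfoStub endpointInfo);
    bool IsExecutingOperation(IEndpointInfoStub endpointInfo);
}

public static class GcDumpOperationSketch
{
    public static async Task CollectAsync(
        IOperationTrackerStub tracker,
        IEndpointInfoStub endpoint,
        Func<CancellationToken, Task> collectGcDumpCore,
        CancellationToken token)
    {
        // While this registration is alive, a CheckEndpointAsync-style prune check would
        // see an in-flight operation and skip the WaitForConnectionAsync timeout probe.
        using IDisposable registration = tracker.Register(endpoint);
        await collectGcDumpCore(token);
    }
}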

maciejsobocinski commented 2 years ago

@jander-msft both, I believe. For my initial ticket we had 4 tasks started up (each with its own dotnet monitor), hitting the /livemetrics endpoint every 5 minutes. About 12 hours in, dotnet-monitor started failing inside one of them. There was no gc dump run at the time per the logs (we have it set to auto gc dump at 70%); memory usage was around 17% with CPU basically idle. This is the log entry:

{
    "Timestamp": "2022-04-15T06:50:02.0842638Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:be2da2ba85a57e19, TraceId:8bd2f78ca29aec12951eabd01e36a7ef, ParentId:f2394daea1f71aab",
            "SpanId": "be2da2ba85a57e19",
            "TraceId": "8bd2f78ca29aec12951eabd01e36a7ef",
            "ParentId": "f2394daea1f71aab"
        },
        {
            "Message": "ConnectionId:0HMGU731FOFDE",
            "ConnectionId": "0HMGU731FOFDE"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDE:00000002",
            "RequestId": "0HMGU731FOFDE:00000002",
            "RequestPath": "/livemetrics"
        }
    ]
}

{
    "Timestamp": "2022-04-15T06:50:03.0758604Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
        "processId": "6",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

Subsequent requests to /livemetrics:

{
    "Timestamp": "2022-04-15T06:55:01.6363199Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:5f73f4ec6a4c2a06, TraceId:6e3bec22534dca3eed9ae13c8150dc0c, ParentId:0d6726492bd0e999",
            "SpanId": "5f73f4ec6a4c2a06",
            "TraceId": "6e3bec22534dca3eed9ae13c8150dc0c",
            "ParentId": "0d6726492bd0e999"
        },
        {
            "Message": "ConnectionId:0HMGU731FOFDF",
            "ConnectionId": "0HMGU731FOFDF"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDF:00000002",
            "RequestId": "0HMGU731FOFDF:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "cc79e4d4-794e-481f-8083-fb3f3c7b5ca5",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics",
            "ArtifactType": "livemetrics"
        }
    ]
}

At that point I disabled the /livemetrics polling but left the gc dump rule on 70% working-set in place. Four days later the rule hit and failed with the 'CollectGCDump' failed/unexpected timeout. It happened 3 times across different containers (the ones that failed were killed by our load balancer and restarted) within a few hours.

maciejsobocinski commented 2 years ago

I was able to somewhat repro this in our dev environment. I manually triggered a gc dump when memory usage was low on the task (660/4096 MB). The gc dump was successful (unlike in our prod environment, where it failed completely trying to do a gc dump with memory at 6678/8192 MB), but I also got the "Unexpected timeout" and further requests stopped working:

{
    "Timestamp": "2022-04-19T17:29:02.1838974Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:5d0a7651ef64ae52, TraceId:9087c513b99f163bfc5d9ad466a24d1c, ParentId:236c1fd495300808",
            "SpanId": "5d0a7651ef64ae52",
            "TraceId": "9087c513b99f163bfc5d9ad466a24d1c",
            "ParentId": "236c1fd495300808"
        },
        {
            "Message": "ConnectionId:0HMH22ID4PGJD",
            "ConnectionId": "0HMH22ID4PGJD"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH22ID4PGJD:00000002",
            "RequestId": "0HMH22ID4PGJD:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-19T17:29:02.4566427Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
        "processId": "7",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-04-19T17:29:02.4612395Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:c1de3bbadd104c679b6258ba69f40d69",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "c1de3bbadd104c679b6258ba69f40d69"
        }
    ]
}

{
    "Timestamp": "2022-04-19T17:29:02.7642136Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:c1de3bbadd104c679b6258ba69f40d69",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "c1de3bbadd104c679b6258ba69f40d69"
        }
    ]
}

{
    "Timestamp": "2022-04-19T17:29:04.1995862Z",
    "EventId": 5,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.OutputStreamResult",
    "Message": "Written to HTTP stream.",
    "State": {
        "Message": "Written to HTTP stream.",
        "{OriginalFormat}": "Written to HTTP stream."
    },
    "Scopes": [
        {
            "Message": "SpanId:5d0a7651ef64ae52, TraceId:9087c513b99f163bfc5d9ad466a24d1c, ParentId:236c1fd495300808",
            "SpanId": "5d0a7651ef64ae52",
            "TraceId": "9087c513b99f163bfc5d9ad466a24d1c",
            "ParentId": "236c1fd495300808"
        },
        {
            "Message": "ConnectionId:0HMH22ID4PGJD",
            "ConnectionId": "0HMH22ID4PGJD"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH22ID4PGJD:00000002",
            "RequestId": "0HMH22ID4PGJD:00000002",
            "RequestPath": "/gcdump"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "b011b956-5606-4899-a3e0-3b07d50140b9",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:c1de3bbadd104c679b6258ba69f40d69",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "c1de3bbadd104c679b6258ba69f40d69"
        }
    ]
}

{
    "Timestamp": "2022-04-19T17:30:04.0298055Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:8f51c0ee93ee397a, TraceId:0e45b3f64604b64e4836c8e12e0a6c29, ParentId:54634c8d96e978c7",
            "SpanId": "8f51c0ee93ee397a",
            "TraceId": "0e45b3f64604b64e4836c8e12e0a6c29",
            "ParentId": "54634c8d96e978c7"
        },
        {
            "Message": "ConnectionId:0HMH22ID4PGJD",
            "ConnectionId": "0HMH22ID4PGJD"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMH22ID4PGJD:00000003",
            "RequestId": "0HMH22ID4PGJD:00000003",
            "RequestPath": "/livemetrics"
        }
    ]
}

{
    "Timestamp": "2022-04-19T17:30:04.0309405Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:8f51c0ee93ee397a, TraceId:0e45b3f64604b64e4836c8e12e0a6c29, ParentId:54634c8d96e978c7",
            "SpanId": "8f51c0ee93ee397a",
            "TraceId": "0e45b3f64604b64e4836c8e12e0a6c29",
            "ParentId": "54634c8d96e978c7"
        },
        {
            "Message": "ConnectionId:0HMH22ID4PGJD",
            "ConnectionId": "0HMH22ID4PGJD"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMH22ID4PGJD:00000003",
            "RequestId": "0HMH22ID4PGJD:00000003",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "79857f94-77a3-4950-9596-99bb1e874bff",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics",
            "ArtifactType": "livemetrics"
        }
    ]
}
jander-msft commented 2 years ago

We want to eliminate (or possibly confirm) the possibility that the socket file has been deleted by something, which would prevent the runtime from subsequently reconnecting to the dotnet monitor tool.

maciejsobocinski commented 2 years ago

Hi,

Thank you for looking into this! I can confirm that /tmp/port still exists after the timeout. I did get slightly different behavior today, though. GCDump worked just fine several times on a container that was launched only 2 hours ago with next to no traffic going to it (in our dev environment). Eventually I did get the timeout in the logs, but the gcdump succeeded and further requests kept working (whereas yesterday I got 'unable to discover target process'). I did remove the collection rule, which was the one difference; I'm going to add it back and see if it changes anything (I'm testing by hitting the HTTP endpoint though, so not sure that it matters). I will update the settings to /diag as well and report back.

{
    "Timestamp": "2022-04-20T14:53:07.4891490Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:653424d1c4071427, TraceId:337acd08f914a23c757f608b79ab2b21, ParentId:0000000000000000",
            "SpanId": "653424d1c4071427",
            "TraceId": "337acd08f914a23c757f608b79ab2b21",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2P607OT3P",
            "ConnectionId": "0HMH2P607OT3P"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2P607OT3P:00000002",
            "RequestId": "0HMH2P607OT3P:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T14:53:08.0951240Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
        "processId": "7",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-04-20T14:53:08.0998805Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:8e064fa4979846d497866ec3cad84db3",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "8e064fa4979846d497866ec3cad84db3"
        }
    ]
}

{
    "Timestamp": "2022-04-20T14:53:08.1010130Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:8e064fa4979846d497866ec3cad84db3",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "8e064fa4979846d497866ec3cad84db3"
        }
    ]
}

{
    "Timestamp": "2022-04-20T14:53:13.4598955Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_145307_493_7.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_145307_493_7.gcdump",
        "location": "/tmp/gcdump/20220420_145307_493_7.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:8e064fa4979846d497866ec3cad84db3",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "8e064fa4979846d497866ec3cad84db3"
        }
    ]
}

{
    "Timestamp": "2022-04-20T14:54:18.5851347Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:417bbd375cbd0870, TraceId:157005f4185c24249dea2f87133b890f, ParentId:0000000000000000",
            "SpanId": "417bbd375cbd0870",
            "TraceId": "157005f4185c24249dea2f87133b890f",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2P607OT3Q",
            "ConnectionId": "0HMH2P607OT3Q"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2P607OT3Q:00000002",
            "RequestId": "0HMH2P607OT3Q:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T14:54:20.2243724Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_145418_585_7.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_145418_585_7.gcdump",
        "location": "/tmp/gcdump/20220420_145418_585_7.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:8e064fa4979846d497866ec3cad84db3",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "8e064fa4979846d497866ec3cad84db3"
        }
    ]
}
maciejsobocinski commented 2 years ago

I re-enabled the collection rule on high memory, then started making calls to the /gcdump endpoint (the rule never triggered). On a freshly launched container the first 10 or so gcdumps were successful, then it failed again with the timeout. This time around further requests are failing again. /tmp/port is still there. Not sure if it's random luck or because I re-enabled the rule (in my prior comment I was testing without the rule, got it to time out, but further requests kept working):

{
    "Timestamp": "2022-04-20T16:19:11.1758418Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:d08610fd79f115b9, TraceId:0333d198dc6ad02684724b0425818337, ParentId:0000000000000000",
            "SpanId": "d08610fd79f115b9",
            "TraceId": "0333d198dc6ad02684724b0425818337",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2T3ELTG56",
            "ConnectionId": "0HMH2T3ELTG56"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2T3ELTG56:00000002",
            "RequestId": "0HMH2T3ELTG56:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:13.8908825Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_161911_176_9.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_161911_176_9.gcdump",
        "location": "/tmp/gcdump/20220420_161911_176_9.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:9 ArtifactSource_RuntimeInstanceCookie:672855941e7f4fc2a9f5497110c9a870",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "9",
            "ArtifactSource_RuntimeInstanceCookie": "672855941e7f4fc2a9f5497110c9a870"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:17.4243122Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:a7d39e4d1c10fd33, TraceId:b3dec8791469c2c5776ef9026c041fac, ParentId:0000000000000000",
            "SpanId": "a7d39e4d1c10fd33",
            "TraceId": "b3dec8791469c2c5776ef9026c041fac",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2T3ELTG57",
            "ConnectionId": "0HMH2T3ELTG57"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2T3ELTG57:00000002",
            "RequestId": "0HMH2T3ELTG57:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:17.9434247Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 9. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 9. Process will no longer be monitored.",
        "processId": "9",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-04-20T16:19:17.9476126Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:9 TargetRuntimeInstanceCookie:672855941e7f4fc2a9f5497110c9a870",
            "TargetProcessId": "9",
            "TargetRuntimeInstanceCookie": "672855941e7f4fc2a9f5497110c9a870"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:18.1395307Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:9 TargetRuntimeInstanceCookie:672855941e7f4fc2a9f5497110c9a870",
            "TargetProcessId": "9",
            "TargetRuntimeInstanceCookie": "672855941e7f4fc2a9f5497110c9a870"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:20.7208460Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_161917_425_9.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_161917_425_9.gcdump",
        "location": "/tmp/gcdump/20220420_161917_425_9.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:9 ArtifactSource_RuntimeInstanceCookie:672855941e7f4fc2a9f5497110c9a870",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "9",
            "ArtifactSource_RuntimeInstanceCookie": "672855941e7f4fc2a9f5497110c9a870"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:38.0013028Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:33460be09a652da9, TraceId:a8fa906d28d743cbba5ba7fa142d9bce, ParentId:0000000000000000",
            "SpanId": "33460be09a652da9",
            "TraceId": "a8fa906d28d743cbba5ba7fa142d9bce",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2T3ELTG58",
            "ConnectionId": "0HMH2T3ELTG58"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2T3ELTG58:00000002",
            "RequestId": "0HMH2T3ELTG58:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T16:19:38.0032943Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 704 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 57",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:33460be09a652da9, TraceId:a8fa906d28d743cbba5ba7fa142d9bce, ParentId:0000000000000000",
            "SpanId": "33460be09a652da9",
            "TraceId": "a8fa906d28d743cbba5ba7fa142d9bce",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2T3ELTG58",
            "ConnectionId": "0HMH2T3ELTG58"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2T3ELTG58:00000002",
            "RequestId": "0HMH2T3ELTG58:00000002",
            "RequestPath": "/gcdump"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "fc42e51f-5d23-4dff-bef0-3a2f0d017235",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:gcdump",
            "ArtifactType": "gcdump"
        }
    ]
}
maciejsobocinski commented 2 years ago

So I moved the port to /diag/port and DumpsTempFolder to /dumps as the docs recommend. Same problem :-( On a newly spun up container the first couple of GC dumps work, then I get "unexpected timeout" (though that GC dump still succeeds and I can make subsequent requests), then it happens again with another "unexpected timeout", and finally it settles into "unable to discover a target process" (/diag/port still exists). The relevant configuration is sketched below, followed by the logs:
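
For reference, a minimal sketch of this kind of listen-mode setup, expressed as environment variables on the two containers. The exact key names and values here are assumptions drawn from the dotnet-monitor configuration docs rather than a copy of the actual task definition.

# dotnet-monitor sidecar container: listen on a Unix socket in a shared volume (assumed key names)
DotnetMonitor_DiagnosticPort__ConnectionMode=Listen
DotnetMonitor_DiagnosticPort__EndpointName=/diag/port
# temp folder used while dumps are being produced (assumed key name)
DotnetMonitor_Storage__DumpTempFolder=/dumps

# application container: connect back to the sidecar's socket
DOTNET_DiagnosticPorts=/diag/port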

{
    "Timestamp": "2022-04-20T18:50:07.4083755Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:8d03d15bc184093c, TraceId:f0e5b86b4163b69ac1a9173ac243f82f, ParentId:0000000000000000",
            "SpanId": "8d03d15bc184093c",
            "TraceId": "f0e5b86b4163b69ac1a9173ac243f82f",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVO",
            "ConnectionId": "0HMH2V9I11MVO"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2V9I11MVO:00000002",
            "RequestId": "0HMH2V9I11MVO:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.1526928Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
        "processId": "7",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-04-20T18:50:09.1659161Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.2524267Z",
    "EventId": 40,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Starting collection rules.",
    "State": {
        "Message": "Starting collection rules.",
        "{OriginalFormat}": "Starting collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.2525991Z",
    "EventId": 29,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' started.",
        "ruleName": "HighMemoryRule",
        "{OriginalFormat}": "Collection rule '{ruleName}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f",
            "CollectionRuleName": "HighMemoryRule"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.2527364Z",
    "EventId": 35,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' started.",
    "State": {
        "Message": "Collection rule 'HighMemoryRule' trigger 'EventCounter' started.",
        "ruleName": "HighMemoryRule",
        "triggerType": "EventCounter",
        "{OriginalFormat}": "Collection rule '{ruleName}' trigger '{triggerType}' started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f CollectionRuleName:HighMemoryRule",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f",
            "CollectionRuleName": "HighMemoryRule"
        },
        {
            "Message": "CollectionRuleTriggerType:EventCounter",
            "CollectionRuleTriggerType": "EventCounter"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.2793160Z",
    "EventId": 32,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules started.",
    "State": {
        "Message": "All collection rules started.",
        "{OriginalFormat}": "All collection rules started."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:09.3916058Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:11.0080567Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_185007_410_7.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_185007_410_7.gcdump",
        "location": "/tmp/gcdump/20220420_185007_410_7.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:11.0699909Z",
    "EventId": 5,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.OutputStreamResult",
    "Message": "Written to HTTP stream.",
    "State": {
        "Message": "Written to HTTP stream.",
        "{OriginalFormat}": "Written to HTTP stream."
    },
    "Scopes": [
        {
            "Message": "SpanId:ec6999cc44d5cd93, TraceId:b3ad826b2977b98b0d239fdd94164cba, ParentId:8eefce7af93b1945",
            "SpanId": "ec6999cc44d5cd93",
            "TraceId": "b3ad826b2977b98b0d239fdd94164cba",
            "ParentId": "8eefce7af93b1945"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVN",
            "ConnectionId": "0HMH2V9I11MVN"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMH2V9I11MVN:00000002",
            "RequestId": "0HMH2V9I11MVN:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "7285c1f5-03cd-4516-a55c-b35a48ba044a",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "ArtifactType": "livemetrics",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:30.6010840Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:7ca19d809cb0a1dc, TraceId:58a4a0511e3aaf467a748a19d2646acb, ParentId:0000000000000000",
            "SpanId": "7ca19d809cb0a1dc",
            "TraceId": "58a4a0511e3aaf467a748a19d2646acb",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVP",
            "ConnectionId": "0HMH2V9I11MVP"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2V9I11MVP:00000002",
            "RequestId": "0HMH2V9I11MVP:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:50:34.3168668Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_185030_606_7.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_185030_606_7.gcdump",
        "location": "/tmp/gcdump/20220420_185030_606_7.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:12.3892161Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:ad90d96e4a48dafe, TraceId:18da0341234ecb0cce826273da5385a6, ParentId:0000000000000000",
            "SpanId": "ad90d96e4a48dafe",
            "TraceId": "18da0341234ecb0cce826273da5385a6",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVR",
            "ConnectionId": "0HMH2V9I11MVR"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2V9I11MVR:00000002",
            "RequestId": "0HMH2V9I11MVR:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:12.8116760Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 7. Process will no longer be monitored.",
        "processId": "7",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-04-20T18:51:12.8119827Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:13.0701847Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:7 TargetRuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "TargetProcessId": "7",
            "TargetRuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:15.8297924Z",
    "EventId": 4,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.EgressOperation",
    "Message": "Egressed artifact to /tmp/gcdump/20220420_185112_394_7.gcdump",
    "State": {
        "Message": "Egressed artifact to /tmp/gcdump/20220420_185112_394_7.gcdump",
        "location": "/tmp/gcdump/20220420_185112_394_7.gcdump",
        "{OriginalFormat}": "Egressed artifact to {location}"
    },
    "Scopes": [
        {
            "Message": "ArtifactType:gcdump ArtifactSource_ProcessId:7 ArtifactSource_RuntimeInstanceCookie:b53e1d6ac48e4dcb8dcf59857e24a17f",
            "ArtifactType": "gcdump",
            "ArtifactSource_ProcessId": "7",
            "ArtifactSource_RuntimeInstanceCookie": "b53e1d6ac48e4dcb8dcf59857e24a17f"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:28.7445777Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:882994a4060b7795, TraceId:3c8ad5e70d400f8acc73f15c6e5e3af9, ParentId:0000000000000000",
            "SpanId": "882994a4060b7795",
            "TraceId": "3c8ad5e70d400f8acc73f15c6e5e3af9",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVS",
            "ConnectionId": "0HMH2V9I11MVS"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2V9I11MVS:00000002",
            "RequestId": "0HMH2V9I11MVS:00000002",
            "RequestPath": "/gcdump"
        }
    ]
}

{
    "Timestamp": "2022-04-20T18:51:28.7681109Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:882994a4060b7795, TraceId:3c8ad5e70d400f8acc73f15c6e5e3af9, ParentId:0000000000000000",
            "SpanId": "882994a4060b7795",
            "TraceId": "3c8ad5e70d400f8acc73f15c6e5e3af9",
            "ParentId": "0000000000000000"
        },
        {
            "Message": "ConnectionId:0HMH2V9I11MVS",
            "ConnectionId": "0HMH2V9I11MVS"
        },
        {
            "Message": "RequestPath:/gcdump RequestId:0HMH2V9I11MVS:00000002",
            "RequestId": "0HMH2V9I11MVS:00000002",
            "RequestPath": "/gcdump"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "ae104f35-1f39-4b8f-84f3-014747a5c8b0",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureGcDump (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:gcdump",
            "ArtifactType": "gcdump"
        }
    ]
}
SebastianStehle commented 2 years ago

I have the same problem. I installed dotnet-monitor on 2 deployments. One of them works fine, but the other loses the connection after some time.

I start the deployment, make a few GC dumps over the endpoint, then come back 30 minutes later and it no longer works.

EDIT: When I restart the monitor container (not the pod), it works again.

EDIT 2: The restart has worked three times now.

maciejsobocinski commented 2 years ago

I just saw that a new version came out with code changes to address this. I re-enabled the container in our development environment and it is looking good so far: I took 30+ GC dumps in a row to test, with no issues, whereas previously I could trigger the scenario above fairly quickly. I will deploy the change to production later this week to run it against real-world usage; hopefully it looks good there as well. Thank you!

Edit: Closing this ticket now so I don't forget, since I believe this is working. I'll re-open it if I see different results in our prod environment.

maciejsobocinski commented 2 years ago

So we just deployed the latest dotnet-monitor to prod. I re-enabled hitting /livemetrics every 5 minutes ... within 30 minutes or so, 1 out of 4 containers started getting the same error message from dotnet-monitor :-( No GC dumps were taken, and the container itself never stopped processing user requests successfully.
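
For context, the polling is essentially just a loop like the sketch below; the port, token handling, and output path are illustrative assumptions rather than the actual scheduler we use. The dotnet-monitor logs from the affected container follow.

# hypothetical polling loop against the dotnet-monitor sidecar
while true; do
  curl -sS -H "Authorization: Bearer $MONITOR_API_TOKEN" \
    "http://localhost:52323/livemetrics?durationSeconds=5" \
    -o "/tmp/livemetrics_$(date +%Y%m%d_%H%M%S).json"
  sleep 300   # every 5 minutes
done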

{
    "Timestamp": "2022-07-01T14:00:01.0111921Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:9fa40d2041202592, TraceId:ae38f7c3e4d83e8559019aee39ea50d9, ParentId:dbe31cb4d270f899",
            "SpanId": "9fa40d2041202592",
            "TraceId": "ae38f7c3e4d83e8559019aee39ea50d9",
            "ParentId": "dbe31cb4d270f899"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GF",
            "ConnectionId": "0HMIRCMP5M5GF"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GF:00000002",
            "RequestId": "0HMIRCMP5M5GF:00000002",
            "RequestPath": "/livemetrics"
        }
    ]
}
{
    "Timestamp": "2022-07-01T14:00:08.1066647Z",
    "EventId": 5,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.OutputStreamResult",
    "Message": "Written to HTTP stream.",
    "State": {
        "Message": "Written to HTTP stream.",
        "{OriginalFormat}": "Written to HTTP stream."
    },
    "Scopes": [
        {
            "Message": "SpanId:9fa40d2041202592, TraceId:ae38f7c3e4d83e8559019aee39ea50d9, ParentId:dbe31cb4d270f899",
            "SpanId": "9fa40d2041202592",
            "TraceId": "ae38f7c3e4d83e8559019aee39ea50d9",
            "ParentId": "dbe31cb4d270f899"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GF",
            "ConnectionId": "0HMIRCMP5M5GF"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GF:00000002",
            "RequestId": "0HMIRCMP5M5GF:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "8accbb03-0591-4ce4-8407-5a631e02c09d",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics ArtifactSource_ProcessId:8 ArtifactSource_RuntimeInstanceCookie:08449004064f4286bffdf0f33935d4b3",
            "ArtifactType": "livemetrics",
            "ArtifactSource_ProcessId": "8",
            "ArtifactSource_RuntimeInstanceCookie": "08449004064f4286bffdf0f33935d4b3"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:05:02.0873006Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:30687ae247c20cfb, TraceId:ef5426eeaf912d015c59f0eebf0bf9e5, ParentId:2d02014ed370a9e5",
            "SpanId": "30687ae247c20cfb",
            "TraceId": "ef5426eeaf912d015c59f0eebf0bf9e5",
            "ParentId": "2d02014ed370a9e5"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GG",
            "ConnectionId": "0HMIRCMP5M5GG"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GG:00000002",
            "RequestId": "0HMIRCMP5M5GG:00000002",
            "RequestPath": "/livemetrics"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:05:08.0963972Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 8. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 8. Process will no longer be monitored.",
        "processId": "8",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

{
    "Timestamp": "2022-07-01T14:05:08.0970551Z",
    "EventId": 5,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.OutputStreamResult",
    "Message": "Written to HTTP stream.",
    "State": {
        "Message": "Written to HTTP stream.",
        "{OriginalFormat}": "Written to HTTP stream."
    },
    "Scopes": [
        {
            "Message": "SpanId:30687ae247c20cfb, TraceId:ef5426eeaf912d015c59f0eebf0bf9e5, ParentId:2d02014ed370a9e5",
            "SpanId": "30687ae247c20cfb",
            "TraceId": "ef5426eeaf912d015c59f0eebf0bf9e5",
            "ParentId": "2d02014ed370a9e5"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GG",
            "ConnectionId": "0HMIRCMP5M5GG"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GG:00000002",
            "RequestId": "0HMIRCMP5M5GG:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "8accbb03-0591-4ce4-8407-5a631e02c09d",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics ArtifactSource_ProcessId:8 ArtifactSource_RuntimeInstanceCookie:08449004064f4286bffdf0f33935d4b3",
            "ArtifactType": "livemetrics",
            "ArtifactSource_ProcessId": "8",
            "ArtifactSource_RuntimeInstanceCookie": "08449004064f4286bffdf0f33935d4b3"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:05:08.1019287Z",
    "EventId": 44,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "Stopping collection rules.",
    "State": {
        "Message": "Stopping collection rules.",
        "{OriginalFormat}": "Stopping collection rules."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:8 TargetRuntimeInstanceCookie:08449004064f4286bffdf0f33935d4b3",
            "TargetProcessId": "8",
            "TargetRuntimeInstanceCookie": "08449004064f4286bffdf0f33935d4b3"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:05:08.2234506Z",
    "EventId": 45,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService",
    "Message": "All collection rules have stopped.",
    "State": {
        "Message": "All collection rules have stopped.",
        "{OriginalFormat}": "All collection rules have stopped."
    },
    "Scopes": [
        {
            "Message": "TargetProcessId:8 TargetRuntimeInstanceCookie:08449004064f4286bffdf0f33935d4b3",
            "TargetProcessId": "8",
            "TargetRuntimeInstanceCookie": "08449004064f4286bffdf0f33935d4b3"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:10:03.5723478Z",
    "EventId": 2,
    "LogLevel": "Information",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.LiveJwtBearerHandler",
    "Message": "Successfully validated the token.",
    "State": {
        "Message": "Successfully validated the token.",
        "{OriginalFormat}": "Successfully validated the token."
    },
    "Scopes": [
        {
            "Message": "SpanId:cab9dde8f67d58fb, TraceId:f466c3e8ebb202208a5ec6d42f013591, ParentId:263c08c90ec8929c",
            "SpanId": "cab9dde8f67d58fb",
            "TraceId": "f466c3e8ebb202208a5ec6d42f013591",
            "ParentId": "263c08c90ec8929c"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GH",
            "ConnectionId": "0HMIRCMP5M5GH"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GH:00000002",
            "RequestId": "0HMIRCMP5M5GH:00000002",
            "RequestPath": "/livemetrics"
        }
    ]
}

{
    "Timestamp": "2022-07-01T14:10:03.5739343Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 94    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 719 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 57",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:cab9dde8f67d58fb, TraceId:f466c3e8ebb202208a5ec6d42f013591, ParentId:263c08c90ec8929c",
            "SpanId": "cab9dde8f67d58fb",
            "TraceId": "f466c3e8ebb202208a5ec6d42f013591",
            "ParentId": "263c08c90ec8929c"
        },
        {
            "Message": "ConnectionId:0HMIRCMP5M5GH",
            "ConnectionId": "0HMIRCMP5M5GH"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMIRCMP5M5GH:00000002",
            "RequestId": "0HMIRCMP5M5GH:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "8accbb03-0591-4ce4-8407-5a631e02c09d",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics",
            "ArtifactType": "livemetrics"
        }
    ]
}
xsoheilalizadeh commented 1 year ago

I have the same issue while trying to get a dump.

{"Timestamp":"2022-10-13T11:00:30.8691337Z","EventId":5,"LogLevel":"Information","Category":"Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController","Message":"Written to HTTP stream.","State":{"Message":"Written to HTTP stream.","{OriginalFormat}":"Written to HTTP stream."},"Scopes":[{"Message":"SpanId:fa17baa5e04a3e29, TraceId:8ba32a3ccdfaf56225fbef1f131d92da, ParentId:0000000000000000","SpanId":"fa17baa5e04a3e29","TraceId":"8ba32a3ccdfaf56225fbef1f131d92da","ParentId":"0000000000000000"},{"Message":"ConnectionId:0HMLCVOLKK104","ConnectionId":"0HMLCVOLKK104"},{"Message":"RequestPath:/processes RequestId:0HMLCVOLKK104:00000002","RequestId":"0HMLCVOLKK104:00000002","RequestPath":"/processes"},{"Message":"Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.GetProcesses (Microsoft.Diagnostics.Monitoring.WebApi)","ActionId":"77d736b1-5d9e-4014-b50b-846d360b12c6","ActionName":"Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.GetProcesses (Microsoft.Diagnostics.Monitoring.WebApi)"}]}
{"Timestamp":"2022-10-13T11:00:56.6623617Z","EventId":13,"LogLevel":"Error","Category":"Microsoft.AspNetCore.Server.Kestrel","Message":"Connection id \u00220HMLCVOLKK104\u0022, Request id \u00220HMLCVOLKK104:00000003\u0022: An unhandled exception was thrown by the application.","Exception":"System.IO.EndOfStreamException: Attempted to read past the end of the stream.    at Microsoft.Diagnostics.NETCore.Client.StreamExtensions.ReadBytesAsync(Stream stream, Int32 length, CancellationToken cancellationToken)    at Microsoft.Diagnostics.NETCore.Client.IpcHeader.ParseAsync(Stream stream, CancellationToken cancellationToken)    at Microsoft.Diagnostics.NETCore.Client.IpcMessage.ParseAsync(Stream stream, CancellationToken cancellationToken)    at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessageGetContinuationAsync(IpcEndpoint endpoint, IpcMessage message, CancellationToken cancellationToken)    at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessageAsync(IpcEndpoint endpoint, IpcMessage message, CancellationToken cancellationToken)    at Microsoft.Diagnostics.NETCore.Client.DiagnosticsClient.WriteDumpAsync(DumpType dumpType, String dumpPath, WriteDumpFlags flags, CancellationToken token)    at Microsoft.Diagnostics.Monitoring.WebApi.DumpService.DumpAsync(IEndpointInfo endpointInfo, DumpType mode, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DumpService.cs:line 77    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.\u003C\u003Ec__DisplayClass15_0.\u003C\u003CCaptureDump\u003Eb__0\u003Ed.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 223 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.\u003C\u003Ec__DisplayClass31_0.\u003C\u003CInvokeForProcess\u003Eb__0\u003Ed.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 702 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.\u003C\u003Ec__DisplayClass33_0\u00601.\u003C\u003CInvokeForProcess\u003Eb__0\u003Ed.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 734 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func\u00601 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.InvokeForProcess[T](Func\u00602 func, Nullable\u00601 processKey, String artifactType) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 724    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.InvokeForProcess(Func\u00602 func, Nullable\u00601 processKey, String artifactType) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 702    at lambda_method15(Closure , Object )    at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)    at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.\u003CInvokeActionMethodAsync\u003Eg__Awaited|12_0(ControllerActionInvoker invoker, ValueTask\u00601 actionResultValueTask)    at 
Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.\u003CInvokeNextActionFilterAsync\u003Eg__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)    at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State\u0026 next, Scope\u0026 scope, Object\u0026 state, Boolean\u0026 isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.\u003CInvokeInnerFilterAsync\u003Eg__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.\u003CInvokeNextExceptionFilterAsync\u003Eg__Awaited|26_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ExceptionContextSealed context)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State\u0026 next, Scope\u0026 scope, Object\u0026 state, Boolean\u0026 isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.\u003CInvokeFilterPipelineAsync\u003Eg__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.\u003CInvokeAsync\u003Eg__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)    at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.\u003CInvokeAsync\u003Eg__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)    at Microsoft.AspNetCore.Routing.EndpointMiddleware.\u003CInvoke\u003Eg__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)    at Microsoft.Diagnostics.Tools.Monitor.RequestLimitMiddleware.Invoke(HttpContext context) in /_/src/Tools/dotnet-monitor/RequestLimitMiddleware.cs:line 73    at Microsoft.AspNetCore.Authorization.Policy.AuthorizationMiddlewareResultHandler.HandleAsync(RequestDelegate next, HttpContext context, AuthorizationPolicy policy, PolicyAuthorizationResult authorizeResult)    at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)    at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication\u00601 application)","State":{"Message":"Connection id \u00220HMLCVOLKK104\u0022, Request id \u00220HMLCVOLKK104:00000003\u0022: An unhandled exception was thrown by the application.","ConnectionId":"0HMLCVOLKK104","TraceIdentifier":"0HMLCVOLKK104:00000003","{OriginalFormat}":"Connection id \u0022{ConnectionId}\u0022, Request id \u0022{TraceIdentifier}\u0022: An unhandled exception was thrown by the application."},"Scopes":[{"Message":"SpanId:a2b8f55f3888f452, TraceId:19aae55dbdb5a27ca561e138731fd0e2, ParentId:0000000000000000","SpanId":"a2b8f55f3888f452","TraceId":"19aae55dbdb5a27ca561e138731fd0e2","ParentId":"0000000000000000"},{"Message":"ConnectionId:0HMLCVOLKK104","ConnectionId":"0HMLCVOLKK104"},{"Message":"RequestPath:/dump RequestId:0HMLCVOLKK104:00000003","RequestId":"0HMLCVOLKK104:00000003","RequestPath":"/dump"}]}
{"Timestamp":"2022-10-13T11:00:58.2183792Z","EventId":52,"LogLevel":"Warning","Category":"Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource","Message":"Unexpected timeout from process 13. Process will no longer be monitored.","State":{"Message":"Unexpected timeout from process 13. Process will no longer be monitored.","processId":"13","{OriginalFormat}":"Unexpected timeout from process {processId}. Process will no longer be monitored."},"Scopes":[]}
{"Timestamp":"2022-10-13T11:00:58.2221081Z","EventId":44,"LogLevel":"Information","Category":"Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService","Message":"Stopping collection rules.","State":{"Message":"Stopping collection rules.","{OriginalFormat}":"Stopping collection rules."},"Scopes":[{"Message":"TargetProcessId:13 TargetRuntimeInstanceCookie:2dd906d0646846b9b3ce0b2f07b07a68","TargetProcessId":"13","TargetRuntimeInstanceCookie":"2dd906d0646846b9b3ce0b2f07b07a68"}]}
{"Timestamp":"2022-10-13T11:00:58.2237482Z","EventId":45,"LogLevel":"Information","Category":"Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService","Message":"All collection rules have stopped.","State":{"Message":"All collection rules have stopped.","{OriginalFormat}":"All collection rules have stopped."},"Scopes":[{"Message":"TargetProcessId:13 TargetRuntimeInstanceCookie:2dd906d0646846b9b3ce0b2f07b07a68","TargetProcessId":"13","TargetRuntimeInstanceCookie":"2dd906d0646846b9b3ce0b2f07b07a68"}]}
gest01 commented 1 year ago

@xsoheilalizadeh

I also had exactly the same problem: System.IO.EndOfStreamException: Attempted to read past the end of the stream.

/gcdump and /trace worked perfectly; only with /dump did the same error occur.

It turned out that the resources for the target container were too tight, and the pod crashed immediately when I invoked the /dump endpoint (maybe some kind of out-of-memory kill).

After I removed the resource requests/limits from the YAML, everything worked as desired :-)

kasperk81 commented 1 year ago

On Ubuntu 22.04 with .NET 7:

# terminal 1
dotnet tool install --global dotnet-monitor
dotnet monitor collect -u http://localhost:52324 --no-auth --diagnostic-port /tmp/myport.sock

# terminal 2
dotnet publish -c Release -o pub --self-contained
DOTNET_DiagnosticPorts=/tmp/myport.sock  pub/helloworldapp

In terminal 1 the logs showed the target process ID as 15180, but it timed out:

06:15:32 warn: Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource[52] Unexpected timeout from process 15180. Process will no longer be monitored.
06:15:32 info: Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService[44] => TargetProcessId:16045 TargetRuntimeInstanceCookie:3124459885fd4a40b0709b167d709e50 Stopping collection rules.
06:15:32 info: Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService[45] => TargetProcessId:15180 TargetRuntimeInstanceCookie:3124459885fd4a40b0709b167d709e50 All collection rules have stopped.

I opened http://localhost:52324/logs?pid=15180 and got this:

{"status":400,"detail":"Unable to discover a target process."}

The same happens for /trace?pid=..; I've tested multiple times and get the same result with different PIDs.

Workaround:

# shutdown dotnet-monitor in terminal 1 and use dotnet-trace instead
dotnet tool install --global dotnet-trace
dotnet-trace collect --diagnostic-port /tmp/myport.sock --format speedscope

# terminal 2
DOTNET_DiagnosticPorts=/tmp/myport.sock  pub/helloworldapp

This works flawlessly; I can visualize the generated report at https://www.speedscope.app/.

So this has to be a dotnet-monitor bug.

jander-msft commented 1 year ago

"So this has to be a dotnet-monitor bug."

That doesn't prove very much in this case:

Previous reports suggest that the target application stops being monitored after about 30 minutes; at 3 seconds per liveliness probe, that is roughly 600 probes (30 minutes / 3 seconds) before the failure to respond. That's far more socket connections than dotnet-trace ever makes.

My current theory is that the target runtime doesn't always respond correctly when a diagnostic connection is terminated without a command (e.g. the liveliness probe), but I haven't been able to reproduce the issue yet.

If anyone can reproduce the problem and capture a dump of both dotnet-monitor and the target process, that would be very helpful.
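
For anyone attempting this, one possible way to capture both dumps is with the dotnet-dump global tool run inside (or alongside) each container; it needs access to each process's diagnostic IPC channel, so it must share the container or namespace. The PIDs and output paths below are placeholders.

# install the dotnet-dump global tool (one time)
dotnet tool install --global dotnet-dump

# list the .NET processes visible here to find the PIDs
dotnet-dump ps

# dump the target application (replace 7 with its PID)
dotnet-dump collect -p 7 -o /tmp/target-app.dmp

# dump the dotnet-monitor process itself (replace 1 with its PID)
dotnet-dump collect -p 1 -o /tmp/dotnet-monitor.dmp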