dotnet / dotnet-monitor

This repository contains the source code for .NET Monitor - a tool that allows you to gather diagnostic data from running applications using HTTP endpoints
MIT License
638 stars 112 forks source link

AspNetResponseStatus Not Triggering As Expected #1107

Open kkeirstead opened 2 years ago

kkeirstead commented 2 years ago

Note: I haven't determined the root cause of this issue, but I wanted to share the behavior I was noticing as well as how to repro it.

Using a simple configuration of the AspNetResponseStatus trigger, the collection rule doesn't always trigger when it should. In basic testing, it appears that instead of triggering on each instance of a 404 status code, it generally triggers on every third instance. I first noticed this using the configuration below on a dummy "FirstWebApp" application by entering an invalid path (e.g. localhost:5002/abc). During debugging, it appears that the traceEvents for a stopped activity (with the corresponding 404 status code) aren't being checked until several more requests have occurred.

To repro, below is the collection rule configuration I was using (with a local basic web app). Try a URL that won't work (such as localhost:5002/abc), and continue repeatedly trying invalid URLs. Pay attention to when the collection rule is successfully triggered and a dump is actually egressed.

Especially since I haven't looked far enough into it to find a solution, it's possible this is due to an issue with how I've configured this file, but I wanted to list something for it before investigating further. If you have any questions or can't get it to repro, let me know and I can make a screen recording of the behavior I'm seeing.

  "CollectionRules": {
    "BadResponseStatus": {
      "Trigger": {
        "Type": "AspNetResponseStatus",
        "Settings": {
          "ResponseCount": 1,
          "StatusCodes": [
            "400-499"
          ],
          "SlidingWindowDuration": "00:00:05"
        }
      },
      "Actions": [
        {
          "Type": "CollectDump",
          "Settings": {
            "Egress": "artifacts",
            "Type": "Full"
          }
        }
      ],
      "Limits":{
        "ActionCount": 1000,
        "ActionCountSlidingWindowDuration": "00:30:00"
      }
    }
  }
jander-msft commented 2 years ago

Depending on the timing of the requests you are sending, you may be sending requests that you expect to trigger the collection while the rule is not monitoring for the trigger condition. For example, when the trigger condition is satisfied, the trigger is turned off and the entire action list is processed; when the action list is finished executing, the trigger is started again. If you attempt to make requests that you think should satisfy the trigger condition during the trigger between when the trigger was turned off and when it eventually is turned on again, those requests will not be counted.

kkeirstead commented 2 years ago

https://user-images.githubusercontent.com/85592574/140425835-1fe3343a-1d35-40af-949c-3d83c100ae01.mp4

Let's see if this works; I dropped a screen recording (there's no sound, but it should be easy to follow with the cursor). It doesn't feel like this would be the expected behavior, but I could be missing something.

kkeirstead commented 2 years ago

Justin and I took a look at this together yesterday; the current theory is that there may be a buffer issue in the runtime requiring approximately 1-3 additional requests beyond the ResponseCount to trigger the collection rule (and that the problem is not in dotnet monitor). I was able to get this to repro with net5.0 and net6.0 applications (collection rules are not supported on netcoreapp3.1).

jander-msft commented 2 years ago

FYI, @josalem, the trace event library doesn't invoke the registered callback until more events are put into the event pipe. Putting one event into the pipe and waiting for a while (about a minute) wasn't enough for us to observe the event callback.

josalem commented 2 years ago

This sounds like an issue I identified in TraceEvent last year. If it is the same issue, it is because the reader in EventPipeEventSource requires 8-byte alignment for its reads. EventPipe does not guarantee that event blocks will be 8-byte aligned due to variable length encoding of the headers. Different processors seem to impact the chances of this happening due to differing values in their hardware counters which are used for the timestamps in headers. The reader won't dispatch events in a block until it completes the read. This can result in a complete EventBlock sitting in the readers buffer until it receives the next event.

https://github.com/microsoft/perfview/issues/1447