Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
739 stars 493 forks source link

System info failures log to diagnostics #3228

Open j82w opened 2 years ago

j82w commented 2 years ago

There has been several instance where the diagnostics does not have the CPU history, and in some scenarios it's missing entries. The history is every 10 seconds, but then one is missing.

To understand why the CPU is not collected the exception message should be logged to the diagnostics to help troubleshoot this issue.

https://msdata.visualstudio.com/_git/CosmosDB?path=/Product/Microsoft.Azure.Documents/SharedFiles/Rntbd2/SystemUsageMonitor.cs

Example part of the diagnostics where it is missing:

                {
                    "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
                    "id": "a7b550b7-4a1e-4daf-b10e-37911ad27ad3",
                    "start time": "03: 26: 33: 093",
                    "duration in milliseconds": 84096.2858,
                    "children": [
                        {
                            "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
                            "id": "c7112122-e7c4-4b67-9fbc-5b95699fdeae",
                            "start time": "03: 26: 33: 093",
                            "duration in milliseconds": 84096.1677,
                            "children": [
                                {
                                    "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
                                    "id": "96dffd72-7842-4365-87f1-be6c762b53d0",
                                    "start time": "03: 26: 33: 093",
                                    "duration in milliseconds": 84096.0744,
                                    "children": [
                                        {

With the change the JSON should look something like:

                {
                    "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
                    "id": "a7b550b7-4a1e-4daf-b10e-37911ad27ad3",
                    "start time": "03: 26: 33: 093",
                    "duration in milliseconds": 84096.2858,
                    "children": [
                        {
                            "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
                            "id": "c7112122-e7c4-4b67-9fbc-5b95699fdeae",
                            "start time": "03: 26: 33: 093",
                            "duration in milliseconds": 84096.1677,
                            "data":{ 
                                 "System info failure": "Some exception message"
                            },
                            "children": [
                                {
                                    "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
                                    "id": "96dffd72-7842-4365-87f1-be6c762b53d0",
                                    "start time": "03: 26: 33: 093",
                                    "duration in milliseconds": 84096.0744,
                                    "children": [
                                        {
j82w commented 2 years ago

Example of the diagnostics missing entries. Notice it's still being collected at 10 second interval, but some are missing. This points to a possible bug that is getting only logged to traces. It should be added to the diagnostics so the problem can be root caused and fixed.

"System Info": {
  "systemHistory": [
      {
          "dateUtc": "2022-05-25T12:02:17.7081784Z",
          "cpu": 4.253,
          "memory": 5034956.000,
          "threadInfo": {
              "isThreadStarving": "False",
              "threadWaitIntervalInMs": 0.3737,
              "availableThreads": 32766,
              "minThreads": 2,
              "maxThreads": 32767
          }
      },
      {
          "dateUtc": "2022-05-25T12:02:37.7050251Z",
          "cpu": 4.808,
          "memory": 5035208.000,
          "threadInfo": {
              "isThreadStarving": "False",
              "threadWaitIntervalInMs": 0.2534,
              "availableThreads": 32766,
              "minThreads": 2,
              "maxThreads": 32767
          }
      },
      {
          "dateUtc": "2022-05-25T12:02:57.7051946Z",
          "cpu": 4.394,
          "memory": 5035588.000,
          "threadInfo": {
              "isThreadStarving": "False",
              "threadWaitIntervalInMs": 0.0203,
              "availableThreads": 32766,
              "minThreads": 2,
              "maxThreads": 32767
          }
      },
      {
          "dateUtc": "2022-05-25T12:03:07.7095160Z",
          "cpu": 5.220,
          "memory": 5035444.000,
          "threadInfo": {
              "isThreadStarving": "False",
              "threadWaitIntervalInMs": 0.3052,
              "availableThreads": 32766,
              "minThreads": 2,
              "maxThreads": 32767
          }
      },
      {
          "dateUtc": "2022-05-25T12:03:17.7124857Z",
          "cpu": 3.497,
          "memory": 5034220.000,
          "threadInfo": {
              "isThreadStarving": "False",
              "threadWaitIntervalInMs": 0.0296,
              "availableThreads": 32766,
              "minThreads": 2,
              "maxThreads": 32767
          }
      },