dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Orleans Silo Telemetry docs/guidance? #3065

Open centur opened 7 years ago

centur commented 7 years ago

Hi all, is there any basic guidance for Orleans internal tracking metrics? For example, our AppInsights instance received 382 different telemetry counters from the Orleans silo. Some of them make sense for us, some not so much. How can we figure out which telemetry metrics are worth tracking and which are nitpicking and can be ignored/omitted? And what is the convention used for some of these events? Just to start with a few:

Storage.Activate.Errors.Current
Storage.Activate.Total.Current
Storage.Activate.Total.Delta

Storage.Azure.Table.ServerBusy.Current
Storage.Azure.Table.ServerBusy.Delta

Storage.Clear.Errors.Current
Storage.Clear.Latency.Current
Storage.Clear.Total.Current

Storage.Read.Errors.Current
Storage.Read.Errors.Delta
Storage.Read.Latency.Current
Storage.Read.Total.Current
Storage.Read.Total.Delta

Storage.Write.Errors.Current
Storage.Write.Errors.Delta
Storage.Write.Latency.Current
Storage.Write.Total.Current
Storage.Write.Total.Delta

What is Current and what is Delta for Storage.Write? What is the value of knowing a delta? A delta over which period - since the last write to storage, or since telemetry was last reported? What units are used - bytes, KBs, MBs?

sergeybykov commented 7 years ago

Hi all, is there any basic guidance for Orleans internal tracking metrics?

There isn't, unfortunately. Metrics were originally added as low-level counters for the dev team to look into during investigations, so they've been added to various components over the years without any top-down hierarchical approach. Some of them are pretty clear, and some are rather opaque. It would help to document the former, I think.

What is current and what is Delta for Storage.Write?

In general, Current is the latest value of the counter and Delta is the change in its value since the last time metrics were reported. By default metrics are reported every 5 minutes. So deltas give a rough idea of the counters' velocity.

What units are used - Bytes KBs MBs?

Storage.* counters except for .Latency simply count invocations of the corresponding methods and their failures.
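
To make the Current/Delta distinction concrete, here is a minimal sketch (illustrative only, not Orleans' internal implementation; the class name is made up) of a counter that keeps a cumulative Current value and reports the Delta accumulated since the previous reporting cycle:

```csharp
// Illustrative sketch only (not Orleans' internal implementation; the class name
// is made up): a counter exposing a cumulative "Current" value plus a "Delta"
// computed against the previous reporting cycle.
using System.Threading;

public class CurrentDeltaCounter
{
    private long _current;       // cumulative value since the counter was created
    private long _lastReported;  // value captured at the previous report

    // Called on every tracked event, e.g. each storage write or each write failure.
    public void Increment() => Interlocked.Increment(ref _current);

    // Called by a periodic reporter (every 5 minutes by default, per the answer above).
    public (long Current, long Delta) Report()
    {
        long current = Interlocked.Read(ref _current);
        long delta = current - _lastReported;  // change since the last report
        _lastReported = current;
        return (current, delta);
    }
}
```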

centur commented 7 years ago

Are there any counters that would make sense to monitor in order to proactively identify problems in a product? So far (based on what seems worth being aware of) we are looking at Runtime.GC.PercentOfTimeInGC.Current and comparing it with the previous week, and at AgeOfMessagesBeingProcessed - which sounds ok, but there is no indication what units it uses - milliseconds or ticks?

We are also tracking GC.LargeObjectHeapSizeKb and TotalMemoryKb, but we're not sure they make much sense for us - they are pretty low.

Catalog.Activation.CurrentCount.Current
Catalog.Activation.DuplicateActivations.Current

These two kind of make sense, but we can't interpret them - is a value of 13.45 in the last one good? Bad? Or irrelevant for the health of the cluster?

These two probably make no sense, as we are looking at the wrong thread pool:

Runtime.DOT.NET.ThreadPool.InUse.CompletionPortThreads.Current
Runtime.DOT.NET.ThreadPool.InUse.WorkerThreads.Current

Any other interesting counters?

PS: I've attached a file with the counters we scraped from App Insights (not from the actual code, because I don't know what is really used there) in case anyone is interested in seeing the other events: OrleansTelemetry.txt

sergeybykov commented 7 years ago

Here are some statistics that are good to keep an eye on.

App.Requests.Latency.Average.Millis - the average is crude, but might be useful
App.Requests.LatencyHistogram.Millis - better data on latencies, but hard to monitor
App.Requests.TimedOut.Current
Catalog.Activation.CurrentCount
Grain.*.Current
Runtime.CpuUsage
Runtime.GC.GenSizesKb - gen2 in particular
Runtime.GC.LargeObjectHeapSizeKb
Runtime.GC.PercentOfTimeInGC
Scheduler.NumLongRunningTurns.Current - turns that exceed 200ms by default

centur commented 7 years ago

Thanks, tracking these (except Grain.* - too many grains; worth an analytics query over all counters to see the hottest ones).

mohamedhammad commented 6 years ago

What is the meaning and probable cause of Scheduler.NumLongRunningTurns.Current always increasing? At some point the silo actually crashes, so I want to know what can cause this counter to keep increasing. This is a case I am already hitting.

pipermatt commented 4 years ago

@mohamedhammad Hopefully you've found the answer to this by now, but wanted others who are searching to be able to find an answer...

We've run across this same problem here early in our production experience with Orleans. We found that we had a grain occasionally deadlocking. We're assuming that eventually, enough deadlocked grains accumulated that they used up the entire thread pool and all requests to the silo started timing out.
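
For anyone searching later, here is a hypothetical sketch (grain and method names are made up, and this is not the code from the deployment discussed here) of one pattern that can wedge a silo this way. Blocking synchronously on an async grain call holds a scheduler thread for the whole wait, so the turn shows up in Scheduler.NumLongRunningTurns.Current; and if the awaited call ever cycles back into this (non-reentrant) grain, it can never complete and the thread stays blocked:

```csharp
// Hypothetical illustration only; grain and method names are made up.
using System.Threading.Tasks;
using Orleans;

public interface IOtherGrain : IGrainWithIntegerKey
{
    Task<int> ComputeAsync();
}

public interface IBlockingGrain : IGrainWithIntegerKey
{
    Task<int> GetValue();
}

public class BlockingGrain : Grain, IBlockingGrain
{
    public Task<int> GetValue()
    {
        IOtherGrain other = GrainFactory.GetGrain<IOtherGrain>(0);

        // BAD: sync-over-async. Instead of yielding the turn with await, the
        // calling thread blocks until the remote call completes, producing a
        // long-running turn - and a permanent deadlock if the callee ends up
        // calling back into this grain.
        int value = other.ComputeAsync().GetAwaiter().GetResult();

        return Task.FromResult(value);
    }
}
```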

If your grain is NOT deadlocking, but just has high-concurrency, long-running requests... Orleans may not be the best solution for implementing that particular function.

ReubenBond commented 4 years ago

Long-running requests can be moved to StatelessWorkers or onto the default TaskScheduler, or executed in the background on the current grain so that the method returns promptly. A separate call or a callback can be used to check for progress/status.
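
A minimal sketch of the "execute in the background and return promptly" option (grain and method names are made up for illustration; a real implementation would also want error handling around the background task):

```csharp
// Minimal sketch; ILongWorkGrain, StartWork, and GetStatus are hypothetical names.
using System.Threading.Tasks;
using Orleans;

public interface ILongWorkGrain : IGrainWithGuidKey
{
    Task StartWork();        // returns as soon as the work has been kicked off
    Task<bool> GetStatus();  // separate call used to poll for completion
}

public class LongWorkGrain : Grain, ILongWorkGrain
{
    private Task _work;

    public Task StartWork()
    {
        // Task.Run moves the long-running work onto the default .NET ThreadPool,
        // so this turn finishes promptly and the grain stays responsive.
        // Note: the background delegate must not touch grain state, since it
        // runs outside the grain's scheduler.
        _work = Task.Run(() => DoExpensiveComputation());
        return Task.CompletedTask;
    }

    public Task<bool> GetStatus() =>
        Task.FromResult(_work != null && _work.IsCompleted);

    private static void DoExpensiveComputation()
    {
        // ... CPU-bound or otherwise long-running work ...
    }
}
```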

If a given turn is running for a long time, that means that some code was synchronously running (the thread was busy/waiting) for that period of time. In that case, the issue could be lock contention, blocking IO (etc), or long-running CPU computations.

We could add an option to avoid the in-built thread pool and instead schedule all work on the default ThreadPool (which is what the default TaskScheduler schedules on) and under that option there will not be a need for these kinds of warnings. Users can get themselves into trouble by blocking shared threads for long periods, but that's not Orleans-specific and so a warning is likely not warranted if the default ThreadPool is used.

sergeybykov commented 4 years ago

We could add an option to avoid the in-built thread pool and instead schedule all work on the default ThreadPool (which is what the default TaskScheduler schedules on) and under that option there will not be a need for these kinds of warnings. Users can get themselves into trouble by blocking shared threads for long periods, but that's not Orleans-specific and so a warning is likely not warranted if the default ThreadPool is used.

In 3.2.0 we switched to using the default thread pool.