centur opened this issue 7 years ago
Hi all, is there any basic guidance for Orleans' internal tracking metrics? Our AppInsights instance received 382 different telemetry counters from an Orleans silo. Some of them make sense to us, some not so much. How can we figure out which metrics are worth tracking and which are nitpicking and can be ignored or omitted? And what conventions are used for some of the events? To start with a couple:
There isn't, unfortunately. Metrics were originally added as low-level counters for the dev team to look at during investigations, so they've been added to various components over the years without any top-down hierarchical approach. Some of them are pretty clear, and some are rather opaque. It would help to document the former, I think.
What are Current and Delta for Storage.Write? What is the value of knowing a delta, and over which period is it computed - since the last write to storage, or since metrics were last reported?
In general, Current is the latest value of the counter and Delta is the change in its value since the last time metrics were reported. By default, metrics are reported every 5 minutes, so deltas give a rough idea of a counter's velocity.
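To make that concrete, a monitoring sink could turn a reported Delta into an approximate rate. A minimal sketch (the helper name is hypothetical; the 5-minute period is the default mentioned above):

```csharp
using System;

// Hypothetical helper: convert a counter's Delta into a per-second rate,
// assuming the default 5-minute reporting period.
static double DeltaToRatePerSecond(double delta, TimeSpan reportingPeriod) =>
    delta / reportingPeriod.TotalSeconds;

// e.g. a Storage.Write Delta of 900 over 5 minutes is ~3 writes/sec
double rate = DeltaToRatePerSecond(900, TimeSpan.FromMinutes(5));
Console.WriteLine(rate); // prints 3
```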
What units are used - bytes, KBs, MBs?
The Storage.* counters, except for .Latency, simply count invocations of the corresponding methods and their failures.
Are there any counters that would make sense to monitor in a product, to proactively identify problems?
So far (based on what seems worth being aware of) we are looking at:
Runtime.GC.PercentOfTimeInGC.Current - comparing it with the previous week
AgeOfMessagesBeingProcessed - which sounds OK, but there is no indication of the units - milliseconds or ticks?
We are also tracking GC.LargeObjectHeapSizeKb and TotalMemoryKb, but we're not sure they make much sense for us - the values are pretty low.
Catalog.Activation.CurrentCount.Current
Catalog.Activation.DuplicateActivations.Current
These two kind of make sense, but we can't interpret them - is a value of 13.45 in the latter good? Bad? Irrelevant to the health of the cluster?
These two probably make no sense for us, as we are looking at the wrong thread pool:
Runtime.DOT.NET.ThreadPool.InUse.CompletionPortThreads.Current
Runtime.DOT.NET.ThreadPool.InUse.WorkerThreads.Current
Any other interesting counters?
PS: I've attached the file of events we scraped from App Insights (not from the actual code, because I don't know what is really used there), in case anyone is interested in seeing the other events: OrleansTelemetry.txt
Here are some statistics that are good to keep an eye on:
App.Requests.Latency.Average.Millis - the average is crude, but might be useful
App.Requests.LatencyHistogram.Millis - better data on latencies, but harder to monitor
App.Requests.TimedOut.Current
Catalog.Activation.CurrentCount
Grain.*.Current
Runtime.CpuUsage
Runtime.GC.GenSizesKb - gen2 in particular
Runtime.GC.LargeObjectHeapSizeKb
Runtime.GC.PercentOfTimeInGC
Scheduler.NumLongRunningTurns.Current - turns that exceed 200ms by default
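For reference, these counters only flow to a backend once a telemetry consumer is registered on the silo. A minimal sketch, assuming Orleans 3.x and the Microsoft.Orleans.OrleansTelemetryConsumers.AI package (exact API names may differ by version; the key string is a placeholder):

```csharp
using Orleans.Hosting;

// Sketch: register the Application Insights telemetry consumer so the
// silo's counters (App.Requests.*, Runtime.*, Scheduler.*, ...) are reported.
var silo = new SiloHostBuilder()
    .AddApplicationInsightsTelemetryConsumer("<instrumentation key>")
    // ... other silo configuration ...
    .Build();
```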
Thanks, we're tracking these now (except Grain.* - too many grains; it's worth an analytics query over all counters to find the hottest ones).
What is the meaning and probable cause of Scheduler.NumLongRunningTurns.Current always increasing? At some point the silo actually crashes, so I want to know what can cause this counter to keep increasing. This is a case I am currently hitting.
@mohamedhammad Hopefully you've found the answer to this by now, but wanted others who are searching to be able to find an answer...
We ran across this same problem early in our production experience with Orleans. We found that a grain was occasionally deadlocking. We assume that, eventually, enough deadlocked grains accumulated that they used up the entire thread pool, and all requests to the silo started timing out.
If your grain is NOT deadlocking, but just has high-concurrency, long-running requests... Orleans may not be the best solution for implementing that particular function.
Long-running requests can be moved to StatelessWorker grains, scheduled onto the default TaskScheduler, or executed in the background on the current grain so that the method returns promptly; a separate call or a callback can then be used to check progress/status.
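A rough sketch of two of those options (grain and method names are hypothetical; assumes Orleans 2.x/3.x-style APIs):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

public interface IReportWorkerGrain : IGrainWithIntegerKey
{
    Task BuildReport();
}

// Option 1: a StatelessWorker - Orleans can spin up many activations,
// so concurrent long-running calls don't queue on a single activation.
[StatelessWorker]
public class ReportWorkerGrain : Grain, IReportWorkerGrain
{
    public Task BuildReport() => Task.Run(ExpensiveCpuWork);

    private static void ExpensiveCpuWork() { /* heavy computation */ }
}

public interface IOrderGrain : IGrainWithGuidKey
{
    Task StartLongWork();
    Task<bool> IsDone();
}

// Option 2: kick the work into the background and return promptly;
// the caller checks progress with a separate call.
public class OrderGrain : Grain, IOrderGrain
{
    private Task _background = Task.CompletedTask;

    public Task StartLongWork()
    {
        // Task.Run moves the work onto the .NET thread pool, off the
        // grain's single-threaded scheduler, so this turn ends quickly.
        // NOTE: the background work must not touch grain state directly.
        _background = Task.Run(ExpensiveCpuWork);
        return Task.CompletedTask;
    }

    public Task<bool> IsDone() => Task.FromResult(_background.IsCompleted);

    private static void ExpensiveCpuWork() { /* heavy computation */ }
}
```

(Requires an Orleans silo to actually run; shown only to illustrate the shape of the two patterns.)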
If a given turn runs for a long time, that means some code was executing synchronously (the thread was busy or blocked) for that entire period. In that case, the issue could be lock contention, blocking I/O, or a long-running CPU computation.
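As an illustration (hypothetical code, not from any particular grain), blocking on a task inside a grain method keeps the turn running for the whole duration of the I/O, whereas awaiting releases the thread:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public class FetchExample
{
    private static readonly HttpClient Http = new HttpClient();

    // Anti-pattern: .Result blocks the current thread until the HTTP call
    // completes, so the turn runs (and may be flagged as long-running)
    // for the entire duration of the request.
    public string FetchBlocking(string url) => Http.GetStringAsync(url).Result;

    // Better: awaiting frees the thread while the I/O is in flight,
    // so the turn ends quickly.
    public async Task<string> FetchAsync(string url) =>
        await Http.GetStringAsync(url);
}
```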
We could add an option to avoid the in-built thread pool and instead schedule all work on the default ThreadPool (which is what the default TaskScheduler schedules on) and under that option there will not be a need for these kinds of warnings. Users can get themselves into trouble by blocking shared threads for long periods, but that's not Orleans-specific and so a warning is likely not warranted if the default ThreadPool is used.
In 3.2.0 we switched to using the default thread pool.