Open martinothamar opened 5 years ago
`IHostEnvironmentStatistics` still has a lot of value, especially for load shedding: if you only monitor the Orleans app CPU percentage, you will not see that the server is overloaded when the load is caused by another process running on the machine.
All application-specific metrics should be implemented in `IAppEnvironmentStatistics` instead. Note that we already get the memory usage via `GC`.
I have to admit that I don't know how the values read from `/proc/` or from the Windows perf counters are influenced if you set CPU/memory constraints on the container.
I created a custom `IHostEnvironmentStatistics` package for containers running on AWS ECS (we're currently using it with Fargate, so we're skirting around the host vs container stats discussion of OP). My goal was to provide the necessary stats for load shedding and, more of a bonus, get some relatively live metrics on the Orleans Dashboard. I've also started looking into developing packages for K8s and generic Docker workloads (depending on how some upcoming work projects go).
I also feel as though the abstractions could use a little improvement.
First, it looks as though some stats consumers go straight to `FloatValueStatistic`/`IntValueStatistic`/etc. instances whereas others use the registered `IHostEnvironmentStatistics` instance. At this time, it appears necessary for custom implementations to directly use `FloatValueStatistic`/`IntValueStatistic`. Although `FloatValueStatistic` was recently made public, the others are still `internal`, and I had to use reflection. I also caught the following comment from @jason-bragg on #6000:
> Should these be in Orleans.Runtime.Internal? I don't think we want public using these surfaces as, imo, they are subject to change/removal. If we at least hide them behind an "internal" namespace, then that should mostly hide them and indicate that we're not supporting them as part of our public surface.
Sounds reasonable... to support `IHostEnvironmentStatistics` extensibility, I think a generic Orleans component could wrap the registered `IHostEnvironmentStatistics` instance and handle the `FloatValueStatistic`/`IntValueStatistic` instances. The generic component would just need to map the following:
- `RUNTIME_CPUUSAGE` => `IHostEnvironmentStatistics.CpuUsage`
- `RUNTIME_MEMORY_TOTALPHYSICALMEMORYMB` => `IHostEnvironmentStatistics.TotalPhysicalMemory`
- `RUNTIME_MEMORY_AVAILABLEMEMORYMB` => `IHostEnvironmentStatistics.AvailableMemory`
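To make the shape of that mapping concrete, here is a rough, language-neutral sketch in Python (the real component would of course be C# inside Orleans). Everything in it is hypothetical: `HostStats` only stands in for a registered `IHostEnvironmentStatistics` instance, `build_statistic_map` is an invented name, and the bytes-to-MB conversion assumes the interface reports memory in bytes while the `...MB` statistics expect megabytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

BYTES_PER_MB = 1024 * 1024


@dataclass
class HostStats:
    """Hypothetical stand-in for a registered IHostEnvironmentStatistics instance."""
    cpu_usage: Optional[float]            # percent
    total_physical_memory: Optional[int]  # assumed to be bytes
    available_memory: Optional[int]       # assumed to be bytes


def _to_mb(value: Optional[int]) -> Optional[float]:
    """Convert bytes to megabytes, passing through 'unknown' (None)."""
    return None if value is None else value / BYTES_PER_MB


def build_statistic_map(stats: HostStats) -> Dict[str, Callable[[], Optional[float]]]:
    """Map statistic names to getters that read the wrapped instance lazily."""
    return {
        "RUNTIME_CPUUSAGE": lambda: stats.cpu_usage,
        "RUNTIME_MEMORY_TOTALPHYSICALMEMORYMB": lambda: _to_mb(stats.total_physical_memory),
        "RUNTIME_MEMORY_AVAILABLEMEMORYMB": lambda: _to_mb(stats.available_memory),
    }
```

The getters are evaluated on each read, so a custom implementation only has to keep its own three properties up to date and never touches the statistic objects directly.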
Second, there's code duplication between the Linux & Windows (performance counters) packages. Specifically, each provides the `RUNTIME_GC_TOTALMEMORYKB`, `RUNTIME_DOT_NET_THREADPOOL_INUSE_WORKERTHREADS`, and `RUNTIME_DOT_NET_THREADPOOL_INUSE_COMPLETIONPORTTHREADS` stats. These are interesting stats. I ended up copying these into my ECS implementation (since our workloads don't register the Linux or Windows implementations). I think these are good candidates to refactor into a common provider, so they're available in all environments.
Working on this for Linux: #5423
Today's implementation only considers stats at the host level, so it accounts for everything running on the server, including other services. Now that Docker and k8s are becoming increasingly common, and Orleans is more suitable for microservice scenarios thanks to lower idle usage in recent versions, it would be beneficial to change `IHostEnvironmentStatistics` and the related APIs.

The interface covers CPU usage, total physical memory, and available memory. We could also get these stats for the silo/client process using perf counters on Windows and `/proc/{pid}/` on Linux.

There are also all the `FloatValueStatistic`/`IntValueStatistic` things that store various runtime stats. I see now that `LoadShedQueueFlowController` uses the `StatisticNames.RUNTIME_CPUUSAGE` statistic, so I assume I would need to add to that statistic for load shedding to work. So the abstraction here feels kind of messy, but maybe it's OK. Just want to open the discussion.

As for our specific scenario, we're at the end of finally migrating all our Windows IIS/Services stuff to Linux and .NET Core. Currently we deploy all our services to 3 different VMs in an availability set, so there will be 4+ containers running on every VM. We use OrleansDashboard everywhere there's Orleans, so having a process-specific graph for CPU/memory would be nice. Since Linux etc. is pretty new to us, we haven't yet decided how to do monitoring and alerting, but recognize that Prometheus + Grafana seems to be the "cloud native" choice. I've been thinking about writing a Prometheus exporter for Orleans stats, but CPU/memory usage can be had from cgroup-exporter, so I don't know if that's worth it. But there are lots of other runtime `StatisticName`s that look interesting for exporting.
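For the Linux side of the per-process idea, a minimal sketch of what reading `/proc/{pid}/` could look like is below. It is written in Python purely for illustration and is not how Orleans does it; the function names are made up. It relies on the documented layout of `/proc/<pid>/stat` (fields 14 and 15 are `utime` and `stime` in clock ticks) and `/proc/<pid>/statm` (second field is resident pages).

```python
import os
import time


def parse_cpu_ticks(stat_text: str) -> int:
    """Return utime + stime (clock ticks) from the text of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces or ')',
    # so split on the LAST closing parenthesis.
    after_comm = stat_text[stat_text.rindex(")") + 1:].split()
    # after_comm[0] is field 3 (state); utime is field 14, stime is field 15.
    return int(after_comm[11]) + int(after_comm[12])


def parse_rss_pages(statm_text: str) -> int:
    """Return the resident set size (pages) from the text of /proc/<pid>/statm."""
    return int(statm_text.split()[1])


def cpu_percent(ticks_before: int, ticks_after: int,
                elapsed_seconds: float, ticks_per_second: int) -> float:
    """CPU used over the interval, as a percentage of ONE core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def sample_process(pid: int, interval: float = 1.0) -> dict:
    """Sample CPU % and resident memory for a single process (Linux only)."""
    hz = os.sysconf("SC_CLK_TCK")
    page_size = os.sysconf("SC_PAGE_SIZE")

    def read(name: str) -> str:
        with open(f"/proc/{pid}/{name}") as f:
            return f.read()

    before = parse_cpu_ticks(read("stat"))
    time.sleep(interval)
    after = parse_cpu_ticks(read("stat"))
    return {
        "cpu_percent": cpu_percent(before, after, interval, hz),
        "rss_bytes": parse_rss_pages(read("statm")) * page_size,
    }
```

Calling `sample_process(os.getpid())` would give numbers for the current process only. Note that the CPU figure is relative to one core, so a busy multi-threaded process can exceed 100; dividing by the core count would be needed to compare it with a host-wide percentage.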