IHostEnvironmentStatistics and runtime statistics improvements

martinothamar commented 5 years ago

Working on this for Linux: #5423

Today's implementation only considers stats at the host level, so it accounts for everything running on the server, including other services. Now that Docker and k8s are becoming increasingly common, and Orleans is more suitable for microservice scenarios due to lower idle usage in recent versions, it would be beneficial to change IHostEnvironmentStatistics and the related APIs.

The interface covers

Host CPU usage percentage
Host total physical memory in bytes
Host available memory in bytes

We could also get these stats for the silo/client process itself, using perf counters on Windows and /proc/{pid}/ on Linux:

Process CPU usage percentage
Process memory usage in bytes
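
For illustration, here is a minimal sketch of how the two process-level values could be derived from /proc/{pid}/stat on Linux. It is written in Python rather than C# purely to keep it short and runnable; the function names are hypothetical and this is not how Orleans implements it. CPU usage comes from the change in utime + stime (in clock ticks) over a sampling interval, and memory from the resident set size (in pages).

```python
import os
import time


def parse_proc_stat(stat_text):
    """Return (utime_ticks, stime_ticks, rss_pages) from /proc/<pid>/stat text."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    fields = stat_text.rsplit(")", 1)[1].split()
    # `fields` starts at field 3 (state): utime is field 14, stime is
    # field 15 and rss is field 24, hence indexes 11, 12 and 21.
    return int(fields[11]), int(fields[12]), int(fields[21])


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU usage over an interval, where 100.0 means one fully used core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def sample_process(pid="self", interval_seconds=1.0):
    """Sample a process twice; return (cpu_percent, rss_bytes). Linux only."""
    def read():
        with open(f"/proc/{pid}/stat") as f:
            return parse_proc_stat(f.read())

    ticks_per_second = os.sysconf("SC_CLK_TCK")
    page_size = os.sysconf("SC_PAGE_SIZE")
    u0, s0, _ = read()
    started = time.monotonic()
    time.sleep(interval_seconds)
    u1, s1, rss_pages = read()
    elapsed = time.monotonic() - started
    usage = cpu_percent(u0 + s0, u1 + s1, elapsed, ticks_per_second)
    return usage, rss_pages * page_size
```

Calling `sample_process()` with no arguments samples the current process. Note that with this definition a multi-threaded process can exceed 100; dividing by the core count would normalize it, and which convention fits the interface is an open design choice.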

There's also all the FloatValueStatistic/IntValueStatistic things that store various runtime stats. I see now that LoadShedQueueFlowController uses the StatisticNames.RUNTIME_CPUUSAGE statistic, so I'm assuming I would need to add to that statistic for load shedding to work. So the abstraction here feels kind of messy, but maybe it's OK. I just want to open the discussion.

As for our specific scenario, we're at the end of finally migrating all our Windows IIS/Services stuff to Linux and .NET Core. Currently we are deploying all our services to 3 different VMs in an availability set, so there will be 4+ containers running on every VM. We use OrleansDashboard everywhere there's Orleans, so having a process-specific graph for CPU/memory would be nice. Since Linux etc. is pretty new to us, we haven't yet decided how to do monitoring and alerting, but we recognize that Prometheus + Grafana seems to be the "cloud native" choice. I've been thinking about writing a Prometheus exporter for Orleans stats, but CPU/memory usage can be obtained from cgroup-exporter, so I don't know if that's worth it. There are lots of other runtime StatisticNames values that look interesting for exporting, though.

benjaminpetit commented 5 years ago

IHostEnvironmentStatistics still has a lot of value, especially for load shedding: if you only monitor the Orleans app's CPU percentage, you will not see that the server is overloaded when the load is caused by another process running on the machine.

All application-specific metrics should be implemented in IAppEnvironmentStatistics instead. Note that we already get the memory usage via GC.

I have to admit that I don't know how the values read from /proc/ or from the Windows perf counters are affected if you set CPU/memory constraints on the container.
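
On the Linux side of that question, one relevant detail is that files such as /proc/meminfo and /proc/stat are generally not container-aware and report host-wide values, while the limits applied to a container are exposed through the cgroup filesystem. The sketch below (Python, hypothetical function names, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup) reads the CPU and memory limits from `cpu.max` and `memory.max`. It does not cover cgroup v1 or Windows, and it is not something Orleans is known to do.

```python
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("<quota> <period>" or "max <period>").

    Returns the CPU limit in cores, or None when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Parse cgroup v2 memory.max (a byte count, or "max" for unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_container_limits(cgroup_root="/sys/fs/cgroup"):
    """Read limits from a cgroup v2 mount; None means unlimited or unknown."""
    root = Path(cgroup_root)
    limits = {"cpu_cores": None, "memory_bytes": None}
    try:
        limits["cpu_cores"] = parse_cpu_max((root / "cpu.max").read_text())
    except OSError:
        pass  # file absent: cgroup v1, root cgroup, or not on Linux
    try:
        limits["memory_bytes"] = parse_memory_max((root / "memory.max").read_text())
    except OSError:
        pass
    return limits
```

A host-level reading combined with such limits would let an implementation report, for example, available memory relative to the container's own ceiling rather than the host's total.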

seniorquico commented 4 years ago

I created a custom IHostEnvironmentStatistics package for containers running on AWS ECS (we're currently using it with Fargate, so we're skirting around the host vs container stats discussion of OP). My goal was to provide the necessary stats for load shedding and, more of a bonus, get some relatively live metrics on the Orleans Dashboard. I've also started looking into developing packages for K8s and generic Docker workloads (depending on how some upcoming work projects go).

I also feel as though the abstractions could use a little improvement.

First, it looks as though some stats consumers go straight to FloatValueStatistic/IntValueStatistic/etc. instances whereas others use the registered IHostEnvironmentStatistics instance. At this time, it appears necessary for custom implementations to directly use FloatValueStatistic/IntValueStatistic. Although FloatValueStatistic was recently made public, the others are still internal, and I had to use reflection. I also caught the following comment from @jason-bragg on #6000:

Should these be in Orleans.Runtime.Internal? I don't think we want public using these surfaces as, imo, they are subject to change/removal. If we at least hide them behind an "internal" namespace, then that should mostly hide them and indicate that we're not supporting them as part of our public surface.

Sounds reasonable... to support IHostEnvironmentStatistics extensibility, I think a generic Orleans component could wrap the registered IHostEnvironmentStatistics instance and handle the FloatValueStatistic/IntValueStatistic instances. The generic component would just need to map the following:

RUNTIME_CPUUSAGE => IHostEnvironmentStatistics.CpuUsage
RUNTIME_MEMORY_TOTALPHYSICALMEMORYMB => IHostEnvironmentStatistics.TotalPhysicalMemory
RUNTIME_MEMORY_AVAILABLEMEMORYMB => IHostEnvironmentStatistics.AvailableMemory
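
The shape of such a wrapper can be sketched as follows. This is Python rather than C#, and everything except the three statistic names is hypothetical: the `HostEnvironmentStatistics` class only stands in for a registered IHostEnvironmentStatistics instance, and the bytes-to-MB conversion assumes that "MB" means 2**20 bytes, which the thread does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

BYTES_PER_MB = 1024 * 1024  # assumption: "MB" in the statistic names is 2**20 bytes


@dataclass
class HostEnvironmentStatistics:
    """Hypothetical stand-in for a registered IHostEnvironmentStatistics."""
    cpu_usage: Optional[float] = None            # percent
    total_physical_memory: Optional[int] = None  # bytes
    available_memory: Optional[int] = None       # bytes


def _to_mb(value: Optional[int]) -> Optional[float]:
    """Convert bytes to MB, passing through None for 'not available'."""
    return None if value is None else value / BYTES_PER_MB


def build_statistic_readers(
    host: HostEnvironmentStatistics,
) -> Dict[str, Callable[[], Optional[float]]]:
    """Map each statistic name to a callable that reads the wrapped host stats.

    The callables read lazily, so consumers always see the latest values.
    """
    return {
        "RUNTIME_CPUUSAGE": lambda: host.cpu_usage,
        "RUNTIME_MEMORY_TOTALPHYSICALMEMORYMB": lambda: _to_mb(host.total_physical_memory),
        "RUNTIME_MEMORY_AVAILABLEMEMORYMB": lambda: _to_mb(host.available_memory),
    }
```

With this arrangement a custom provider would only implement the host statistics object, and the generic component would own the statistic names and the unit conversion.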

Second, there's code duplication between the Linux & Windows (performance counters) packages. Specifically, each provides the RUNTIME_GC_TOTALMEMORYKB, RUNTIME_DOT_NET_THREADPOOL_INUSE_WORKERTHREADS, and RUNTIME_DOT_NET_THREADPOOL_INUSE_COMPLETIONPORTTHREADS stats. These are interesting stats. I ended up copying these into my ECS implementation (since our workloads don't register the Linux or Windows implementations). I think these are good candidates to refactor into a common provider, so they're available in all environments.

dotnet / orleans

IHostEnvironmentStatistics and runtime statistics improvements #5426