Remove performance counters from ThreadTrackingStatistic

jason-bragg commented 7 years ago

As part of the move to .NetCore we must isolate all of our dependencies on windows performance counters. Most of the statistics we write to performance counters have been partitioned from Orleans core, but we still read environment information like cpu and memory from performance counters.

A first pass at addressing this was done in PR “Updated ThreadTrackingStatistic to use PerfCounterTelemetryConsumer #2147”. This approach, however was not ideal and the PR is out of date, so we’re closing the PR.

This issue needs be addressed prior to Orleans 2.0

xiazen commented 7 years ago

after reading couple related PR and issue, I think the suggested solution is

Create public IThreadTrackingStatistic interface with the public surface of ThreadTrackingStatistic.
Move ThreadTrackingStatistic to OrleansTelemetryConsumers.Counters as an implementation of IThreadTrackingStatistic.
Add extension function to register ThreadTrackingStatistic type in the service collection for DI.
Update Orleans to get IThreadTrackingStatistic from DI.

One topic related to this issue, RuntimeStatisticsGroup is using PerformanceCounters too, so that it has the same problem with TreadTrackingStatistic. I think we can apply the same solution to RuntimeStatisticGroup to tackle the PerformanceCounters problem. Open the door to users to write their own implementation for RuntimeStatisticsGroup and TreadTrackingStatistic in other systems.

jason-bragg commented 7 years ago

This seems reasonable, though I'd love to see @galvesribeiro and @attilah views on it.

I have a question about the DI pattern though. How would services from an optionally included external package like OrleansTelemetryConsumers.Counters be added to DI? Would configuration set this? Would users be required to add a Startup class that added the service?

I ask because it's a general problem that limits all our external components. If there is a good pattern for solving it in this case, we may be able to apply that pattern more widely at great advantage.

xiazen commented 7 years ago

why did you close this?

Would users be required to add a Startup class that added the service?

I think in Silo.cs, we injected a set of default implementation for services to DI. So in windows system, I think we can inject default implementation for ThreadTrackingStatistic and RuntimeStatisticGroup . In other system, we have no choice to use a Startup class to inject the service. But since we intend to make OrleansTelemetryConsumers.Counters an optional package, we cannot inject the default implementation in OrleansTelemetryConsumers.Counters to DI in silo.cs as we do it now. So I think the answer of your question is yes.

@galvesribeiro and @attilah can correct me if I'm not correct.

galvesribeiro commented 7 years ago

In my opinion we are merging the statistics collection with the end consumer. The purpose of OrleansTelemetryConsumers.Counters is that it behave as the name suggest, a consumer of telemetry that store the metrics on PerfCounters.

Regarding reading CPU/Memory, we don't have that in .Net Core so I suggest two options:

Create an interface and abstract the reading of those machine statistics and let people register to it. In this option, Orleans would read the data thru this interface and publish those metrics to any IMetricsTelemetryConsumer, not just the perf counters one.
Use the .Net Standard provided APIs that would read the metrics and push then to the registered IMetricsTelemetryConsumers.

In all cases, the registration and dependencies for this thing (an extension method in config or just the startup class) should be revamping with the host on the new SiloBuilder... The way it is today we are just piling more and more things that is getting harder and harder to configure things.

jason-bragg commented 7 years ago

why did you close this?

Because I apparently can't read buttons.. and now I must admit this publicly

xiazen commented 7 years ago

I thought about option 1 too, now after a second thought on my original post, if we are going to DI RuntimeStatisticGroup, we need to create abstraction for FloatValueStatistic and IntValueStatistic too and DI those, since RuntimeStatisticGroup has to work with those two to work properly. This chain effects may lead to create abstraction for the whole Orleans statistic system. I'm a big fan for create abstraction for Orleans statistic system and all the benefits it would bring, but this is a lot of work. So I incline to @galvesribeiro 's option1 now to solve just this perf counter problem,.

About option 2, do you know what API they are going to provide?

jason-bragg commented 7 years ago

In the interest of narrowing to the core problem, I don't think we should care (at this time) who needs the IThreadTrackingStatistic. I'm also not clear if IThreadTrackingStatistic is a good name, as what we're gathering are OS metrics; metrics like memory, cpu, and time in GC, correct?

I think the current need is to abstract out how these metrics are obtained, and preserve backwards compatibility, which means introducing an implementation of the abstraction that uses performance counters to get these metrics.

galvesribeiro commented 7 years ago

That is what I suggested @jason-bragg.

However, I just spoke with a PM in CoreFX team and I think we have a workaround...

For now, we just PInvoke the native WinAPIs to read the CPU/Memory usage so we remove the usage of that PerfCounters class and keep reading the same data from Windows.
I'm discussing with him the proper design of Cross-Plat PerfCounters on CoreFX, so I can implement it right away and we will have it on netstandard and then we can make the consumer and readers to use this new nuget.
Once the CoreFX package got published, we remove the PInvokes.

The worse case scenario, we can hold on on this issue as blocked by CoreFX and once I finish it, we update Orleans so no need for PInvoke.

jason-bragg commented 7 years ago

That is what I suggested

Understood. Just clarifying to keep focus on the core problem. Orleans statistics system is a hairball spit up by the less fortunate of Schrödinger's cats, I'd not suggest untangling it in one sitting as the radiation is strong.

As for a xplat perf counters, that would be great, but this would be a good change anyway as it improves testability, and putting it off with the expectation that someone may solve it for us does not, in my view, seem necessary.

At the end of the day, we've passed this task to Xiao, so I'll leave it to her to balance the pros/cons. My main intention has been to clarify the problem and critically examine proposed solutions.

xiazen commented 7 years ago

I think using a abstraction layer nicely decouple Orleans 2.0 timeline with netstandard Cross-Plat PerfCounters development timeline. Once we have the abstraction layer, there's nothing stopping us have an implementation using Cross-Plat PerfCounters or PInvoke, or other cross-platform solution.

I don't see a huge gain of us using PInvoke, since we still need to come up with perf measuring solution for other system. And for windows, we still can use Perf Counters. We can always use #ifdef to switch from full dotnet to dotnet standard. And Perf Counters is more native than PInvoke.

xiazen commented 7 years ago

But thanks for the effort on Cross-Plat PerfCounters with netstandard team ! it would be awesome if we have it ! cuz idealy it is their responsibility to support it.

galvesribeiro commented 7 years ago

The PInvoke solution is just basically a temporary workaround to read perfCounter data on Windows or read libc statistics on Linux/OSX just like it is done in Mono. The idea with it is just to remove the machine statistics reading part from PerfCounter windows-only classes. The producer part (which may be confused with namming) which is called TelemetryConsumer, is already fully extracted from the core and no changes are need now. They are a net45 assembly and ofc works only on Windows. Once CoreFX get it done, we can just update this nuget for it.

xiazen commented 7 years ago

Yes I understand that you are suggesting PInvoke is just temporary. My understanding of PInvoke is it still need an dll to invoke from to perform functionalities. So why not use/copy the native functions in those dlls to implementing our statistic abstraction? I just don't see PInvoke is our only option for a temporary work around. For windows, we can use PerfCounters, for Linux/IOS, we can use its native bash commands. And we can switch between those based on its underlying OS system. And those are more native and reliable IMO.

galvesribeiro commented 7 years ago

What I'm thinking is something like this pseudo-code:

public static class StatisticReader
{
    public static float GetCPUUsage()
    {
        if(RuntimeEnvironment.IsWindows)
        {
            return Win32PInvoke.NativeCPUUsage;
        }

        if(RuntimeEnvironment.IsOSX)
        {
            return OSXPInvoke.NativeCPUUsage;
        }

        if(RuntimeEnvironment.IsLinux)
        {
            return LinuxPInvoke.NativeCPUUsage;
        }
    }

    public static  long GetMemoryUsage()
    {
        if(RuntimeEnvironment.IsWindows)
        {
            return Win32PInvoke.NativeMemoryUsage;
        }

        if(RuntimeEnvironment.IsOSX)
        {
            return OSXPInvoke.NativeMemoryUsage;
        }

        if(RuntimeEnvironment.IsLinux)
        {
            return LinuxPInvoke.NativeMemoryUsage;
        }
    }
}

That would just be a replacement for the real implementation of the PerfCounters and would not require changes on the way Orleans read those counters publicly. What I think with that is that we don't need to make big changes just to workaround and remove the counters read for now.

Those counters would probably be touched again once we agree on revamp the SiloHost/SiloHostBuilder and the Telemetry APIs consolidation, so I would suggest to not spend too much efforts on that now. IMHO replace the PerfCounter classes with a dumb PInvoke one would be enough for now and since they are OS-based native libraries they are there and would not inject any external dependency or new code.

Again, this is just an attempt to save us time to work on it better later on.

xiazen commented 7 years ago

I think we are on the same page regards on abstraction layer + implementation on different OS system. But IMO we should avoid using concrete class and static class. An interface bring much more flexibility

jdom commented 7 years ago

I agree that these are 2 different scenarios:

one is writing metrics (which I'm not entirely familiar with the work, but I remember being told it is already done and separated into its own nuget)
the other one is reading CPU and memory to use in the load shedding algorithm (which just coincidentally currently uses the perf counter APIs but for reading instead of writing).

Since it appears we only need to solve scenario (2) -but please correct me if I'm wrong- then I agree that probably the easiest would be to PInvoke to each system, assuming as @galvesribeiro mentioned, that this doesn't bring any additional strange native dependencies to Orleans, nor make us have to multi-target some more stuff. I don't really see a harm in switching to PInvoke even in the Windows platform, but if this was a concern, we could even define this function in the PlatformServices project, and have something different for full .NET (but again, I prefer not to, unless there are concerns with reading values using PInvoke in .NET).

galvesribeiro commented 7 years ago

Yes, @jdom is right. The PerfCounters telemetry consumer nuget already has all the write operation to PerfCounters. What we are discussing here are the remaining read operation which today rely on PerfCounter classes which only work on windows.

PInvoke doesn't add any extra dependencies nor does it require extra multi-targeting. If no one has anything agains this decision, I can have a look on the remaining usage of the PerfCounter classes, check what APIs would map on Windows, Linux, or OSX, and open an issue with a proposal. If its all ok, I can put a PR together. Following this strategy, would be IMHO the simplest way to get rid of the remaining PerfCounter usages and at the same time getting cross-plat and as a bonus we have load shedding algorithm back on vNext solution (today it is disabled with NETSTANDARD_TODOsince we don't have PerfCounters there.

xiazen commented 7 years ago

I think the problem here is to eliminate Perf Counter usage in Orleans core.

I think what confuses me is that there's a list of PerfCounters in ThreadTrackingStatistic and RuntimeStatisticsGroup. Those PerfCounters includes more than just CPU and memory. So are we just remove CPU and Memory perf counter and replace it with StatisticReader(I will just call it this) in this issue, or are we replace all the perf counters with StatisticReader in this issues.

xiazen commented 7 years ago

The PerfCounters used in ThreadTrackingStatistic and RuntimeStatisticsGroup are all system related

xiazen commented 7 years ago

IMO, we should build a abstraction layer, Interface IStatisticReader { ReadCPU(); ReadMemory(); ReadProcessStatistics(); Read.NET CLR MemoryStatistics(); ReadNumberOfInducedGCs() ReadLargeObjectHeapSize(); .... a lot more system specific statistics

} this interface should eliminate all perf counters usage in RuntimeStaisticsGroup and ThreadTrackingStatistic so we can go netstandard, which I think it is the goal here.

And then we can have PInvoke implementation, or the future shining cross platform perf counter implementation.

I don't see why we split those system related metrics and CPU and memory into two problems.

galvesribeiro commented 7 years ago

I think the problem here is to eliminate Perf Counter usage in Orleans core.

Yes, I want to remove it from core and replace it with PInvoke classes that get whatever counters we need (if available in the target OS), including CPU and memory.

So are we just remove CPU and Memory perf counter in this issue, or are we removing all the perf counters in this issues.

No, we are removing it everywhere, not just CPU and Memory. That was just an example. All the remaining counters are read-only. All the counters we write information to are already removed and now are in its own telemetry consumer package.

I think you are confused by the Counter term here (which is really normal... I got confused myself when I started it) so the only place in ThreadTrackingStatistics that uses PerfCounters class is here: https://github.com/dotnet/orleans/blob/dfb84055477518a91c509419c54a2bfdcb2ca413/src/Orleans/Statistics/ThreadTrackingStatistic.cs#L209

And in RuntimeStatisticsGroup here https://github.com/dotnet/orleans/blob/2bde0e776b70e2b2df3b65a8dc0bbf5754a83ea6/src/Orleans/Statistics/RuntimeStatisticsGroup.cs#L17 which I believe there is a shim that @jdom did when we started working in .Net Core port.

Again, we are not spliting CPU and Memory in another problem. What I'm trying to say is that we don't need an extra abstraction now just to read some counters which will be eventually either published to a ITelemetryConsumer to be written somewhere (maybe on the PerfCounter telemetry consumer) or to be used in load shedding algorithm.

jason-bragg commented 7 years ago

Ok. Let's take a step back here.

Finding a short term solution is not in any way a goal here. The 'general' problem is how do we get environment statistics, like cpu, memory, ... Performance counters is only relevant because that is how we currently get these values, and in that the lack of support for them on .netcore is necessitating that we address this.

The original plan was to define an abstraction layer, and move the current perf counters behind it, and then figure out how to get this data on .netcore. The abstraction layer allows us to maintain the current behavior and move to more cross platform solutions when those become available. It also improves testability as the system can now be mocked or replaced with test versions.

PInvoke is viable as a short term hack,, fine, but that is in no way the goal. In the long run it is not a good solution at all because there is no compile time checks on the calls. Version changes, supported platforms, and deployment dependencies all can make this flaky

I'm entirely ok with compromises when ship pressures dictate such, but when we've a reasonable plan which does not require much work, I've difficulty understanding why we're arguing for a case statement that pinvokes for each supported platform and hoping it works in the deployed service.

While I can see this as part of the general statistics problem, that's a big problem with no clear effort scheduled to address it. Tying this narrow problem in with it adds unnecessary complexity. Does this risk that we may need to rework, or throw away, this abstraction when the statistics system is 'fixed'? Sure, but until we know what that is going to look like, a simple abstraction is a good solution.

galvesribeiro commented 7 years ago

I have to disagree with you @jason-bragg... PInvokes are not a hack. They are the way all xplat frameworks you see today in the end will rely into.

no compile time checks on the calls

Why? We do have compile time checks with PInvokes if you do it right. We are using .net types as usual. There is not sorcery in that.

I'm entirely ok with compromises when ship pressures dictate such, but when we've a reasonable plan which does not require much work, I've difficulty understanding why we're arguing for a case statement that pinvokes for each supported platform and hoping it works in the deployed service.

Why would that doesn't work? Every time you open a socket in dotnet core (and multiple other classes) it will eventually have such check at some point. The win32 APIs for performance counters are the same since WinNT was invented. There are no changes in it. Same for other OSes which basically invoke few methods on libc.

You guys are free to decide add an abstraction there but I see no point on have an abstraction that will never have multiple implementations. Also, have an abstraction and let other people to register other implementations at this deep level of Orleans code? It is just unthinkable IMHO.

I think we are giving too much thoughts and over-complicating things here while my suggestion would handle this seamlessly without much changes at the codebase and will work cross-plat without extra dependencies and can even be a definitive solution and ignore when PerfCounters come from CoreFX.

galvesribeiro commented 7 years ago

Just to give an idea, a simple call to getrusage from libc, would give me a nice struct with almost all the counters used in Orleans today per process, or by an specific thread http://man7.org/linux/man-pages/man2/getrusage.2.html

There are even easier methods to be PInvoked on Win32 APIs.

jason-bragg commented 7 years ago

no compile time checks on the calls Why?

Because this compiles: [DllImport("IDontExist.dll")] public static extern int workdamnu(int i);

There may well be methods I'm not aware of that make this safer. I don't know what you're referring to by " if you do it right".

galvesribeiro commented 7 years ago

[DllImport("IDontExist.dll")]

That will never happen... There are only 2 libs... kernel32.dll on windows and libc.so on *NIX systems... Well, if those don't exist I guess we will not have even .Net Framework running there...

By do it right, I mean to use proper structures and not do magic with unsafe code or pointers. Just clean method stubs that will read the proper data. Simple as the method I just mentioned in previous comment.

xiazen commented 7 years ago

it is not only whether those libs exists or not, but also, based on different system environment config, those libs may exists in different location, while we are not sure whether PInvoke knows where to load them.

I think that's one of the main concerns of PInvoke doesn't have a compile checks.

xiazen commented 7 years ago

some link I found on internet about location of libc.so as an example : http://askubuntu.com/questions/40416/why-is-lib-libc-so-6-missing

galvesribeiro commented 7 years ago

@xiazen you dont need to care about the physical path of those libs... Like I said, if done right, you will have no problem...

xiazen commented 7 years ago

I was under the impression that performance counters is back to netstandard 2.0. So we just need to enable perf counter again in our ThreadTrackingStatistic .

But regardless, I think this issue isn't accurate anymore. We should merge it with other newer issues which is more accurate. #3586 and #3567 are all talking about similar issue. I think we can close this one and track the other two instead.

galvesribeiro commented 7 years ago

Perf counters are back if you are using Windows. We need evolve the telemetry v2 so we have an abstracted way to record those counters regardless of OS. After all, we want to have feature parity across all netstandard supported platforms.

xiazen commented 6 years ago

close this issue because this is a duplicate of #3567

dotnet / orleans

Remove performance counters from ThreadTrackingStatistic #2609