Icinga / icinga-powershell-framework

This PowerShell module allows you to fetch data from Windows hosts and use it for inventory and monitoring solutions. Together with the Icinga Web 2 module, it provides a detailed overview of your Windows infrastructure.
MIT License
78 stars · 33 forks

High CPU usage #131

Closed: slalomsk8er closed this issue 8 months ago

slalomsk8er commented 4 years ago

I tested a set of Service Checks and got an 18% increase in CPU usage on a 2-core virtual server.

An 18%+ increase in CPU usage for monitoring across the whole cluster is not acceptable for us.

The configured services

(screenshot)

Icinga 2 + Icinga PowerShell Service enabled

(screenshot)

Icinga 2 + Icinga PowerShell Service disabled

(screenshot)

Any ideas on reducing the impact of the checks?

LordHepipud commented 4 years ago

Hello, and thank you for the report. Indeed, on machines with fewer cores the CPU impact might be higher. In general, the performance impact should only be a short peak during the call and should not last for a longer period.

Is the reported high CPU usage only present in short "bursts", or is the entire CPU usage constantly higher? We are currently investigating different solutions to decrease the overall impact of the Framework. Right now it is important for me to understand the impact on the system itself: whether the load is higher in general, or whether it only increases during the execution of plugins.
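A low-effort way to answer that question, sketched below with built-in cmdlets only, is to log the total CPU counter for a few minutes and look at the shape of the curve. The output path is just an example, and the counter name is localized on non-English Windows.

```powershell
# Illustrative only: sample total CPU every 5 seconds for 10 minutes and write it to a CSV,
# so bursts can be distinguished from a constantly elevated baseline.
Get-Counter -Counter '\Processor(_Total)\% Processor Time' -SampleInterval 5 -MaxSamples 120 |
    ForEach-Object {
        [PSCustomObject]@{
            Time = $_.Timestamp
            CPU  = [math]::Round($_.CounterSamples[0].CookedValue, 2)
        }
    } |
    Export-Csv -Path "$env:TEMP\cpu-baseline.csv" -NoTypeInformation
```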

slalomsk8er commented 4 years ago

The high CPU usage happens in bursts, but I don't think the bursts are that short.

(screenshots)

Icinga2 & Icinga PowerShell Service disabled

(screenshots)

I measured on a different server that had no service checks activated yet, and its CPU usage is flat, the same as when I disabled the services.

Also, I can see that the peaks correspond to the PowerShell process.

(screenshots)

vr255 commented 4 years ago

Hello,

we can also see this behaviour in our environment, mostly when the CPU check is executed.

(screenshot)

This has the unpleasant side effect that we get a lot of false alarms.

LordHepipud commented 4 years ago

Thank you for all the detailed reports. We are already taking a look at this to figure out how we can reduce the overall impact during plugin execution.

The biggest "issue" is that plugins do not remember their last state, which means all Performance Counter and internal objects have to be re-initialised on every call.

drapiti commented 4 years ago

Hi, we also have this problem, and on some critical systems we have had to disable the agent and the service. If there is anything we can do to help, we are available.

LordHepipud commented 4 years ago

To get a better understanding of the current impact and to provide possible solutions in the future (and for internal testing), it would be helpful to get some additional data:

This will help a lot to build proper test environments to see where we can reduce overall impact.

drapiti commented 4 years ago

Here is the info of the applied service set:

(screenshot)

These checks all use the latest PowerShell framework. This is the current base service set applied to all Windows machines. We have many VMs with 1 to 2 vCPUs which are often under high load, so these checks really kill those systems. CPU speed is typically between 2.1 and 2.6 GHz, but the CPU usage is apparent on all systems.

The graph below is quite explicit: CPU and memory checks ran initially at 1-minute intervals, then at 5 minutes, and in the last step in the graph we removed the entire service set above from this specific machine. The machine has 2 vCPUs running at 2.6 GHz.

(screenshot)

Note: our systems currently run the MS SCOM agent, which consumes a third of the resources while collecting much more data, including the information shown in the graph above.

Other details of this specific machine, without any of the above Icinga services:

(screenshot)

I will add that we have had to suspend all monitoring on Windows systems, so this is quite critical. Our users will not accept this kind of impact on their systems.

slalomsk8er commented 4 years ago

The requested list, with &addColumns=service_check_interval addColumns=_service_check_interval_1:

2 vCPUs @ 2.1 GHz

LordHepipud commented 4 years ago

Thanks a lot for the input. We will dig into this and see if we can find a long-term solution to reduce the impact.

LordHepipud commented 4 years ago

I created the linked PR #142, which partly addresses this issue by adding an experimental feature for caching the entire Framework code. If possible, it would be great if you could test this and see whether it mitigates the issue you are having.

slalomsk8er commented 4 years ago

I deployed the linked framework and checks by downloading the zip files of the feature branches and installing them by hand, and didn't see any difference on my test server with my set of 9 checks. Maybe my understanding of PowerShell is lacking and the version from the PowerShell Gallery was still being used, even after I moved it to the recycle bin.

Sadly, I don't have more time this week for further tests, and we have found a solution with the Linuxfabrik Python checks, which reduces CPU load by 3/4 compared to PowerShell.

LordHepipud commented 4 years ago

Thank you for the feedback. Did you enable the caching with Enable-IcingaFrameworkCodeCache before running your tests?

If it was enabled and there is no performance uplift at all, we need to keep tweaking it.
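For reference, a minimal sketch of switching the cache on, using only cmdlets and the service name that appear elsewhere in this thread; run it in an elevated PowerShell, and note that the exact steps may differ between framework versions:

```powershell
# Minimal sketch: load the framework, enable the experimental code cache, and restart the
# background service so it picks up the cached code. Only cmdlets and the service name
# mentioned in this thread are used here.
Use-Icinga
Enable-IcingaFrameworkCodeCache
Restart-Service -Name 'icingapowershell'
```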

drapiti commented 4 years ago

We are testing the cache on a few servers and it is definitely much improved. I will check back next week to see how it goes on low-resource VMs.

slalomsk8er commented 4 years ago

You were right, I forgot to enable the cache in my last try. The latest test with the cache enabled showed a change from 18% down to 14% CPU - still not the roughly 3/4 to ~95% reduction that the switch to Python provided.

(screenshot: PS_cache_1)

I don't know PowerShell well enough, but I have a feeling that only a miracle can save this approach. My next test will be to see what Nuitka can do to optimise the Python checks.

BTW, on Linux, 25 Python checks increased the CPU usage by 2%, so a big part of the overhead could be Windows and/or the antivirus.

Edit: there was a testing error with the Python checks. I missed the two Scheduled Tasks checks that were still running in PowerShell.

LordHepipud commented 4 years ago

Thank you all for the feedback. Yes, the "first" initialisation is more resource-intensive than other solutions. Right now we are working on a way to mitigate the current impact on systems with fewer resources available, so every test and every piece of feedback is very helpful.

On the other hand, we are already working on a long-term solution which will decrease the impact of the plugins by a much bigger margin.

drapiti commented 4 years ago

Just to update from last week and confirm what we are seeing: the cache has definitely helped, as the servers can at least perform their primary functions and we have not received any other negative feedback. Without the cache we needed to suspend the checks completely, so I think the cache should definitely be the default setting. Resource consumption is still a little on the high side compared to other solutions, so it would be great if it could be tweaked a little more. In any case, I think it is almost there, great work.

LordHepipud commented 4 years ago

Thank you very much for the positive feedback and the tests! It's very much appreciated!

Just out of curiosity: What happens if you only run the Framework with the caching enabled and still use the current stable versions of the plugins? As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

drapiti commented 4 years ago

> Just out of curiosity: What happens if you only run the Framework with the caching enabled and still use the current stable versions of the plugins? As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

Yes, this is what we have done for the moment on selected servers: we kept the current stable plugins and only activated the cache. Did we need to update the plugins as well? We will wait for version 1.3 of the framework before activating it on all servers.

LordHepipud commented 4 years ago

Thanks, yes, that's what I wanted to test, because I made some experimental changes to the plugins as well. I have now merged the PR into master, as I didn't run into any issues during testing either.

I will keep this issue open for now, in case something occurs.

drapiti commented 3 years ago

Hi, just wanted to add to this thread without opening a new issue. We have also seen some abnormal behaviour in the memory consumption of PowerShell processes. After a few days, maybe a week, the memory consumption slowly increases, causing various problems on the systems. This was prior to activating the caching feature, so I don't know yet whether it is mitigated. We only see this after a few days, so it is not as immediate as the CPU problem. Hopefully this can be addressed in the 1.4 release. Also note that this happens even when no service checks are active on the system: with just the icinga and icinga_powershell services running and no plugins active, the PowerShell processes slowly consume all system memory. We resolved it by disabling the services.

smarsching commented 3 years ago

I read that the C++-based plugins have been deprecated in favor of the PowerShell plugins, so I wanted to give them a try. However, we are having problems with the performance of the Icinga PowerShell Framework as well:

Without the cache, powershell -Command "Use-Icinga; exit Invoke-IcingaCheckMemory" takes about 15 seconds (!) and uses about 50% of CPU resources (on a virtual machine with two vCPUs).

With the cache enabled, this is reduced to about 8 seconds, but this is still way too much: we typically want to execute checks once every 30 seconds, and with just a few checks (CPU, memory, disk, some services) this would mean that most of the CPU resources would be occupied by Icinga checks, while at the moment (with the old C++-based checks) they hardly use any resources.

It seems like most of these resources are needed for initialization. When I run Use-Icinga in an interactive PowerShell instance, this takes about three seconds (with the cache enabled). The following Invoke-IcingaCheckMemory takes considerably longer, but only for the first call. All subsequent calls within the same session are very fast.
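A sketch of how these timings can be reproduced, based on the commands quoted above; absolute numbers will of course differ per host:

```powershell
# Cold start: a brand-new powershell.exe loads the framework and runs a single check.
Measure-Command {
    powershell.exe -Command 'Use-Icinga; exit Invoke-IcingaCheckMemory'
} | Select-Object -ExpandProperty TotalSeconds

# Warm session: within one PowerShell instance only the first call pays the initialization cost.
Use-Icinga
Measure-Command { Invoke-IcingaCheckMemory } | Select-Object -ExpandProperty TotalSeconds  # first call, slow
Measure-Command { Invoke-IcingaCheckMemory } | Select-Object -ExpandProperty TotalSeconds  # repeat call, fast
```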

As I understand it, this is where the Icinga PowerShell service comes into play: By repeatedly running code within a single session, the initialization overhead can be avoided. However, I do not understand yet how I can configure Icinga to run the checks through the PowerShell service instead of creating a new PowerShell instance for each check. Is this even possible at the moment?

LordHepipud commented 3 years ago

As some time has passed, I wanted to post a follow up on this issue.

Internally we are still discussing how we can resolve this issue. There are several ideas floating around, and we now need to investigate which of them are the most user- and maintenance-friendly without adding multiple layers of complexity.

We will keep you posted once a decision has been made.

@smarsching: This is indeed the biggest issue; once everything is loaded, the check execution is superior to anything else. The initial loading is the problem right now and requires a proper solution.

@drapiti: I did some tests over the past weeks and could not reproduce this issue. How many PowerShell processes are open on your machine? For the background service, there should only be one PowerShell.exe running.

Did you register any background daemons on your environment? You can check this with

Get-IcingaBackgroundDaemons

If no daemon is configured, there should be nothing being executed that could increase the memory consumption of the system.

drapiti commented 3 years ago

@LordHepipud No, at the time this was without caching and without any background daemons. The PowerShell process for Icinga was typically just the one. It is possible that there were other PowerShell processes running on the systems; I cannot exclude that. The memory usage was, however, abnormal only for the Icinga PowerShell process. Since then we have updated the Icinga agent and enabled the caching on certain systems, and we have not noticed the issue since. We will test again on dedicated low-spec machines in the next few days.

drapiti commented 3 years ago

Ok, I have again found the problem/memory leak on 3 or 4 servers: Icinga version 2.12.3 with caching enabled and no background daemons. There is indeed another PowerShell process running under a different user, because we also have SCOM on these systems at the moment. In any case, the PowerShell process running as the NETWORK SERVICE user is the one we are interested in, as it belongs to the Icinga PowerShell service. I have 4 identical systems (Windows Server 2016, 12 vCPUs, 16 GB RAM) and all have the memory usage issue.

(screenshots)

After restarting the Icinga PowerShell service, it releases roughly 10-11 GB of RAM:

(screenshot)

The longer I leave it, the more RAM it uses. Below is the current situation: 1 GB is used, and this is roughly 24 hours (maybe slightly more) since I restarted the service. So the memory creeps up slowly; it is not immediate, but it is definitely a problem. (screenshot)

LordHepipud commented 3 years ago

This is actually insane. Can you please run Get-IcingaBackgroundDaemons and provide a list of the daemons configured there? In addition, this might be required as well: Show-IcingaRegisteredServiceChecks

It seems like there is a memory leak somewhere. However, the daemon has been running on my test machines with configured checks for weeks, and the maximum memory usage is around 400 MB.

This is a problem we need to resolve.
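A possible way to collect the requested output together with a memory snapshot of the daemon process, sketched with cmdlets from this thread; the service name 'icingapowershell' is taken from later comments and may need adjusting:

```powershell
# Diagnostics requested above, plus a working-set snapshot of the daemon process so the
# growth can be tracked over time.
Use-Icinga
Get-IcingaBackgroundDaemons
Show-IcingaRegisteredServiceChecks

# Resolve the PID of the 'icingapowershell' service and report its memory usage.
$svc = Get-CimInstance -ClassName Win32_Service -Filter "Name='icingapowershell'"
Get-Process -Id $svc.ProcessId |
    Select-Object Id, ProcessName, StartTime,
        @{ Name = 'WorkingSetMB'; Expression = { [math]::Round($_.WorkingSet64 / 1MB, 2) } }
```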

drapiti commented 3 years ago

(screenshot)

PS C:\Users\XXXXX> Show-IcingaRegisteredServiceChecks

The registered service checks reported were (the Arguments field was empty for every entry):

| CheckCommand | Interval | TimeIndexes | Service Id |
| --- | --- | --- | --- |
| Invoke-IcingaCheckCheckSum | 30 | 1, 3, 5, 15 | 1142162233510423583126106156166114149146101222229139227 |
| Invoke-IcingaCheckProcessCount | 30 | 1, 3, 5, 15 | 178831791402541981931455015611214523419424171869310678 |
| Invoke-IcingaCheckPerfcounter | 30 | 1, 3, 5, 15 | 205134180237311434244121159271702471059476122351378 |
| Invoke-IcingaCheckNLA | 30 | 1, 3, 5, 15 | 7819925144220911212211541452062010520591872255 |
| Invoke-IcingaCheckICMP | 30 | 1, 3, 5, 15 | 8162534614323820158551961061621231301622501302888140 |
| Invoke-IcingaCheckFirewall | 30 | 1, 3, 5, 15 | 551181381651201999422715187105120114403514417424517438 |
| Invoke-IcingaCheckDirectory | 30 | 1, 3, 5, 15 | 4418915053184230706218648811043538379142187245241 |
| Invoke-IcingaCheckCertificate | 30 | 1, 3, 5, 15 | 9610010422515761031115782271359412823168106218228198 |
| Invoke-IcingaCheckUpdates | 30 | 1, 3, 5, 15 | 71228172200371342936213124133145226933171202162250 |
| Invoke-IcingaCheckUsedPartitionSpace | 30 | 1, 3, 5, 15 | 7036121221531978481915358248175219721422021816290 |
| Invoke-IcingaCheckMemory | 30 | 1, 3, 5, 15 | 225104251186107973819916186121177195202524160180123241 |
| Invoke-IcingaCheckUsers | 30 | 1, 3, 5, 15 | 22916951791872245919221925523725213411516976176244 |
| Invoke-IcingaCheckService | 30 | 1, 3, 5, 15 | 561201072111290217166515465914226141181963228 |
| Invoke-IcingaCheckUptime | 30 | 1, 5, 15 | 37531985071189209661626117547296989931088583 |
| Invoke-IcingaCheckBiosSerial | 30 | 1, 3, 5, 15 | 71649014336641522551707820712518416824890151141790 |
| Invoke-IcingaCheckEventlog | 30 | 1, 3, 5, 15 | 1071411171146424418212311684952193921312318819924195214 |
| Invoke-IcingaCheckCPU | 30 | 1, 3, 5, 15 | 5275219864641021224811420224776891459631192206 |

LordHepipud commented 3 years ago

Thank you for the input! I will register the services one by one to check where the issue originates.

drapiti commented 3 years ago

Hi, just to update: we have begun testing the 1.4 release, although the memory leak still looks to be present. Cheers

LordHepipud commented 3 years ago

Let's discuss the memory leak in #224 and use this issue for CPU usage only.

RomainPisters commented 3 years ago

Is there any update regarding this issue? We face the same problems, where the extra load of the PowerShell checks often results in services going to the UNKNOWN state because of "Remote Icinga instance 'hostname.fqdn.com' is not connected to 'ouricinga.fqdn.com'".

drapiti commented 3 years ago

The CPU consumption remains a problem for us as well; hopefully a definitive solution is in the works. All memory issues are now resolved, but our users are complaining about the 25% of CPU resources required by the PowerShell process that belongs to the Icinga PowerShell Framework. Any updates on this? Cheers.

LordHepipud commented 3 years ago

We are still working on this topic, but we might have reached a point where there is not much left to improve. Right now we believe the best approach is to move the plugin execution to an internal REST API.

We added an experimental feature for this with 1.4.0: https://icinga.com/docs/icinga-for-windows/latest/doc/experimental/01-Forward-checks-to-internal-API/

Would it be possible to test this experimental feature and give us feedback on whether the overall performance impact is gone?
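For anyone testing this, a minimal sanity-check sketch; it assumes the feature has already been enabled as described in the linked documentation and that the REST API listens on port 5668, the default mentioned later in this thread:

```powershell
# Confirm the background daemon is registered and that something is listening on the
# REST API port; 'TcpTestSucceeded' should report True.
Use-Icinga
Get-IcingaBackgroundDaemons
Test-NetConnection -ComputerName localhost -Port 5668
```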

drapiti commented 3 years ago

We have begun testing and so far it looks much, much better. I will check back next week. Thank you

LordHepipud commented 3 years ago

Have you been able to test the experimental feature yet and verify whether the performance impact was reduced in general? How are the plugin execution times?

drapiti commented 3 years ago

Yes, this is now working much better. After having deployed a couple of hundred agents, consumption is now negligible. Execution time is typically between 2 and 5 seconds depending on the plugin, not quite as low as the Linux plugins, which are typically below 1s. For the most part the PowerShell process is at 0% CPU consumption. We have also kept the caching mechanism. To be honest, I have run into a different issue at the moment, but I don't think it's related to the Windows agents: now that we have distributed the agent to a couple of hundred Windows machines, one of my two masters is crashing after a few hours for some reason. I will check this on the Icinga thread. There seems to be a memory leak on the config master running version 2.12.3; no problems on the second master.

haxtibal commented 3 years ago

@LordHepipud @drapiti

> Execution time is typically between 2 and 5 seconds depending on the plugin, not quite as low as the Linux plugins, which are typically below 1s.

We also see 2-5 seconds, but worse: on some of our machines, plain PowerShell initialization causes really high CPU load. We see a ridiculous 75% load for ~5s just by starting powershell.exe -c Exit in a cmd.exe, and consequently the same when Exit-IcingaExecutePlugin does the HTTP POST to the API.
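A small sketch to reproduce that bare start-up cost and to rule out a slow PowerShell profile as a contributing factor:

```powershell
# Bare shell start-up cost, with and without the user profile.
Measure-Command { powershell.exe -Command Exit } | Select-Object -ExpandProperty TotalSeconds
Measure-Command { powershell.exe -NoProfile -Command Exit } | Select-Object -ExpandProperty TotalSeconds
```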

As a mitigation, I've started a PoC project, i4w_callapi, which is basically a replacement for Exit-IcingaExecutePlugin, but written in Rust to perform the REST API call without overhead. Usage:

```
> call_api_check.exe -c Invoke-IcingaCheckCPU -- -Warning 50 -Critical 90
[OK] Check package "CPU Load" | 'core_4'=3.378296%;50;90;0;100 'core_total'=3.948704%;50;90;0;100 'core_6'=3.345666%;50;
90;0;100 'core_5'=3.39417%;50;90;0;100 'core_7'=3.038184%;50;90;0;100 'core_0'=5.790892%;50;90;0;100 'core_2'=4.960417%;
50;90;0;100 'core_1'=3.999451%;50;90;0;100 'core_3'=3.682836%;50;90;0;100
```

call_api_check.exe can be wrapped into a CheckCommand object and is statically linked for easy redistribution. If you think this could be useful for a broader audience, I'd invite you to join the discussion on how it can evolve. No binary release yet; I'll create one if there's interest.

Finally, thanks for your effort, we appreciate it a lot. #131 was a showstopper for us, and now with #204, Icinga/icinga-powershell-restapi#4, Icinga/icinga-powershell-apichecks#4 and i4w_callapi it finally seems solved.

drapiti commented 3 years ago

> As a mitigation, I've started a PoC project, i4w_callapi, which is basically a replacement for Exit-IcingaExecutePlugin, but written in Rust to perform the REST API call without overhead. [...]

Very interesting. I'm wondering if it's possible to avoid having the extra Rust binary altogether and just build the API calls into the Icinga agent.

haxtibal commented 3 years ago

> Very interesting

I've just created a binary pre-release; I'm happy to receive comments and suggestions over there.

> I'm wondering if it's possible to avoid having the extra Rust binary altogether and just build the API calls into the Icinga agent.

That would be really nice. Breaking API changes could be handled Icinga-internally, and users would be spared downloading an external binary. The tech stack is ready: the Icinga 2 agent already depends on Boost.Asio, Boost.Beast and nlohmann/json. @LordHepipud @Al2Klimov Any chance we'll get there one day?

Al2Klimov commented 3 years ago

IMAO messing up the Icinga address space with even Perl #7461 is not the best idea. And a machine-near language which has arbitrary memory R/W/X access... O.O

IMAO N*****/Icinga users have always downloaded and will always download extra binaries. But nowadays it's even easier w/ provisioning.

haxtibal commented 3 years ago

> IMAO messing up the Icinga address space with even Perl #7461 is not the best idea. And a machine-near language which has arbitrary memory R/W/X access... O.O

Sure! The idea was not to compile/link/exec Rust code within the agent (it's not that bad, but I absolutely agree that one wants to avoid such experiments for little benefit). The idea here is to rewrite those few lines in C++, commit them into icinga2, and compile them as an optional target, like it's already done for some of the other check_*.exe tools.

Al2Klimov commented 3 years ago

You mean https://icinga.com/docs/icinga-2/latest/doc/10-icinga-template-library/#check-commands ?

haxtibal commented 3 years ago

> You mean https://icinga.com/docs/icinga-2/latest/doc/10-icinga-template-library/#check-commands ?

Yes, or perhaps windows-plugins-for-icinga-2 because it's Windows specific?

There could be a new check_icinga_ps_restapi.cpp next to check_nscp_api.cpp, check_procs.cpp and so on. It would talk to your new icinga-powershell-apichecks via HTTP. I know these plugins are marked as

> DEPRECATED in favor of our PowerShell Plugins

but as it turns out, PowerShell on the agent side has the drawback of adding a few seconds of delay in the best case and burning CPU time in the worst case.

I don't want to push on this; we're quite happy with your recent work and the existing patch set. It just seemed reasonable for there to be a "consume-your-own-API" tool.

Al2Klimov commented 3 years ago

Wait! There is a consume-your-own-API check command: https://icinga.com/docs/icinga-2/latest/doc/10-icinga-template-library/#dummy

E.g. ...

object Host "h" {
    check_command = "dummy"
    vars.dummy_text = {{ " |hosts=" + len(get_objects(Host)) }}
}

... leads to the perfdata hosts=1 for me.

haxtibal commented 3 years ago

> Wait! There is a consume-your-own-API check command

Guess we're thinking about different things. I meant "your-own-API" = https://localhost:5668/v1/checker?command=<icinga_powershell_cmdlet> as implemented with #204, and "consume-your-own-API" = a (native) check plugin that translates between the Icinga /v1/checker API and the Icinga agent. Sorry for being unclear. The command you gave wouldn't do that translation, would it?

You almost have a solution, namely Invoke-IcingaInternalServiceCall. The only issue with it is that it's written in PowerShell. PowerShell is fine for Icinga for Windows in general, but not for this small portion, where we need something that can be forked off really quickly and efficiently.

Al2Klimov commented 3 years ago

Yes, with my solution you'd have to translate it by yourself.

log1-c commented 3 years ago

Here is some additional feedback regarding the framework's CPU resource usage, from a Windows license server (2 vCPUs).

At 28.05.21 10:30: added exclusions to Windows Defender for the icinga2 and powershell processes, as well as for the powershell-framework and powershell-plugins folders.

Previously the server had an avg cpu load of ~11%

After adding the exclusions this changed to ~9% avg. (screenshot)
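For reference, a sketch of such Defender exclusions via Add-MpPreference; the module paths below are examples and need to be adapted to the actual install locations, and excluding powershell.exe from scanning is a security trade-off to weigh:

```powershell
# Example exclusions; adjust the paths to your installation before running.
$modulePaths = @(
    'C:\Program Files\WindowsPowerShell\Modules\icinga-powershell-framework',
    'C:\Program Files\WindowsPowerShell\Modules\icinga-powershell-plugins'
)
Add-MpPreference -ExclusionProcess 'icinga2.exe', 'powershell.exe'
Add-MpPreference -ExclusionPath $modulePaths
```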

At 02.06.2021 ~15:00: activated the PowerShell Framework API features (red line in the following picture). Load spikes have been reduced drastically, but the load average keeps growing over time. (screenshot)

At 07.06.2021 13:30: restarted the icingapowershell service. After this, the average dropped again, but then kept climbing continuously over time. (screenshot)

30-day history: (screenshot)

Here is a small side-by-side comparison of two Windows terminal servers, as well as a simple Windows VM without any hardening/GPOs/domain join, experiencing the CPU spike. The yellow-marked area was an execution of 3 checks simultaneously. (screenshot)

And here is another comparison of those two Windows terminal servers; both have neither the API features nor the average collection enabled. (screenshot)

Upper graph: the server checked via the EXE files; it additionally has one check via the PowerShell framework (CPU check).

Lower graph: the server checked via the PowerShell framework; it additionally has one check via the EXE files (CPU check).

log1-c commented 3 years ago

some more screenshots incoming :)

Windows VM with 1 CPU core: CPU usage of 50% on average and many spikes of up to 100% when checks come in, rendering the VM unusable. (screenshot)

I enabled the API feature at around 08:00 on 10.06.2021. As can be seen in the graph, this reduces the peaks, but the avg load remains at ~50%. (screenshot)

Following are some screenshots of the VM's Task Manager, where the executed checks can be seen: (screenshots)

What honestly puzzles me is that, even after enabling the API feature, there are still PowerShell processes spawned for each check. I thought the experimental API feature should reduce/prevent that, or am I mistaken here? (screenshot)

What also puzzles me is that the PowerShell daemon running to collect the average CPU usage has a constant load of at least 5%, up to 25%. (screenshot)

(screenshot)

LordHepipud commented 3 years ago

Thank you for all the input so far. To answer your questions: the API feature will still spawn a PowerShell process, but that process only loads a small set of core components and then triggers the check over the REST API. Therefore you will still see PowerShell processes spawning, but with a generally shorter lifespan and lower CPU usage.

The daemon you are talking about is doing everything in the background. It collects your CPU metrics over time, but also accepts your incoming API requests and handles them. That is why you see a low to mid-low CPU usage there: every single check is executed within this daemon's scope.
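A small sketch to observe this behaviour: the long-running daemon can be told apart from the short-lived bootstrap processes by its uptime and accumulated CPU time (reading StartTime of the service's process usually requires an elevated shell):

```powershell
# List all powershell.exe processes with uptime and accumulated CPU time; the daemon is the
# long-lived entry, the per-check bootstrap processes are the short-lived ones.
Get-Process -Name powershell |
    Select-Object Id, StartTime,
        @{ Name = 'UptimeMin';  Expression = { [math]::Round(((Get-Date) - $_.StartTime).TotalMinutes, 1) } },
        @{ Name = 'CpuSeconds'; Expression = { [math]::Round($_.CPU, 1) } } |
    Sort-Object UptimeMin -Descending
```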

smarsching commented 3 years ago

I think the solution @haxtibal mentioned is designed specifically to avoid spawning these processes, further reducing CPU load (refer to his comments from May for details).

SamIX7 commented 3 years ago

Thanks @log1-c. I'm now using the old executables, which work a bit better and with a lower performance impact.