containers / nri-plugins

A collection of community maintained NRI plugins
https://containers.github.io/nri-plugins/
Apache License 2.0
67 stars 24 forks source link

balloons: add support for cpu frequency governor tuning #374

Closed fmuyassarov closed 1 month ago

fmuyassarov commented 2 months ago

This commit introduces a new field in the balloons configuration CR, enabling users to specify CPU frequency governor mode preferences.

fmuyassarov commented 2 months ago

@askervin @klihub PTAL. Note that I didn't add a commit for the docs assuming that there might be some changes with the naming or something else. What you guys are happy with the change I will add a commit to update the documentation on this same PR.

fmuyassarov commented 1 month ago

LGTM to me. I skimmed through this once, and if I got the naming wrt. config hierarchy right, this looks fine to me.

My only question is that since ATM we can't test this functionality in our e2e test VMs, have you tested this on real HW in practice ?

Yes, I have tested it on my own machine and it works as expected. Fro example, I was checking the updated governor mode of a core belonging to a balloon by simply cat /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor

askervin commented 1 month ago

I really wish we had some way to test this in our e2e tests, too. For instance, if we have idleCPUClass defined and no balloons (except for the reserved an possibly the default) to start with, we should have CPUs configured with idle class frequencies and governor.

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

Another reason why this should have an e2e test is that we wish to see what happens when someone uses a balloons configuration with a governor in a vm. After all, it's very much possible that the policy runs in a vm, and might (possibly accidentally) have CPU classes defined, too.

klihub commented 1 month ago

I really wish we had some way to test this in our e2e tests, too. For instance, if we have idleCPUClass defined and no balloons (except for the reserved an possibly the default) to start with, we should have CPUs configured with idle class frequencies and governor.

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

Another reason why this should have an e2e test is that we wish to see what happens when someone uses a balloons configuration with a governor in a vm. After all, it's very much possible that the policy runs in a vm, and might (possibly accidentally) have CPU classes defined, too.

One possibility is to use similar plumbing than we use for testing controller-agnostic hook triggers in balloons/n4c16/test21-controller-check.

That would require a few changes to the cpu controller when ENABLE_TEST_APIS is set in the environment:

In the handler for the HTTP endpoint you dump something like this as JSON

type state struct {
     Classes map[string]Class
     Assignments map[string]idset.ID // might be easier to have map[string]string instead
}

Then in the tests, we use this endpoint just like we now do for querying metrics, but we fetch data from /cpu-control-state instead of /metrics. Since the exposed data is JSON in a known format controlled by us, we can verify the expected test result with jq queries.

fmuyassarov commented 1 month ago

@fmuyassarov, if this runs in a vm, do we get log error messages about failing to set a governor for a CPU? From those errors we could verify that we at least tried to do something sensible.

I can verify it. My previous tests were on physical machine.

fmuyassarov commented 1 month ago

Then in the tests, we use this endpoint just like we now do for querying metrics, but we fetch data from /cpu-control-state instead of /metrics. Since the exposed data is JSON in a known format controlled by us, we can verify the expected test result with jq queries.

As we discussed offline, let's keep it in mind and I will try to allocate some time for setting up something similar that you described here.