Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.02k stars 577 forks source link

Analyse check execution performance #7425

Open dnsmichi opened 5 years ago

dnsmichi commented 5 years ago

Story

TL;DR - performance analysis and debugging on the throughput of checks. Scroll down for problems, technical analysis and solutions.

Issues

ref/NC/628417 ref/NC/625049

Problems

Performance

The multi-plexed unix socket pair with pipe streams and event FD polling introduces several locks with concurrent check runs waiting for each other. Also the throughput with the thread pool is blocked with this.

7421 exposes this with current_pending_callbacks as metric.

7416 discusses a problem where the actual running checks are different to what's actually executed as processes. Whenever the thread pool is locked up, no more additional checks can be executed. This is different to the actual running processes though.

In theory, the throughput of executed processes must be faster than 1000 per second for instance.

Too many open files

Whenever MaxConcurrentChecks is raised to a higher limit than the system's open file descriptors, the process pipes may exceed this limit.

The following fork() calls then resulted in errors with too many open files preventing the fork thus returning into a critical check result.

This can be solved with raising the number of open files within Systemd and Icinga 2.

Late Check Results

Technical Details

Check Execution Flow

Check Scheduler

The checker component works as check scheduler, each checkable object is stored with its next execution time.

Once a check should be executed, there's additional guards in place to ensure that not all 10000 checks run in the same second for instance (MaxConcurrentChecks). At this time, the scheduler waits up until the remaining running processes finish and free up slots.

The check task is posted into the application's thread pool, and the scheduler continues with other checkables.

lib/checker/checkercomponent.cpp

Utility::QueueAsyncCallback(std::bind(&CheckerComponent::ExecuteCheckHelper, CheckerComponent::Ptr(this), checkable));

Global Task Thread Pool

Once the thread pool works on the task, the registered callback function CheckerComponent::ExecuteCheckHelper is executed.

lib/base/threadpool.hpp

                        boost::asio::post(*m_Pool, [this, callback]() {
...
                                try {
                                        callback();
                                } catch (const std::exception& ex) {
...
                                }
                        });

The callback function actually runs the check and instruments the scheduler upon finishing later.

lib/checker/checkercomponent.cpp

                checkable->ExecuteCheck();

Asynchronous Check Execution

ExecuteCheck prepares the check result and also decides whether to run the check locally, or via command endpoint as cluster message. The latter involves resolving the macros with a simulated check run first. This is not scope of this workflow though.

A local check is fired based on the check_command attribute and its registered execution method.

lib/icinga/checkable-check.cpp

                GetCheckCommand()->Execute(this, cr, nullptr, false);

Generic Task Execution Functions

The following requires knowledge about the default templates and also execute methods for commands. Older versions of Icinga required to import the plugin-check-command template which has hidden the following from the user:

execute = PluginCheck

Newer versions of Icinga have that as default and for example a CheckCommand already knows that it will execute a PluginCheck task.

There are a few other tasks registered, e.g. for built-in memory checks like icinga, cluster-zone, etc. These built-in check command register different tasks than PluginCheck. This can be seen in lib/methods/methods-itl.conf which is compiled into the binary.

The CheckCommand Execute() method gets the execute attribute and calls the registered Invoke() interface. This allows to have multiple execution types with a single implementation (OOP).

lib/icinga/checkcommand.cpp

        GetExecute()->Invoke({

PluginCheck Task

The PluginCheck task has a script function registered. Each method is put into the lib/methods library which is highly extendable (for developers) with the simple interface used here.

lib//methods/pluginchecktask.cpp

REGISTER_FUNCTION_NONCONST(Internal, PluginCheck,  &PluginCheckTask::ScriptFunc, "checkable:cr:resolvedMacros:useResolvedMacros");

The macro takes care of fetching the ScriptFunction from the global context, being PluginCheck in this case. The registered callback function PluginCheckTask::ScriptFunc is then invoked on call.

lib/methods/pluginchecktask.cpp

void PluginCheckTask::ScriptFunc(const Checkable::Ptr& checkable, const CheckResult::Ptr& cr,
        const Dictionary::Ptr& resolvedMacros, bool useResolvedMacros)

A few preparations are taking place before check execution:

Then again, it fires a generic implementation for executing a command.

lib/methods/pluginchecktask.cpp

        PluginUtility::ExecuteCommand(commandObj, checkable, checkable->GetLastCheckResult(),
                resolvers, resolvedMacros, useResolvedMacros, timeout,
                std::bind(&PluginCheckTask::ProcessFinishedHandler, checkable, cr, _1, _2));

Note the registered callback function ProcessFinishedHandler which later takes care on creating the final check result with output, status from exit code, etc.

Prepare and Execute the Command

The function interface has quite a few things to pass which are needed for check execution.

lib/icinga/pluginutility.cpp

void PluginUtility::ExecuteCommand(const Command::Ptr& commandObj, const Checkable::Ptr& checkable,
        const CheckResult::Ptr& cr, const MacroProcessor::ResolverList& macroResolvers,
        const Dictionary::Ptr& resolvedMacros, bool useResolvedMacros, int timeout,
        const std::function<void(const Value& commandLine, const ProcessResult&)>& callback)

Now the following happens:

                command = MacroProcessor::ResolveArguments(raw_command, raw_arguments,
                        macroResolvers, cr, resolvedMacros, useResolvedMacros);

If that fails, the error is captured as output and returned as UNKNOWN check result to the caller.

        Process::Ptr process = new Process(Process::PrepareCommand(command), envMacros);

        process->SetTimeout(timeout);
        process->SetAdjustPriority(true);

Then the final process is being run. This re-binds the callback function to parse the returned output and exit state later on.

        process->Run(std::bind(callback, command, _1));

Enter the Process rabbit hole

The previous steps are all required to support passing check command parameters via custom variables from the DSL. The macro resolver also takes care about running the lambda functions at runtime for instance.

Now that the command line is in full flux, we continue. The following parts reflect the implementation in Icinga 2.10.

lib/base/process.cpp

void Process::Run(const std::function<void(const ProcessResult&)>& callback)

At this stage, Linux and Windows differ a bit. The following description focuses on Linux/Unix where we support running the check scheduler in high performance environments, Windows are just for low energy agent checks.

lib/base/process.cpp

        m_Result.ExecutionStart = Utility::GetTime();

lib/base/process.cpp

        m_Process = ProcessSpawn(m_Arguments, m_ExtraEnvironment, m_AdjustPriority, fds);

The returned process is stored for later asynchronous event processing with reading the output from the plugin process. This is based on a limited number of IO threads, 8 in 2.10.

        int tid = GetTID();

        {
                boost::mutex::scoped_lock lock(l_ProcessMutex[tid]);
                l_Processes[tid][m_Process] = this;
#ifndef _WIN32
                l_FDs[tid][m_FD] = m_Process;
#endif /* _WIN32 */
        }

Process Spawn Helper

This is the secondary process visible with pidof. It listens on a Unix socket pair for event messages from the main process.

It gets created via StartSpawnProcessHelper as a forked process, and starts listening in the ProcessHelper() function as endless loop.

Decodes the received JSON message and fires specific processes:

                if (command == "spawn")
                        response = ProcessSpawnImpl(&msg, request);
                else if (command == "waitpid")
                        response = ProcessWaitPIDImpl(&msg, request);
                else if (command == "kill")
                        response = ProcessKillImpl(&msg, request);
                else
                        response = Empty;

spawn executes a command, waitpid waits until the command process is finished, and kill ensures that plugin timeouts are correctly dealt with.

The response is JSON-encoded and sent back via unix socket pair.

                String jresponse = JsonEncode(response);

                if (send(l_ProcessControlFD, jresponse.CStr(), jresponse.GetLength(), 0) < 0) {

Process Spawn

This creates a spawn command which instruments the process spawn helper to create a new process for this plugin execution.

        do {
                while (sendmsg(l_ProcessControlFD, &msg, 0) < 0) {
                        StartSpawnProcessHelper();
                }
        } while (send(l_ProcessControlFD, jrequest.CStr(), jrequest.GetLength(), 0) < 0);

l_ProcessControlFD links to the Unix socket pair descriptor which is used for communicating with the process spawn helper. This sequence also holds the pipe for stdout/stderr for the plugin's process.

Actually spawn a process and execute something

This is done by the helper process in ProcessSpawnImpl().

Then the child process is forked.

        pid_t pid = fork();

        int errorCode = 0;

        if (pid < 0)
                errorCode = errno;

        if (pid == 0) {
                // child process

Its priority is niced, if existing.

The actual execution happens in the built-in icinga2_execvpe which uses an optimized vfork() implementation trimmed for best performance.

                if (icinga2_execvpe(argv[0], argv, envp) < 0) {
                        char errmsg[512];
                        strcpy(errmsg, "execvpe(");
                        strncat(errmsg, argv[0], sizeof(errmsg) - strlen(errmsg) - 1);
                        strncat(errmsg, ") failed", sizeof(errmsg) - strlen(errmsg) - 1);
                        errmsg[sizeof(errmsg) - 1] = '\0';
                        perror(errmsg);
                        _exit(128);
                }

The output is written directly on the FD which is mapped as pipe to the main process. You can see that with lsof -p <mainprocessPID>.

The errno and pid is returned.

The Async Process Event Loop

Again in the main process in void Process::IOThreadProc(int tid) - this gets spawned during the startup of Run(), just once.

Note

Now it is getting wild.

There are several registered IO threads, namely 4. All of them listen on the registered poll FDs from the Unix socket pair.

                        pfds[0].fd = l_EventFDs[tid][0];
                        pfds[0].events = POLLIN;
                        pfds[0].revents = 0;

Whenever poll signals a new event ...

                int rc = poll(pfds, count, timeout);

... the current IO thread reads.

                        if (pfds[0].revents & (POLLIN | POLLHUP | POLLERR)) {
                                char buffer[512];
                                if (read(l_EventFDs[tid][0], buffer, sizeof(buffer)) < 0)
                                        Log(LogCritical, "base", "Read from event FD failed.");
                        }

And once it has read enough, it calls DoEvents.

                                if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR) || is_timeout) {

                                        if (!it->second->DoEvents()) {

DoEvents aka do something

                        int rc = read(m_FD, buffer, sizeof(buffer));

Note: This is the pipe created in the beginning and wrapped into the process spawn helper.

        m_Result.PID = m_PID;
        m_Result.ExecutionEnd = Utility::GetTime();
        m_Result.ExitStatus = exitcode;
        m_Result.Output = output;

        if (m_Callback)
                Utility::QueueAsyncCallback(std::bind(m_Callback, m_Result));

Process the check result

Back to lib/methods/pluginchecktask.cpp -

void PluginCheckTask::ProcessFinishedHandler(const Checkable::Ptr& checkable, const CheckResult::Ptr& cr, const Value& commandLine, const ProcessResult& pr)

This reads everything from the process result and creates a new check result being processed. Since this ran in the thread pool asynchronously, it may occur that it took quite a while on heavily stressed systems.

        String output = pr.Output.Trim();

        std::pair<String, String> co = PluginUtility::ParseCheckOutput(output);
        cr->SetCommand(commandLine);
        cr->SetOutput(co.first);
        cr->SetPerformanceData(PluginUtility::SplitPerfdata(co.second));
        cr->SetState(PluginUtility::ExitStatusToState(pr.ExitStatus));
        cr->SetExitStatus(pr.ExitStatus);
        cr->SetExecutionStart(pr.ExecutionStart);
        cr->SetExecutionEnd(pr.ExecutionEnd);

        checkable->ProcessCheckResult(cr);

At this point the whole check execution thingy is done. Magic, isn't it?

Changes

MaxConcurrentChecks

Limits the currently running processes, defaults to 512.

Sandboxed processes

2.6 introduced a process spawn helper which executes the checks in an isolated environment. This was to ensure that no file descriptors from the main process leak into the plugin's process.

This can be

This was implemented for solving #2847.

Solutions

dnsmichi commented 5 years ago

Calculate average service latency

var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency

Example for 1000 services, check_interval = 10s calling a perl plugin which sleeps for 10s.

<112> => var res = {}; var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency
0.012186
<113> => var res = {}; var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency
0.012526
<114> => var res = {}; var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency
0.013776
<115> => var res = {}; var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency
0.013695
<116> => var res = {}; var l = 0; var c = 0; for (s in get_objects(Service)) { if (s.last_check_result) { var ex_delta = s.last_check_result.execution_end - s.last_check_result.execution_start; var s_delta = s.last_check_result.schedule_end - s.last_check_result.schedule_start; var latency = s_delta - ex_delta; l += latency; c += 1 } }; var avg_latency = l / c; avg_latency
0.013738

Get late service check results

get_objects(Service).filter(s => s.last_check < DateTime().value - 2 * s.check_interval).map(s => s.__name).len()
dnsmichi commented 5 years ago

7428 adds a CheckCommand object for the ITL for better performance analysis.

Things tried thus far

Open discussion

dnsmichi commented 5 years ago

@Al2Klimov Can you please add the findings from your Boost fiber analysis?

Al2Klimov commented 5 years ago
dnsmichi commented 5 years ago

Summary CW 35

From a technical view point, the structure how checks are executed and running is just fine. With the analysis above, specific things have become more clear:

Latency

If one or two of the above are true, check latency will increase. schedule_start is set when the check scheduler effectively inserts the task into the TP, schedule_end is set upon processing the check result.

Possible root causes

Plugin execution time.

If you run 1000 check_nwc_health in parallel, with a memory footprint of 3.3 mb for each Perl plugin, this takes a while to start, compile, execute and finish. Since this also includes SNMP tree iterations, such checks may be slower than the average ping.

If ping checks always run into a timeout, the default of 10 or 30 second applies. During this time the process is "stalled" in idle wait for ping to return. If you have 1000 tasks in the TP, and 1000 are pings which run into a timeout, you have the same problem as with a HTTP API or JSON-RPC server - it just won't accept new connections, or, in this case, it won't execute any more tasks.

Given that you may execute 200k services in a 5 to 10 minute interval, that's 20 to 40k per minute worst case. Let's stick with an average of 30k per minute, which renders this to

30000 / 60 = 500 per second

Granted that the max threads in the pool follow the CPU cores 2, resulting in e.g. 96 2 possible threads working on tasks, the general throughput depends on the check execution tasks to be really fast (< 1s). With the aforementioned timeouts, this is impossible to reach.

Threads vs CPU power

If you have many CPU cores, say 32, this allows for more threads in the thread pool. If those CPU cores are low end with say 800mhz, you won't win price compared to 4 cores with 3ghz. That gets even worse if context switches between threads become slower.

Check rate

If you would run 30k service checks per minute, this creates at least one cluster message for each check result being returned from the satellite to the master (and also the secondary satellite). The memory footprint of the CR mainly depends on the plugin output and performance data, and may range from a few bytes to many KBs too.

If you're not on a low profile bandwidth, this shouldn't harm the network layers to just send over the TCP packages resulting in JSON-RPC messages. Still, the messages are wrapped into JSON which needs to be encoded/decoded on each end. Also, when receiving such a message, it needs to be processed in order.

That being said, next to the checker's thread pool being overloaded, the cluster may be overloaded as well. You can measure this by extracting the metric from /v1/status/ApiListener inside the json_rpc struct. This also is returned with the icinga built-in check. If that's on a steep curve and doesn't lower itself over time, you really have a performance issue.

You can mitigate this even further with analysing the system's performance on context switching, IO and thread consuming for the process - e.g. with the added sar_perf CheckCommand in #7428.

Check Scheduler

Actually, no. Stop blaming the scheduler for everything. It isn't an AI nor does it provide machine learning capabilities. Though it has some intelligent scheduling inside to prevent problematic situations like:

Users tend to claim that check results are old after firing a reload, but this is just a spread check execution and won't be visible again after running for some time.

With solving the plugin execution time and making things fast and efficient, the core's scheduler will perform as intended. This also is documented, and if you have a better implementation, feel free to propose this in a detailed concept.

https://icinga.com/docs/icinga2/snapshot/doc/19-technical-concepts/#check-scheduler

Ideas like learning from check execution time and putting them into a low latency group or defined check group are just fine, but need to prove themselves not to cause any other harm.

Also, these ideas are not for a standalone core as we had with 1.x but are related to HA enabled zones and in addition, multiple levels of a distributed environment.

Max Concurrent Checks

The actually running checks differ from what is really possible. The MaxConcurrentChecks setting also takes the workload of the thread pool into account.

It doesn't make sense to put something into the full TP, rather checks are delayed. This also increases the check latency. Therefore this setting may not be reached at all if the underlaying performance suffers.

Conclusion

Users should do the following:

Developers will do the following for future releases after 2.11:

We've agreed on removing this task from 2.11, since the performance compared to older versions does not suffer. One performance problem in an aforementioned environment was caused with the downtime loop which we have mitigated and fixed already.

Therefore I am putting this into 2.12, as ongoing task.

Note

There is absolutely no guarantee that this specific issue gets resolved in this version. We will "fork" specific work tasks from this tracking issue, and carefully decide which problems to tackle and which feature to implement. The issue's state might not always be current too.

dnsmichi commented 5 years ago

Number of Open Files for Pipes

Set the Systemd limits

mkdir /etc/systemd/system/icinga2.service.d
cat >/etc/systemd/system/icinga2.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=50000
LimitNPROC=50000
TasksMax=infinity
EOF

Modify Icinga's RLIMIT

Debian:

vim /etc/default/icinga2

RHEL

vim /etc/sysconfig/icinga2

Content:

ICINGA2_RLIMIT_FILES=50000

Reload Systemd and Icinga

systemctl daemon-reload
systemctl restart icinga2

Verify

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && cat /proc/$p/limits | grep 'open files' && echo; done

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && lsof -p $p && echo; done