Perfdata processing can't catch up

mathiasaerts commented 1 year ago

Hi,

We have an Icinga 2 setup with a lot of hosts and about 78000 checks in total. We had the perfdata writer in Icinga enabled for quite some time before installing this module. However, processing does not seem fast enough to keep up with incoming data, and processing the existing perfdata files seems too slow to catch up. I've already removed perfdata files older than 1 month, but it still seems to be too much data to process.

When looking at the code, it seems that all files must be processed chronologically. Is this a technical limitation of RRD, or would it be possible to process files in a random order? If that's the case, perhaps multiple threads/procs can be used to process the data faster. If this is not possible, I was wondering if it would be an option to process host and service perfdata files separately (perhaps via a command line parameter --perfdata <all (default) | host | service>). This way, we could run 2 processes in parallel and crunch through the data faster, and we should hopefully also be able to keep up with the amount of incoming data.
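
To make the ordering question concrete, here is a minimal sketch of listing rotated perfdata files oldest first, with an optional host/service filter like the proposed --perfdata option. It assumes the files are named host-perfdata.<epoch> and service-perfdata.<epoch>; the function name and the kind parameter are hypothetical and not taken from the module.

```python
import re
from pathlib import Path

# Assumed naming of rotated files: "host-perfdata.<epoch>" / "service-perfdata.<epoch>".
ROTATED = re.compile(r"^(host|service)-perfdata\.(\d+)$")


def list_perfdata_files(directory, kind="all"):
    """Return rotated perfdata files, oldest first.

    kind is "all", "host" or "service" (mirrors the proposed --perfdata option).
    """
    found = []
    for path in Path(directory).iterdir():
        match = ROTATED.match(path.name)
        if match is None:
            continue  # skip unrelated files and files without a timestamp suffix
        file_kind, epoch = match.group(1), int(match.group(2))
        if kind in ("all", file_kind):
            found.append((epoch, file_kind, path))
    # Sort by rotation timestamp so each RRD only ever sees increasing times.
    found.sort(key=lambda item: (item[0], item[1]))
    return [path for _, _, path in found]
```

Two processes could then call list_perfdata_files(directory, "host") and list_perfdata_files(directory, "service") independently, each still working through its own files in chronological order.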

Also, I noticed the runtime limit is currently hardcoded. I understand this is necessary to prevent overlap when running the command every minute using cron, but it also limits the throughput for processing a backlog like ours. It would be nice if maximum runtime could be specified on the command line --max-runtime <60 (default) | 0 (disabled) | N seconds>. I might go on and disable the cronjob and run the command in a while true loop, but this still has some overhead for scanning the perfdata directory on each run.
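
As an illustration of what a configurable limit could look like, here is a rough sketch of a processing loop with a runtime budget, where 0 disables the limit. The names process_with_budget, process_file and max_runtime are hypothetical; this is not the module's code.

```python
import time


def process_with_budget(files, process_file, max_runtime=60, clock=time.monotonic):
    """Process files in order until the runtime budget is used up.

    max_runtime=0 disables the limit. Returns the files that were not processed.
    """
    started = clock()
    remaining = list(files)
    while remaining:
        if max_runtime and clock() - started >= max_runtime:
            break  # budget exhausted; leave the rest for the next run
        process_file(remaining.pop(0))
    return remaining
```

The budget is only checked between files in this sketch, so a single large file can still push a run past the limit.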

I'm just looking for ways to increase performance and make this module work for our use case.

Best regards, Mathias

Virsacer commented 1 year ago

Hi,

yes, data must be processed chronologically. Otherwise RRD will show a message like this: illegal attempt to update using time 1669260084 when last update time is 1669260084 (minimum one second step)
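
To illustrate the constraint, here is a small sketch that keeps only updates with a strictly increasing timestamp, which is what RRD accepts. The function name and the (timestamp, value) pair layout are assumptions made for the example; it is not how the module handles the situation.

```python
def drop_stale_updates(updates, last_update):
    """Keep only updates whose timestamp is strictly newer than the previous one.

    updates: iterable of (epoch_seconds, value) pairs, in any order.
    last_update: epoch of the last update already stored in the RRD.
    """
    accepted = []
    for timestamp, value in sorted(updates, key=lambda u: u[0]):
        if timestamp <= last_update:
            continue  # RRD would reject this one (minimum one second step)
        accepted.append((timestamp, value))
        last_update = timestamp
    return accepted
```

With last_update = 1669260084, a second result carrying the same timestamp is dropped instead of triggering the error above.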

So processing host and service data in parallel could speed things up, but it would most likely mess up the statistics. Also the actual benefit depends on the ratio of hosts to services.

But sure, having some kind of "bulk mode" seems like a good idea for processing a big backlog.

One thing you didn't mention: Are you using the PHP RRD module? In my case this is A LOT faster than using the binary.

mathiasaerts commented 1 year ago

I thought that would be the case. I've already spotted these errors in the log as well. Not sure how there would be multiple results for a single check at the same epoch timestamp. I did notice that in some cases, the command might run longer than 60s, and it could overlap. Perhaps this happens when it is processing a large perfdata file, since it only checks runtime in between files.

Why do you think processing host and service perfdata would mess up the statistics? Looking at the xml and rrd files, it seems that host data is stored separately in _HOST_ files. We have about 35 checked services per host, so maybe splitting it up is not that beneficial. There are equal amounts of perfdata files though, but service files will definitely be a lot larger.

Yes, I have the PHP RRD extension installed. A green checkmark is shown in the module settings, so I'm assuming it is using that. I don't even have the rrdtool binary installed AFAIK.

Virsacer commented 1 year ago

I still have these messages sometimes. See also https://github.com/Icinga/icinga2/issues/9405

I was talking about the processing statistics /rrdtool/graph?host=.pnp-internal, but on second thought it won't be that bad. PerfdataWriter rotates the host and service files at the same time, but the number of lines will vary.

Ok, then it is already "fast"... When perfdata is processed, lots of files will be written - maybe IO is a bottleneck? Can you share the statistics-graphs?

mathiasaerts commented 1 year ago

Interesting... Could it have anything to do with the service_format_template? In the Icinga docs, they're using $icinga.timet$, while the readme for this repo says $service.last_check$. I have not set this in the Icinga configuration, so it should be using the default value.

In any case, good point about IO bottleneck, I didn't check this yet. The Icinga master is running on DigitalOcean, and I had previously migrated perfdata to a block volume since it was becoming too large for local storage. However, this is not ideal performance-wise. So I resized the droplet and converted it back to local storage, which did increase performance a lot. Looking at the amount of perfdata files, it does seem that the total amount is slowly decreasing now.

I'm also running the process command in a loop now, which reduces idle time.

Block storage: (statistics graph, image not included)

Local storage: (statistics graph, image not included)

Virsacer commented 1 year ago

Oh, I did not notice that - have to try it...

That's good news :-) The average number of written rrd-files increased ~5 times. So that's ~1120 checkresults written to ~900 rrd-files per second. We have a much smaller environment with ~1120 checkresults written to ~1830 rrd-files per second.

Also, you have a lot of skipped rows. You should set enable_perfdata = false on checks that do not actually generate perfdata.

Virsacer commented 1 year ago

Icinga uses $service.last_check$ by default: https://github.com/Icinga/icinga2/blob/master/lib/perfdata/perfdatawriter.ti. I also saw the message when using $icinga.timet$, so it does not seem to matter...
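
For anyone who wants to check their own files for repeated timestamps, here is a rough sketch that parses perfdata lines and reports (host, service, timestamp) combinations that occur more than once. It assumes tab-separated KEY::value fields with the keys TIMET, HOSTNAME and SERVICEDESC; both function names are hypothetical.

```python
def parse_perfdata_line(line):
    """Split one tab-separated perfdata line into a dict of KEY -> value."""
    fields = {}
    for part in line.rstrip("\n").split("\t"):
        key, sep, value = part.partition("::")
        if sep:  # ignore fragments without a "::" separator
            fields[key] = value
    return fields


def find_duplicate_timestamps(lines):
    """Return (host, service, timet) keys that occur more than once."""
    seen, duplicates = set(), []
    for line in lines:
        fields = parse_perfdata_line(line)
        # Assumed key names; adjust them if your format template differs.
        key = (fields.get("HOSTNAME"), fields.get("SERVICEDESC"), fields.get("TIMET"))
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

Running find_duplicate_timestamps over the lines of one service perfdata file would show whether the duplicates are already present in what Icinga writes, independent of the template macro used for the timestamp.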

Virsacer / icingaweb2-module-rrdtool

Perfdata processing can't catch up #9