Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
1.98k stars 571 forks source link

ConfigPackageUtility misses atomic file operations and #10055

Open julianbrost opened 1 month ago

julianbrost commented 1 month ago

In https://github.com/Icinga/icinga2/blob/v2.14.2/lib/remote/configpackageutility.cpp, files are written in-place using a basic std::ofstream. So if the Icinga 2 process is terminated while a file is being written, this can result in an incomplete file that will still be picked up when the configuration is loaded, resulting in a possibly incomplete or otherwise broken configuration. Additionally, as no fsync() or similar is done, other corruption may happen if the whole system crashes (I think we hat such problems in the past on Windows in particular).

Some of the std::ofstream should probably simply be replaced by our own AtomicFile, for example when writing files like /var/lib/icinga2/api/packages/director/active.conf.

However, if AtomicFile would be used everywhere, especially where the actual config package contents are written, this will probably slow down writing all the files quite a bit as it would enforce a sequential "write one file, sync it to disk, write the next one, ..." pattern. Keeping the std::ofstream there and syncing all files afterwards should allow the operating system to combine the sync operations, making them cheaper.[^1] The whole config stage directory tree can't be written atomically, so the atomic commit at the end should be replacing the active stage.

There have been a few reports of broken config packages in the past, the problem described here probably is their cause.

ref/IP/53240[^2]

[^1]: I did not specifically test this, but I expect that if fsync() calls are bundled at the end, some of them may not even have to do anything as the file was flushed in the meantime already. At least that optimization is possible, unlike when AtomicFile is used sequentially. [^2]: I noticed this problem while looking into this support ticket, but I couldn't confirm it as the likely cause.