Open nikgoodley-ibboost opened 1 year ago
A friendly reminder that this issue had no activity for 30 days.
@mheon PTAL
The ENOENT bit is entirely normal? Podman's daemonless model means we often have multiple processes racing to do the same things, which we handle internally - hence no errors to the user. Logging these to audit will, indeed, produce rather stupendous volumes of logs, but that seems a logical result of an audit rule that logs every delete event from users?
EISDIR is more concerning, that sounds like something is coded wrong somewhere, I can take a look.
Thanks Matt. Completely agree about the volume of audited calls for deletions being a natural consequence; I was more giving the context for why something that, due to its nature, we'd want to run frequently has an impact (for reference the CIS benchmark rule is 4.1.4 in the RHEL 8 version, page 308+ in https://github.com/cismirror/old-benchmarks-archive/blob/master/CIS_Red_Hat_Enterprise_Linux_8_Benchmark_v1.0.0.pdf). I'd love not to have to follow that rule, but in many cases we're obliged to.
Thanks for the comment on EISDIR. I had assumed these were all related, which is why I thought it worth listing them all. On ENOENT (and ENOTEMPTY, which I'm not sure you noticed) though, it could well be that I misunderstand the point about racing, but don't they suggest either that it's trying to delete something without checking it's there (which could be considered moot, I know, but it seems a natural and trivial safeguard that would have the benefit of not creating a commonly audited syscall), or that it's trying the wrong operation given the object type?
In the particular use case we had, the boxes are essentially static content servers, so we see essentially no change in state beyond logs; it just so happened that the healthcheck was responsible for the vast majority of total change on the system, and it wasn't obvious why there's a need to be working with temporary overlay structures for something like a basic curl or hostname.
(I say this ignorant of internal mechanics).
If that's intentional and the best way to do it, no problem; it was just that the multiple delete failures suggested something was off, so if there was a chance to reduce the audit load at the same time it would be a bonus.
On ENOENT meaning we're not checking for existence - that's generally correct. Even if we did check, the file could be removed from under us between the check and the removal, so we have to check/handle ENOENT in the code anyways - it's generally easier to not check for existence at all, and just handle ENOENT.
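(A minimal Go sketch of that pattern, purely illustrative and not Podman's actual code: attempt the removal and treat ENOENT as success, rather than stat-ing first, which would still lose the race.)

package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// removeIgnoringENOENT is a hypothetical helper, not Podman's real code:
// try the removal directly and treat "already gone" as success, since a
// prior existence check could not stop another process winning the race.
func removeIgnoringENOENT(path string) error {
	if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	return nil
}

func main() {
	// Removing a path that never existed is not an error under this policy,
	// although the failed unlink syscall still happens and is still auditable.
	fmt.Println(removeIgnoringENOENT("/tmp/already-gone")) // <nil>
}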
I think we have to perform a directory removal when cleaning up an exec session, which seems like the likely cause of the EISDIR/ENOTEMPTY errors, I'll look into that.
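(Purely as an illustration of where those two error codes come from, under the assumption that the cleanup ends up issuing plain unlink/rmdir-style calls on a directory that still has content:)

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Stand-in for a session directory that still has something in it.
	dir, err := os.MkdirTemp("", "exec-session-*")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "leftover"), []byte("x"), 0o600); err != nil {
		panic(err)
	}

	// unlink(2) on a directory fails with EISDIR...
	fmt.Println(syscall.Unlink(dir)) // "is a directory"
	// ...and rmdir(2) on a non-empty directory fails with ENOTEMPTY.
	fmt.Println(syscall.Rmdir(dir)) // "directory not empty"
}

(If the cleanup goes through Go's os.Remove, note that on Unix it tries an unlink first and falls back to rmdir, so removing a directory that way produces one EISDIR-failing syscall even when nothing is wrong; whether that is the path taken here is an assumption on my part.)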
Fair enough, understood on that aspect, thanks
A friendly reminder that this issue had no activity for 30 days.
@mheon any update on this?
Issue Description
The healthchecks (and conmon?) seem to be attempting invalid delete syscalls on every iteration of the healthcheck. This adds a fair amount of noise (around 40% more entries) to
/var/log/audit/audit.log
with hundreds of MB of log entries per day for only one single basic healthcheck running on a 5s interval. This may seem pedantic, but due to established security audit rules it adds up to a lot of volume. We follow Linux hardening practices, which include the following audit rules (as referenced, for example, in the CIS RHEL audit rules https://www.tenable.com/audits/items/CIS_Red_Hat_EL7_STIG_v2.0.0_STIG.audit:41df435215c6a2bb28dc7d7665c637d9):
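(Representative form only; exact fields and key names vary by benchmark version, so treat this as illustrative rather than our literal rule set:)

-a always,exit -F arch=b64 -S unlink -S unlinkat -S rename -S renameat -F auid>=1000 -F auid!=4294967295 -k delete
-a always,exit -F arch=b32 -S unlink -S unlinkat -S rename -S renameat -F auid>=1000 -F auid!=4294967295 -k delete

Because these match on the syscall itself rather than its result, unsuccessful unlinkat attempts are recorded just like successful ones.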
Whilst this is always going to be quite heavy in some situations, with most applications it's not material, and if a file has to be deleted, that's just how it is. However, as 5 of the audit entries every cycle actually relate to unsuccessful deletes (non-existent files, an operation not expecting a directory, and non-empty directories), it hints that a lot of the volume might be avoided by checking the state of the target directory/file. Of course, generally minimising the footprint under typical CIS rules would be very much appreciated, but it's the volume of unsuccessful calls that caught the eye.
There seem to be two separate aspects, with conmon and podman healthcheck both trying to delete a directory based on a UUID that is created but apparently deleted by some other step (presumably?). Later audit entries indicate things are cleaned up, though the sequence isn't obvious to me.
Steps to reproduce the issue
podman run -itd --rm --name=busy --healthcheck-command '/bin/sh -c "hostname"' --healthcheck-interval 5s busybox
ausearch --start today -m syscall -sv no -i | tail -200
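(If the delete rules carry an audit key, as in the representative rules above, the failed calls can also be pulled out by key; the -k value here is an assumption about the local rule set:)

ausearch --start today -k delete -sv no -i | tail -200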
Describe the results you received
This is the interpreted audit output every healthcheck cycle (i.e. every 5 seconds in this example repro case).
The entries in audit.log create upwards of half a gigabyte of log, per 5s-interval container healthcheck, per day.
For conmon it seems to be looking for a UUID that exists transiently, i.e. repeatedly listing the directory shows it did exist for a second, but audit suggests it's trying to delete it after it's gone.
Describe the results you expected
I expect containers running essentially at idle, even with healthchecks, not to create vast amounts of audit logs; or, if files do have to be deleted, for them to be confirmed as valid targets before the attempt (if this is the reason for the multiple failed syscalls).
Given the need to run in compliant secure environments it would be much appreciated if the footprint of the healthchecks could be minimised.
podman info output
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
No
Additional environment details
This example is on AWS as described above but seems general.
Additional information
So far it looks like this is the behaviour of every healthcheck, but we run strictly rootless at the moment.