falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0
231 stars 164 forks source link

fd.name some times contain bogus <UNKNOWN> part in "Write below root" event #1892

Open petterreinholdtsen opened 8 months ago

petterreinholdtsen commented 8 months ago

Some times the events from sandbox rule "Write below root" contain a incomplete path, making it hard to know the location of the file in question. This is an example message:

"14:27:16.402853438: Error File below / or /root opened for writing (file=\<UNKNOWN>cf_lock.lmdb.lock evt_type=openat user=root user_uid=0 user_loginuid=-1 process=cf-agent proc_exepath=/usr/bin/cf-agent parent=cf-execd command=cf-agent terminal=0 exe_flags=O_RDWR container_id=host container_name=host)",

The process triggering the event is cfengine with its lock file, and in other similar events the fd.name value is "//cf_lock.lmdb.lock", so I suspect the real value of the \<UNKNOWN> part should be "/".

This is running on a RHEL 9.3 machine using the "modern" ebpf probe, with a setup where all rule sets from https://github.com/falcosecurity/rules/ are enabled. The machine uses cfengine to keep the setup correct and up to date.

I expected fd.name to contain a full path to the opened file, and suspect there is a bug somewhere causing the \<UNKNOWN> part.

Tue Feb 27 14:35:54 2024: Falco version: 0.37.1 (x86_64)
Tue Feb 27 14:35:54 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Feb 27 14:35:54 2024: System info: Linux version 5.14.0-284.11.1.el9_2.x86_64 (mockbuild@x86-vm-09.build.eng.bos.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) falcosecurity/falco#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023
{"default_driver_version":"7.0.0+driver","driver_api_version":"8.0.0","driver_schema_version":"2.0.0","engine_version":"31","engine_version_semver":"0.31.0","falco_version":"0.37.1","libs_version":"0.14.3","plugin_api_version":"3.2.0"}

Tue Feb 27 14:36:59 2024: Loading rules from file /etc/falco/rules.d/99-settings.yaml Tue Feb 27 14:36:59 2024: /etc/falco/rules.d/99-settings.yaml: Ok, with warnings 1 Warnings: In rules content: (/etc/falco/rules.d/60-nidhogg.yaml:0:0) macro 'allowed_tmp_files': (/etc/falco/rules.d/60-nidhogg.yaml:41:2)

{ "machine": "x86_64", "nodename": "soria.uio.no", "release": "5.14.0-284.11.1.el9_2.x86_64", "sysname": "Linux", "version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023" }


- Cloud provider or hardware configuration:
- OS: Red Hat Enterprise Linux 9.3
- Kernel: Linux some-host 5.14.0-284.11.1.el9_2.x86_64 falcosecurity/falco#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
- Installation method: RPM
incertum commented 8 months ago

Thanks @petterreinholdtsen:

Just to clarify

Some times the events from sandbox rule "Write below root" contain a incomplete path

is truly sometimes, not all the times?

Would you mind observing again with the following additional output fields? %fd.nameraw %fs.path.name %fs.path.nameraw %proc.cwd (https://falco.org/docs/reference/rules/supported-fields/) -> are all of these fields missing?

In addition, is it some new issues or have you observed this before in Falco? Plus could you check if you have a decent amount of dropped open* syscalls leveraging https://falco.org/docs/metrics/falco-metrics/, see also https://falco.org/docs/troubleshooting/dropping/.

petterreinholdtsen commented 8 months ago

[Melissa Kilby]

is truly sometimes, not all the times?

Yes. Here is an example with // instead of :

12:26:02.996402180: Error File below / or /root opened for writing (file=//cf_lock.lmdb.lock evt_type=openat user=root user_uid=0 user_loginuid=-1 process=cf-agent proc_exepath=/usr/bin/cf-agent parent=cf-execd command=cf-agent terminal=0 exe_flags=O_RDWR container_id=host container_name=host)

This is on the same machine.

Would you mind observing again with the following additional output fields? %fd.nameraw %fs.path.name %fs.path.nameraw %proc.cwd (https://falco.org/docs/reference/rules/supported-fields/) -> are all of these fields missing?

I will see what I can do. I do not have direct access to any of the problematic machines, only the logs from them, but will try to figure out a way to test a modified setup one one of the machines with this behavior.

In addition, is it some new issues or have you observed this before in Falco? Plus could you check if you have a decent amount of dropped open* syscalls leveraging https://falco.org/docs/metrics/falco-metrics/, see also https://falco.org/docs/troubleshooting/dropping/.

In total for the last event bulk I see 18 "Falco internal: syscall event drop" events across 5 machines and 8000 events. All of them happened on one of the machines with the path, but the path show up on more than one machine.

-- Happy hacking Petter Reinholdtsen

incertum commented 8 months ago

@petterreinholdtsen would it be possible to instead use the new https://falco.org/docs/metrics/falco-metrics/ framework as it supports monotonic counters? That way we know more accurately how many dropped events there are since the beginning of launching Falco and more. Thanks!

Looking forward to hearing back from you in regards to the other fields %fd.nameraw %fs.path.name %fs.path.nameraw %proc.cwd.

petterreinholdtsen commented 8 months ago

[Melissa Kilby]

@petterreinholdtsen would it be possible to instead use the new https://falco.org/docs/metrics/falco-metrics/ framework as it supports monotonic counters?

I am reluctant to add this to all our hosts, as we try to reduce the noise and number of irrelevant events in the log, and the metrics seem add regular quite a lot of "non-events' to the log.

Looking forward to hearing back from you in regards to the other fields %fd.nameraw %fs.path.name %fs.path.nameraw %proc.cwd.

This change rolled out this weekend, and here are a few example log events, one with and one with a more path like file value.

Error File below / or /root opened for writing (file=cf_lock.lmdb.lock nameraw=cf_lock.lmdb.lock pathname=/etc/sssd/cf_lock.lmdb.lock pathnameraw=cf_lock.lmdb.lock proccwd=/etc/sssd/ evt_type=openat user=root user_uid=0 user_loginuid=-1 process=cf-agent proc_exepath=/usr/bin/cf-agent parent=cf-execd command=cf-agent terminal=0 exe_flags=O_RDWR container_id=host container_name=host)

Error File below / or /root opened for writing (file=//cf_lock.lmdb.lock nameraw=cf_lock.lmdb.lock pathname=/var/cfengine/state/cf_lock.lmdb.lock pathnameraw=cf _lock.lmdb.lock proccwd=/var/cfengine/state/ evt_type=openat user=root user_uid=0 user_loginuid=-1 pr ocess=cf-agent proc_exepath=/usr/bin/cf-agent parent=cf-execd command=cf-agent terminal=0 exe_flags=O _RDWR container_id=host container_name=host)

-- Happy hacking Petter Reinholdtsen

petterreinholdtsen commented 8 months ago

[Petter Reinholdtsen]

This change rolled out this weekend, and here are a few example log events, one with and one with a more path like file value.

Perhaps the relevant difference is the file permission of the cwd directory?

drwx------. 1 root root 36 jan. 12 01:00 /etc/sssd/ drwxr-xr-x. 1 root root 318 jan. 18 12:37 /var/cfengine/

Even with the latter, '//' seem like the wrong path.

-- Happy hacking Petter Reinholdtsen

incertum commented 8 months ago

Re https://falco.org/docs/metrics/falco-metrics/ you can adjust the timer aka how often you want to emit a log. Just FYI.

Error File below / or /root opened for writing (file=cf_lock.lmdb.lock nameraw=cf_lock.lmdb.lock pathname=/etc/sssd/cf_lock.lmdb.lock pathnameraw=cf_lock.lmdb.lock proccwd=/etc/sssd/ evt_type=openat user=root user_uid=0 user_loginuid=-1 process=cf-agent proc_exepath=/usr/bin/cf-agent parent=cf-execd command=cf-agent terminal=0 exe_flags=O_RDWR container_id=host container_name=host)

Thank you for deploying the additional output fields 🙏 !

@LucaGuerra and @FedeDP given that pathname=/etc/sssd/cf_lock.lmdb.lock is correct (using the same new custom path concatenation method) something must be wonky wrt to how we store the cwd to then concat the path for fd.name since for the fs.path filters we fetch the args directly. Or something else in the fd. filtercheck logic could be wonky.

@petterreinholdtsen this needs a patch in Falco's libs, ETA unknown and we should consider releasing a patch release. We will keep you updated. @falcosecurity/falco-maintainers @leogr

incertum commented 8 months ago

Given your examples above https://github.com/falcosecurity/libs/issues/1892 effectively the derivation of fd.name is more sophisticated and perhaps fs.path.name is actually wrong here. fs.path.name always concatenates the proc.cwd and the raw open* syscall arg wrt to opening a file, but it may be inaccurate. In addition, assuming cf_lock.lmdb.lock in both logs you shared are at the same location you see that the full path is different when looking at fs.path.name.

CC @FedeDP who shared a similar assessment. In summary, I am not sure if there is something to fix easily. At the same time if fs.path.name is wrong we should do something about that and try to make them equal?

@petterreinholdtsen did you observe a regression between release or is this just a new observation? Furthermore, could you check for example what the true location of cf_lock.lmdb.lock is on your system? Percentage wise how rare is the observation of <UNKNOWN> or // at the beginning?

Code in question:

More references:

poiana commented 5 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

petterreinholdtsen commented 5 months ago

[Melissa Kilby]

@petterreinholdtsen did you observe a regression between release or is this just a new observation? Furthermore, could you check for example what the true location of cf_lock.lmdb.lock is on your system? Percentage wise how rare is the observation of <UNKNOWN> or // at the beginning?

I just noticed in during testing. Do not know how often it happen.

-- Happy hacking Petter Reinholdtsen

leogr commented 5 months ago

/remove-lifecycle stale

This issue should be fixed in libs. I'm transferring this issue to the libs repo.

leogr commented 5 months ago

cc @FedeDP @Andreagit97

poiana commented 2 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

petterreinholdtsen commented 2 months ago

I'm still waiting for a resolution.

-- Happy hacking Petter Reinholdtsen

FedeDP commented 2 months ago

/remove-lifecycle stale

Sorry we had no time to look into this during this release cycle; setting next milestone: /milestone 0.19.0

FedeDP commented 1 day ago

/milestone 0.20.0