falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0
7.1k stars 877 forks source link

Metrics don't follow Prometheus best practices #3271

Open Issif opened 6 days ago

Issif commented 6 days ago

Describe the bug

Hi,

I was able to collect the Falco metrics with Grafana Alloy and forward them to my Grafana Cloud account:

I'm noticing anyway some issues with the metrics.

First, they are not following the best practices of Prometheus. For example, we have a metric by rule loaded in Falco, and not a metric with label to create series:

falcosecurity_falco_rules_Polkit_Local_Privilege_Escalation_Vulnerability_CVE_2021_4034_total{raw_name="rules.Polkit_Local_Privilege_Escalation_Vulnerability_CVE_2021_4034"}
falcosecurity_falco_rules_Java_Process_Class_File_Download_total{raw_name="rules.Java_Process_Class_File_Download"}
falcosecurity_falco_rules_Modify_Container_Entrypoint_total{raw_name="rules.Modify_Container_Entrypoint"}
falcosecurity_falco_rules_Decoding_Payload_in_Container_total{raw_name="rules.Decoding_Payload_in_Container"}
falcosecurity_falco_rules_Basic_Interactive_Reconnaissance_total{raw_name="rules.Basic_Interactive_Reconnaissance"}
...

The issues with that pattern are:

The good practice would be to create a metric and use labels:

falcosecurity_falco_rules_total{rule_name="Basic_Interactive_Reconnaissance"}
falcosecurity_falco_rules_total{rule_name="Privilege_Escalation_Vulnerability_CVE_2021_4034"}
falcosecurity_falco_rules_total{rule_name="Java_Process_Class_File_Download"}
falcosecurity_falco_rules_total{rule_name="Modify_Container_Entrypoint"}
...

This pattern, happens with a lot of different metrics: These:

falcosecurity_falco_memory_rss_bytes{raw_name="memory_rss_bytes"}
falcosecurity_falco_memory_vsz_bytes{raw_name="memory_vsz_bytes"}
falcosecurity_falco_memory_pss_bytes{raw_name="memory_pss_bytes"}

Should be:

falcosecurity_falco_memory_bytes{type="rss"}
falcosecurity_falco_memory_bytes{type_name="vsz"}
falcosecurity_falco_memory_bytes{type_name="pss"}

etc, etc

Another issue is the labels are not relevant at all, we have raw_name for all series, when they are supposed to give context. See my example:

falcosecurity_falco_falco_sha256_rules_file_override_rules_info{raw_name="falco_sha256_rules_file_override_rules"
falcosecurity_falco_falco_sha256_rules_file_falco_sandbox_rules_info{raw_name="falco_sha256_rules_file_falco_sandbox_rules"
falcosecurity_falco_falco_sha256_rules_file_falco_incubating_rules_info{raw_name="falco_sha256_rules_file_falco_incubating_rules"
falcosecurity_falco_falco_sha256_rules_file_falco_rules_info{raw_name="falco_sha256_rules_file_falco_rules"

Should be:

falcosecurity_falco_sha256_rules_info{rule_file="override_rules"}
falcosecurity_falco_sha256_rules_info{rule_file="falco_sandbox_rules"}
falcosecurity_falco_sha256_rules_info{rule_file="falco_incubating_rules"}
falcosecurity_falco_sha256_rules_info{rule_file="file_falco_rules"}

It's important to have clear labels, and possibly the same from one metric to another to allow to create filters in the dashboards. They are also used to create dynamic graphs (thanks to group_by).

I also think, more labels should be added, to allow to filter all graphs easily, see what I did in Falcosidekick:

falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="azure-ip-masq-agent-bhsqt",priority="Informational",rule="Launch Sensitive Mount Container"} 1
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="cloud-node-manager-jxg2r",priority="Informational",rule="Launch Sensitive Mount Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Privileged Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Sensitive Mount Container"} 3
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azurefile-node-d28zb",priority="Informational",rule="Launch Privileged Container"} 2

It allows to scope to a namespace, a priority, etc.

How to reproduce it

Enable the metrics and scrape the /metrics endpoint.

Expected behaviour

Screenshots

image

Environment

Additional context

incertum commented 6 days ago

Hey @Issif no opinion re what's best to create better dashboards and other benefits. Note that we also have the output rules in JSON where each metric is it's own key, which explains a bit why the rules counter etc are their own metric.

Makes sense:

I would not change:

@Issif do you want a patch release to address (1) and (2)? (3) would only work for the next release. CC @leogr

incertum commented 6 days ago

@Issif

One additional question: the rules counter always had bunch of labels including the original rule name, you did see them? As you didn't include them above in your example outputs.

slight deviation in naming suggestion, but exact idea you proposed @Issif for items under 1.) and 2.)

# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco_rules",sha256="58a4f187b6b04b43ae938132325cbbb6b2bb9eb4e76e553f5b4b7b5b360ee0b4"} 1
# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco_rules test",sha256="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"} 1
# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco-sandbox_rules",sha256="90a168fcbd3e5106f2e9cea34b33d1caa64c9d2c648f8b243bef64f36f3099ea"} 1
# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco-incubating_rules",sha256="ff12f8bb19b85c5b61fc28cc46ee5b1d47e6512fe932fc0486f98b3546db681e"} 1
# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco-deprecated_rules",sha256="a977dfeb6acf6ad85d6d50420894681c0fec07415eb8bf774942076b12c86031"} 1
# HELP falcosecurity_falco_falco_sha256_rules_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_rules_files_info gauge
falcosecurity_falco_falco_sha256_rules_files_info{raw_name="falco_sha256_rules_files",file_name="falco_rules",sha256="58a4f187b6b04b43ae938132325cbbb6b2bb9eb4e76e553f5b4b7b5b360ee0b4"} 1
# HELP falcosecurity_falco_falco_sha256_config_files_info https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_falco_sha256_config_files_info gauge
falcosecurity_falco_falco_sha256_config_files_info{raw_name="falco_sha256_config_files",file_name="falco",sha256="4638f808f51021448d866309f393317aa61bc2b4ce42fa40237c3494ecd84bff"} 1

# HELP falcosecurity_falco_rules_matches_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_rules_matches_total counter
falcosecurity_falco_rules_matches_total{raw_name="rules_matches_total"} 10
# HELP falcosecurity_falco_rules_counters_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_rules_counters_total counter
falcosecurity_falco_rules_counters_total{raw_name="rules_counters",priority="4",rule_name="Read sensitive file untrusted",source="syscall",tags="T1555, container, filesystem, host, maturity_stable, mitre_credential_access"} 10

rules_counters now only exported for rules with a counter greater than O.

As pointed out the hostname should not be a label and its own metric aka it remains the same, for example:

# HELP falcosecurity_evt_hostname_info https://falco.org/docs/metrics/
# TYPE falcosecurity_evt_hostname_info gauge
falcosecurity_evt_hostname_info{raw_name="hostname",hostname="test"} 1
Issif commented 5 days ago

One additional question: the rules counter always had bunch of labels including the original rule name, you did see them? As you didn't include them above in your example outputs.

Just missed them when I cropped the lines.

As pointed out the hostname should not be a label and its own metric aka it remains the same, for example:

For the counters for memory, it's ok, they are not dynamic like the one for the rules. Good point for the hostnames too. Thanks @incertum

incertum commented 5 days ago

@Issif mind checking above example outputs once more for correctness. Re-pushed to not accidentally remove the sha256 ...

Issif commented 4 days ago

Your proposal seems good to me.

I'm just wondering if this metric is useful:

# HELP falcosecurity_falco_rules_matches_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_rules_matches_total counter
falcosecurity_falco_rules_matches_total{raw_name="rules_matches_total"} 10

With

# HELP falcosecurity_falco_rules_counters_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_rules_counters_total counter
falcosecurity_falco_rules_counters_total{raw_name="rules_counters",priority="4",rule_name="Read sensitive file untrusted",source="syscall",tags="T1555, container, filesystem, host, maturity_stable, mitre_credential_access"} 10

Without specifying a label, the count will be the same.

incertum commented 4 days ago

We have it for the JSON output rules. We can remove it for Prometheus if you want given the refactor.

leogr commented 2 days ago

@Issif do you want a patch release to address (1) and (2)? (3) would only work for the next release. CC @leogr

Since metrics feature is "incubating", I don't see any urgency to deliver the fixes. So, I'd avoid an hotfix release, unless we have compelling reasons to do so.

leogr commented 2 days ago

I also think, more labels should be added, to allow to filter all graphs easily, see what I did in Falcosidekick:

falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="azure-ip-masq-agent-bhsqt",priority="Informational",rule="Launch Sensitive Mount Container"} 1
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="cloud-node-manager-jxg2r",priority="Informational",rule="Launch Sensitive Mount Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Privileged Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Sensitive Mount Container"} 3
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azurefile-node-d28zb",priority="Informational",rule="Launch Privileged Container"} 2

It allows to scope to a namespace, a priority, etc.

I agree on this. falco-exporter provides similar behavior: https://github.com/falcosecurity/falco-exporter/blob/master/pkg/exporter/exporter.go#L29-L43

incertum commented 1 day ago

I also think, more labels should be added, to allow to filter all graphs easily, see what I did in Falcosidekick:

falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="azure-ip-masq-agent-bhsqt",priority="Informational",rule="Launch Sensitive Mount Container"} 1
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="cloud-node-manager-jxg2r",priority="Informational",rule="Launch Sensitive Mount Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Privileged Container"} 2
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azuredisk-node-qctfm",priority="Informational",rule="Launch Sensitive Mount Container"} 3
falco_events{hostname="aks-agentpool-10286953-vmss000000",k8s_ns_name="kube-system",k8s_pod_name="csi-azurefile-node-d28zb",priority="Informational",rule="Launch Privileged Container"} 2

It allows to scope to a namespace, a priority, etc.

I agree on this. falco-exporter provides similar behavior: https://github.com/falcosecurity/falco-exporter/blob/master/pkg/exporter/exporter.go#L29-L43

Hey @leogr which additional labels are you referring to?

Re the hostname, please see https://github.com/falcosecurity/falco/issues/3271#issuecomment-2206918760 or https://github.com/falcosecurity/libs/blob/0ec2ad8422b2ac06a034f335c7768c5ab0d13736/userspace/libsinsp/metrics_collector.h#L160-L164 we have it as separate / own metric following the recommendations in the documentation.

Re pod name and namespace name: Would you mind getting me up to speed how you would retrieve this in Falco? IMO given that they can change per event it may be best suited to do that analyses based on the logs in your data lake and instead focus on the simple rules counters in Prometheus. The Falco rules counters are currently not partitioned by namespace or pod name. I can imagine that such granular groupBy queries can become very large in large deployments?