falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Falco 0.33.1 "node name does not correspond to a node in the cluster" during startup due to jq filter failure on NotReady node with status.addresses missing #2358

Closed. wuub closed this issue 1 year ago.

wuub commented 1 year ago

Describe the bug

When starting falco on EKS with:

        - '--k8s-node'
        - $(FALCO_K8S_NODE_NAME)

we've experienced whole-DaemonSet failures (all new pods failing to start or restart) with errors like: Error fetching K8s data: Failing to enrich events with Kubernetes metadata: node name does not correspond to a node in the cluster: ip-xxx-yy-zzz-www.us-west-1.compute.internal. After some digging, and after enabling libs_logger.enabled: true, we were able to narrow it down to: https://github.com/falcosecurity/libs/blob/01c07df720708f19b6ba3e2f6857bddb8c2c4779/userspace/libsinsp/socket_handler.h#L792

causing this error line:

[libs]: Socket handler (k8s_node_handler_state), [https://172.20.0.1] filter processing error "json_query filtering result invalid."; JSON: <{"kind":"NodeList","apiVersion":"v1","metadata":{[HUMONGOUS-API-RESPONSE]}}>, jq filter: <{ type: "ADDED", apiVersion: .apiVersion, kind: "Node",  items: [  .items[] |   {   name: .metadata.name,   uid: .metadata.uid,   timestamp: .metadata.creationTimestamp,   labels: .metadata.labels,   addresses: [.status.addresses[].address] | unique   } ]}>

Digging further, we found that this failure is caused by a NotReady node in the response that does not present any .addresses array in its .status field:

example:

{
    "metadata": {
        "name": "ip-10-5-13-255.us-west-1.compute.internal",
    },
    // ....
    "status": {
        "conditions": [
          // ...
        ],
        "daemonEndpoints": {
            "kubeletEndpoint": {
                "Port": 0
            }
        },
        "nodeInfo": {
            "machineID": "",
            "systemUUID": "",
            "bootID": "",
            "kernelVersion": "",
            "osImage": "",
            "containerRuntimeVersion": "",
            "kubeletVersion": "",
            "kubeProxyVersion": "",
            "operatingSystem": "",
            "architecture": ""
        }
    }
}
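To make the failure mode concrete, here is a sketch (in Python, mirroring the quoted jq step addresses: [.status.addresses[].address] | unique) of why such a node breaks the filter, plus a defensive variant; the jq analogue of the fix would be something like [.status.addresses[]?.address] | unique, using jq's error-suppressing []? iterator. Function names and sample data here are illustrative, not Falco's actual code.

```python
def node_addresses(node):
    """Mirror of the jq step `[.status.addresses[].address] | unique`.

    Like jq's `.[]` on a missing/null value, this raises when
    `status.addresses` is absent, which is what the NotReady node
    above looks like.
    """
    return sorted({a["address"] for a in node["status"]["addresses"]})


def node_addresses_safe(node):
    """Defensive variant: treat a missing `status.addresses` as [],
    rather than failing the whole NodeList filter."""
    addrs = node.get("status", {}).get("addresses") or []
    return sorted({a["address"] for a in addrs})


# Illustrative nodes: one healthy, one shaped like the NotReady example.
ready = {"status": {"addresses": [
    {"type": "InternalIP", "address": "10.5.13.1"},
    {"type": "Hostname", "address": "ip-10-5-13-1"},
]}}
not_ready = {"status": {"conditions": []}}  # no `addresses` key at all
```

The point is that one malformed item aborts processing of the entire NodeList, which is why a single NotReady node can take down every new pod of the DaemonSet.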

How to reproduce it

Remove the status.addresses field from a single node in the response of https://172.20.0.1/api/v1/nodes?pretty=false (e.g. by having a NotReady node in the cluster).
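To check whether a cluster currently contains such a node, one can scan a NodeList response for items lacking status.addresses; a minimal sketch (the function name and sample data are illustrative):

```python
def nodes_missing_addresses(node_list):
    """Return names of nodes in a /api/v1/nodes NodeList response
    that have no `status.addresses` field (e.g. NotReady nodes)."""
    return [
        n["metadata"]["name"]
        for n in node_list.get("items", [])
        if not n.get("status", {}).get("addresses")
    ]


# Illustrative NodeList with one healthy and one NotReady-shaped node.
node_list = {"kind": "NodeList", "apiVersion": "v1", "items": [
    {"metadata": {"name": "ok-node"},
     "status": {"addresses": [{"type": "InternalIP",
                               "address": "10.0.0.1"}]}},
    {"metadata": {"name": "ip-10-5-13-255.us-west-1.compute.internal"},
     "status": {"conditions": []}},
]}
```

Feeding it the parsed output of kubectl get nodes -o json would flag any node that triggers this failure.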

Expected behaviour

Such a node should not prevent all other Falco pods from starting.

Environment

jasondellaluce commented 1 year ago

/milestone 0.34.0