fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

No metadata when switching to containerd windows nodes #333

Closed: danfinn closed this issue 2 years ago

danfinn commented 2 years ago

fluentd 1.13, fluent-plugin-kubernetes_metadata_filter 2.6

Currently we use this plugin to add metadata to logs from windows containers running on nodes that use the docker runtime environment. As of k8s 1.23 the default is to use containerd and I am doing some research to make sure all of our existing stuff will continue to work when switching over to containerd. On an existing cluster (in Azure) I added a new kubernetes 1.21.7 windows nodepool that uses containerd and I deployed a test windows container to it.

Since we are using a fluentd daemonset, a fluentd pod was automatically deployed to this new node with our existing configuration. Nothing changed. We use fluentd to ship logs to elasticsearch. The logs from the pod on this new containerd node are making it to elasticsearch; however, no kubernetes metadata is being attached.

I see this error in the logs for the fluentd pod occasionally:

2022-05-07 03:06:17 +0000 [info]: [filter_kube_metadata] Exception encountered parsing pod watch event. The connection might have been closed. Sleeping for 1 seconds and resetting the pod watcher.error reading from socket: An existing connection was forcibly closed by the remote host.

I'm not entirely sure, but it sounds like this might be an error from the metadata filter trying to talk to the API. That's just a guess, though, and it's not entirely clear from the error itself. If that is the case, I'm not sure why it would happen, since nothing in our fluentd configuration has changed on this new containerd node.

jcantrill commented 2 years ago

That message originates here: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/612a5c7311754b728828a1066f1b0fa5b6cd53ab/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb#L56. It reads as though there is either a networking or trust issue between the apiserver and the node running fluentd.

danfinn commented 2 years ago

I'd agree with that assessment; however, nothing has changed other than using containerd on the new nodes. The "old" nodes using the docker runtime send the metadata along just fine.

jcantrill commented 2 years ago

I think the investigation starts from the API server side. It says "closed by remote host", which means to me that fluentd assumes everything is fine and it's the remote that doesn't like something. Maybe there are indicators there which could tell us what this plugin is passing along that the remote doesn't like.

danfinn commented 2 years ago

We are using Azure and unfortunately it seems a bit tricky to get access to the api server logs. I'm working on that now. I did go through all the kubelet logs on the node itself and don't see anything there.

danfinn commented 2 years ago

Can you tell me the actual API request being made? I'm scanning through the code but not seeing it. I'd like to try that manually from the containerd node with curl to see if it fails.

Also, is there a way to turn on debug logging or make it more verbose for troubleshooting?

jcantrill commented 2 years ago

> Can you tell me the actual API request being made? I'm scanning through the code but not seeing it. I'd like to try that manually from the containerd node with curl to see if it fails.

The call is made here, via a ruby library: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb#L115. I can't speak to the logging that is available.

If you want to see straight REST calls you can bump the log level of the various binaries (e.g. oc or kubectl) and they will output detailed calls that can be fed to curl.
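
For example, kubectl's verbosity flag will print the underlying REST calls (the namespace below is a placeholder; verbosity 8 also shows request headers and the response body):

kubectl get pods -n kube-system -v=8
# the output includes lines such as
#   GET https://<apiserver>:443/api/v1/namespaces/kube-system/pods?limit=500
# which can be replayed with curl from inside the fluentd pod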

> Also, is there a way to turn on debug logging or make it more verbose for troubleshooting?

The plugin relies on the fluentd logger and its log levels. Please see the fluentd documentation.
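
As a rough sketch (assuming a stock fluentd deployment; starting fluentd with -vv has the same effect), the global level can be raised in fluent.conf:

<system>
  log_level trace
</system>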

danfinn commented 2 years ago

I was able to confirm that the fluentd pods running on these containerd nodes can hit the k8s api with curl:

curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --header "Authorization: Bearer $secret" -X GET https://10.0.0.1/api -k
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "$long_id.privatelink.eastus.azmk8s.io:443"
    }
  ]
}
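
A closer approximation of what the pod watcher does is a watch on the pods endpoint; the namespace and timeout below are placeholders, and $secret is the same service account token as above:

curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --header "Authorization: Bearer $secret" \
  "https://10.0.0.1/api/v1/namespaces/default/pods?watch=true&timeoutSeconds=5"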

danfinn commented 2 years ago

Those log.trace statements in the code, where do they output to? I don't think I've seen them in the logs from the fluentd containers.

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/d2cfed17cff7e8a18d0df6ea8461bd475cdacb51/lib/fluent/plugin/filter_kubernetes_metadata.rb#L116

Do I need to set the fluentd log level to trace to see these?

This doesn't make a ton of sense. Nothing has changed other than switching to containerd. The fluentd pods are still running as the same service account, which is what determines their level of access to the API, and I have confirmed that they can still access the k8s API with that service account's credentials.
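
(For reference, those log.trace calls only emit once the effective log level is trace; a per-plugin override should be enough, assuming the filter block is keyed on a kubernetes.** tag as in a typical setup:)

<filter kubernetes.**>
  @type kubernetes_metadata
  @log_level trace
</filter>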

danfinn commented 2 years ago

I enabled trace logging, which led me to a few discoveries. The first was that other pods on this same node had metadata attached to their logs. The next thing I realized was that the logs I was looking at that did not have metadata were all coming from the windows event log, which we capture but do not add metadata for. Once I targeted logs that were not captured from the windows event log, I could see that they do indeed have metadata.
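
For context on why those records lacked metadata: the filter only touches records whose tag matches its filter directive, so event log records routed under their own tag never get enriched. An illustrative snippet, not the exact config:

# windows event log records arrive under a separate tag (e.g. winevt.raw)
# and never pass through this filter, so they carry no pod metadata by design
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>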

This can be closed.