fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Refresh k8s client on 'Unauthorized' exceptions #337

Closed: PettitWesley closed this 2 years ago

PettitWesley commented 2 years ago

Signed-off-by: Wesley Pettit wppttt@amazon.com

#323

PettitWesley commented 2 years ago

@jcantrill Is this design for solving expired tokens acceptable to you? I've been testing it myself and it works. I work for/with Amazon EKS and we have a lot of customers who need some sort of fix for expired tokens. I'm happy to change the code here to whatever would suit your standards.

@jimmidyson
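
For readers following the discussion, the approach in this PR is essentially the reactive pattern below. This is only a rough sketch against the kubeclient gem, not the plugin's actual code; the helper names (create_client, fetch_pod) are hypothetical:

require 'kubeclient'

# Illustrative only: create_client and fetch_pod are made-up names for this sketch.
def create_client
  # Read the (possibly rotated) service account token from its standard in-cluster mount.
  token = File.read('/var/run/secrets/kubernetes.io/serviceaccount/token')
  Kubeclient::Client.new(
    'https://kubernetes.default.svc',
    'v1',
    auth_options: { bearer_token: token },
    ssl_options:  { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }
  )
end

@client = create_client

def fetch_pod(name, namespace)
  @client.get_pod(name, namespace)
rescue Kubeclient::HttpError => e
  raise unless e.error_code == 401
  # On 401 Unauthorized, assume the bound token was rotated: rebuild the client so it
  # re-reads the token file, then retry the request once.
  @client = create_client
  @client.get_pod(name, namespace)
end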

PettitWesley commented 2 years ago

I created this little test script and I am running it on a cluster where the token expiration is set to 1 hour. This should prove whether it works and is safe:

while true
do
  kubectl apply -f app.yaml
  sleep 444
  POD=$(kubectl get pods -o name)
  kubectl label --overwrite  $POD status=hi
  sleep 444
  kubectl label --overwrite  $POD status=testing
  sleep 444
  kubectl label --overwrite  $POD status=hopefullyitworks
  sleep 444
  kubectl apply -f app2.yaml
  kubectl create namespace test
  sleep 444
  kubectl delete -f app.yaml
  POD=$(kubectl get pods -o name)
  kubectl label --overwrite  $POD status=hi
  sleep 444
  kubectl label --overwrite  $POD status=testing
  sleep 444
  kubectl label --overwrite  $POD status=hopefullyitworks
  sleep 444
  kubectl delete -f app2.yaml
  kubectl delete namespace test
  sleep 444
done
PettitWesley commented 2 years ago

@jcantrill I actually didn't think this was ready yet. I am still performing the final long-running tests.

Here is the output after running for almost 100 hours with watch true on a cluster with the token expiration set to 1 hour, so the token would need to be refreshed about 100 times in this run. There are no errors or failures that I can see, and below is the current log output. This was with my script that continually restarts the same 2 pods, creates and deletes a namespace, and changes labels on the pods.

2022-06-26 00:55:17 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:55:17.846993520 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:55:51.846743737 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:55:51 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:56:27 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:56:27.847043383 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:57:03 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:57:03.848472347 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:57:37 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:57:37.846396057 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:13 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:58:13.850007351 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:49.849233695 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:49 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:23.850891448 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:59:23 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:59 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:59.847002656 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:00:35 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:00:35.847178135 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:01:09 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:01:09.847133076 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:01:45 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:01:45.847048708 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:21 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:02:21.845743474 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:55.847117226 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:55 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:03:31.846258958 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:03:31 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:04:07.851858517 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:04:07 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:04:41.846592617 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:04:41 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:18.847502683 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:05:18 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:54 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:54.854894194 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:06:28 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:06:28.848381563 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:07:04 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:07:04.856832648 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:07:40 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:07:40.846404396 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:08:14.846693561 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:08:14 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:08:50 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:08:50.846042861 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:09:26 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:09:26.847326543 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
larivierec commented 2 years ago

This was with my script that continually restarts the same 2 pods, creates and deletes a namespace, and changes labels on the pods.

I'd like a small clarification: do you restart your pods? I don't believe that would be the right way to test, since a restarted pod will always reload the token from the new file.

Would it be possible to test without restarting the pods?

PettitWesley commented 2 years ago

@larivierec

I'd like a small clarification: do you restart your pods?

My script restarts the app pods, not the Fluentd pod. This tests fluentd picking up new pods, which requires a new request to the API server. See the kubectl delete -f app.yaml and kubectl delete -f app2.yaml lines in my script; those are my app pods. My Fluentd pod is fluentd.yaml and it was running continuously without restarts for the 100 hours. I checked.

larivierec commented 2 years ago

Alright thanks for the clarification, just wanted to make sure! 👍

PettitWesley commented 2 years ago

@larivierec @jcantrill Unfortunately, this fix may not fully work. I sincerely apologize for this. There are two reasons why this happened:

  1. The EKS cluster I used for the final testing was set up by my partner on this project, and I didn't fully understand how it works. Apparently, every time the instances refresh, the 1-hour token expiration is removed, so during this round of testing the cluster didn't actually behave the way I thought it did and my results are not valid :( :( I am very sorry about this; my team and I will work to remedy it and do better.
  2. I was expecting that you would wait to merge until I had stated that I had completed the full testing.

Let's fix this. I will post an update on the status within a few hours, once I get the testing cluster working again. Again, I apologize.

EDIT: OK, I might have panicked too much. I think it fully works; I just need to redo the testing to feel safe.

PettitWesley commented 2 years ago

OK, I might have panicked too much. I think it fully works; I just need to redo the testing to feel safe.

larivierec commented 2 years ago

I cannot really confirm that this is working at the moment. The version is properly updated in the gems, so it should be running the latest build. I tested this on a 1.21 EKS cluster, and since with BoundServiceAccountTokenVolume the token on AWS only fully expires after 90 days (I think), there's no way for me to know for sure within a valid time frame.

I know for a fact that I still receive these messages:

annotations.authentication.k8s.io/stale-token
subject: system:serviceaccount:logging:fluentd-es, seconds after warning threshold: 1895
annotations.authorization.k8s.io/decision: allow
annotations.authorization.k8s.io/reason: RBAC: allowed by ClusterRoleBinding "fluentd-es" of ClusterRole "fluentd-es" to ServiceAccount "fluentd-es/logging"
PettitWesley commented 2 years ago

@larivierec Yeah, so this is why my testing was invalid. The code I wrote is reactive, not proactive: it refreshes the client when the token expires, not when it is stale. Since the 1-hour token expiration on my cluster was removed due to a communication mistake, my latest testing isn't valid because the token never expired. (I did do an earlier test with the 1-hour expiration and it worked, but then I made a few changes, and that earlier test didn't cover changing labels and all the other things my script does... hence I think it works, but I can't be certain yet...)
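
To illustrate the reactive/proactive distinction mentioned above: the change here reacts to a 401 by rebuilding the client, while a proactive variant would rebuild whenever the mounted token file rotates, before the API server ever rejects a request. A hypothetical sketch of the proactive side, reusing the illustrative create_client helper from the earlier sketch:

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'

# Hypothetical proactive refresh: rebuild the client whenever the token file's
# mtime changes, instead of waiting for a 401 from the API server.
def client_for_current_token
  mtime = File.mtime(TOKEN_PATH)
  if @client.nil? || @token_mtime != mtime
    @token_mtime = mtime
    @client = create_client
  end
  @client
end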

PettitWesley commented 2 years ago

We have fixed our cluster to expire tokens in 1 hour. So we will know soon if this works. I'm not too worried...

PettitWesley commented 2 years ago

We have now validated that the watch true behavior works... so it should be good... To be safe, I will let this test run a bit longer and then put in a test for watch false overnight.

PettitWesley commented 2 years ago

And we have now validated that watch false fully works! Sorry for the panic! But thorough pre-release testing is always important!

hvaghani221 commented 2 years ago

@PettitWesley From what I have understood, the token will be updated when the client receives an unauthorized error. So it is expected that we will see "seconds after warning threshold" in the API server audit logs once every hour.

smrutimandal commented 2 years ago

@PettitWesley Thanks for this. Just to clarify, does it also mean that for EKS 1.21 the token won't be refreshed for 90 days, since the kube-apiserver doesn't throw a 401 for stale tokens until then?

I am running the following:

eks: 1.21
fluentd image: v1.14.6-debian-1.1
fluent-plugin-kubernetes_metadata_filter gem: 2.11.1
PettitWesley commented 2 years ago

@smrutimandal Correct.

Ga13Ou commented 11 months ago

We have fixed our cluster to expire tokens in 1 hour. So we will know soon if this works. I'm not too worried...

@PettitWesley Can you tell us how you managed to configure your EKS cluster to expire tokens in 1 hour? It seems like such a configuration can only be done at the apiserver level, which is not customizable when using EKS.

PettitWesley commented 11 months ago

Sorry, I no longer remember.

PettitWesley commented 11 months ago

There was some API command that allowed us to set that. Search the docs and GitHub.