fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Add watch 410 Gone response handling #243

Closed: Ghazgkull closed this 4 years ago

Ghazgkull commented 4 years ago

This PR addresses Issue #226 by adding handling for 410 Gone responses to the pod watch handling in this plugin.

I've included a unit test which verifies the expected behavior.

Ghazgkull commented 4 years ago

Please note that after some reflection and further reading, I pushed a new commit which handles the error in a better way. The approach here is to raise a typed exception (GoneError) which can be rescued in set_up_pod_thread in order to restart get_pods_and_start_watcher. This means we'll go back out and fetch the latest resourceVersions, which should allow us to recover gracefully from the 410.
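In outline, the pattern looks something like the following. This is a simplified sketch, not the PR's actual code: GoneError, set_up_pod_thread, and get_pods_and_start_watcher are names from the discussion above, while the notice format and the helper names used here are illustrative.

```ruby
# Simplified sketch of the 410 Gone recovery pattern described above.
class GoneError < StandardError; end

# Inspect a single watch notice. The API server reports an expired
# resourceVersion as an ERROR notice whose status object carries code 410.
def check_notice(notice)
  if notice['type'] == 'ERROR' && notice.dig('object', 'code') == 410
    raise GoneError, '410 Gone: resourceVersion too old'
  end
  notice
end

# Stand-in for the watch loop in set_up_pod_thread: when a GoneError
# surfaces, re-run the list-then-watch step (get_pods_and_start_watcher
# in the plugin) to pick up a fresh resourceVersion instead of letting
# the thread die.
def watch_with_recovery(notices)
  events = []
  notices.each do |notice|
    begin
      events << check_notice(notice)['type']
    rescue GoneError
      events << :restarted_watch # would call get_pods_and_start_watcher here
    end
  end
  events
end
```

Feeding this a MODIFIED notice followed by a 410 ERROR notice yields `['MODIFIED', :restarted_watch]`: normal events flow through, and the expired watch triggers a restart rather than a crash.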

jcantrill commented 4 years ago

@Ghazgkull this LGTM. Do you have anything you can post outside of the unit test to confirm this addresses the issue? Have you built and run it manually?

Ghazgkull commented 4 years ago

@jcantrill I'd love to do more validation, but I would need some guidance.

Prior to submitting this PR, my experience with fluentd was limited to deploying the fluentd-elasticsearch helm chart to my clusters. Testing this change in a real deployment feels like it might be a major undertaking. I imagine I'd have to:

  1. Build and publish this plugin somewhere. I haven't worked in Ruby in years so I don't remember anything about where gems are published and how to publish pre-release versions. I'd have to figure that out.
  2. Fork the fluentd-elasticsearch repo and bump the version of this plugin in their Gemfile, then build and publish that Docker image somewhere. That doesn't sound too bad, as long as their project can be built without a lot of setup.
  3. Update the fluentd-elasticsearch helm chart to use that image. That bit is trivial, because the helm chart lets you specify which image to deploy without modifying the chart.
  4. Hope that the 410 Gone error actually occurs in one of my clusters. Monitor my logs for a day or two and see if the "Restarting watch to reset resource versions" message shows up. I actually have a cluster that's been exhibiting this behavior recently (hence my interest in the fix), so this might be doable.
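For step 2, the Gemfile change might look roughly like this. It's a sketch only: the `git:`/`branch:` source options are standard Bundler, but the fork URL and branch name here are hypothetical.

```ruby
# Hypothetical Gemfile entry for step 2: pull the plugin from a fork
# carrying the fix instead of the released gem. The repository URL and
# branch name below are illustrative, not real.
source 'https://rubygems.org'

gem 'fluent-plugin-kubernetes_metadata_filter',
    git: 'https://github.com/Ghazgkull/fluent-plugin-kubernetes_metadata_filter.git',
    branch: 'gone-handling'
```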
jcantrill commented 4 years ago

There is probably an easier way to test it, but after further reviewing the changes, I don't believe we are any worse off than before, and the unit tests confirm the behavior.