ManageIQ / kubeclient

A Ruby client for Kubernetes REST API
MIT License

improve informer test #594

Closed by grosser 1 year ago

grosser commented 1 year ago

Making all the threads terminate and no longer killing them ... 3 green CI runs 🤞
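
Roughly, the cleanup style this moves towards (a hedged sketch, not the actual test code; `client` is an assumed Kubeclient::Client and the Informer constructor arguments are illustrative):

```ruby
require "logger"
require "kubeclient"

# Let the informer's threads terminate on their own instead of Thread#kill.
informer = Kubeclient::Informer.new(client, "pods", logger: Logger.new($stdout))
informer.start_worker

# ... exercise informer.list / informer.watch in the test body ...

informer.stop_worker # signal the worker/watch threads to finish cleanly
```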

cben commented 1 year ago

@agrare I don't know if you have been following the Kubeclient::Informer class on master branch, but perhaps it's interesting for future use in manageiq (?). Either way, you may have useful experience to share here from how manageiq is killing & restarting watches... That's a long way of saying "please review" :wink:

[P.S. entirely out-of-scope here, but there is the unfinished business of #488. The blocker there was we didn't know how to implement watch .finish with Faraday :shrug: @agrare I was considering maybe dropping .finish in 5.0 and letting you figure out how to bring it back when you want to upgrade ManageIQ :stuck_out_tongue_closed_eyes: But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

agrare commented 1 year ago

I don't know if you have been following the Kubeclient::Informer class on master branch, but perhaps it's interesting for future use in manageiq (?)

Hey @cben! No, I hadn't seen this. It definitely looks intriguing, though we try not to cache inventory in order to keep our memory usage down, so I doubt we'd use it directly. I can definitely see how this would be useful if your code were only interacting with kubeclient, though.

There could definitely be a more user-friendly interface around watches, since it seems we are solving the same issues.

Either way, you may have useful experience to share here from how manageiq is killing & restarting watches... That's a long way of saying "please review"

I notice a few minor differences in how we handle this compared to the Informer class:

  1. For our initial collection, we don't pass resource_version: '0' to the get_ call.
  2. If a watch breaks unexpectedly, we retry with the same resourceVersion, whereas it looks like the Informer gets the whole collection again (?)
  3. If we get a 410 Gone error, we set resourceVersion to nil and restart the watch.
  4. We use a Concurrent::AtomicBoolean to indicate to the watch thread whether it should break out after watch.finish or retry.

I believe we pulled this logic from the kubevirt/cnv ManageIQ provider, which was contributed by the kubevirt team. I'm not saying this is better or worse than the Informer implementation, just noting the differences.
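
For comparison, here is a rough sketch of the retry loop described above (assumptions: `client` is a Kubeclient::Client and pods is the watched collection; this mirrors the description, not ManageIQ's actual code):

```ruby
require "concurrent"
require "kubeclient"

finished = Concurrent::AtomicBoolean.new(false) # (4) flipped when the thread should exit

pods = client.get_pods                          # (1) initial list, no resource_version: '0'
resource_version = pods.resourceVersion

until finished.true?
  begin
    opts = {}
    opts[:resource_version] = resource_version if resource_version
    watcher = client.watch_pods(opts)
    watcher.each do |notice|
      resource_version = notice.object.metadata.resourceVersion
      # ... handle notice.type / notice.object ...
    end
    # (2) the watch ended unexpectedly: loop around and retry with the same resourceVersion
  rescue Kubeclient::HttpError => e
    raise unless e.error_code == 410
    resource_version = nil                      # (3) 410 Gone: restart without a resourceVersion
  end
end
```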

But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

Yes, we do use #finish to get a watch to break out so that we can join cleanly, but we join with a timeout and .kill the thread if it doesn't respond. We also aren't trying to keep a cache consistent from this thread, though.
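
A minimal sketch of that shutdown pattern (the names `watcher` and `watch_thread` and the 30-second timeout are illustrative, not our actual code):

```ruby
# `watcher` is assumed to come from a kubeclient watch call, e.g. client.watch_pods.
watch_thread = Thread.new do
  watcher.each do |notice|
    # ... process notice ...
  end
end

# On shutdown:
watcher.finish                # asks the blocking #each to break out
unless watch_thread.join(30)  # wait up to 30 seconds for a clean exit
  watch_thread.kill           # last resort if the thread doesn't respond
end
```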

What is the purpose of the waiter thread? It looks like it'll stop the watch after 15 minutes?

grosser commented 1 year ago

I think this is a good step forward, so I'd prefer to merge and then iterate further if there are still open issues.

cben commented 1 year ago

Sorry, missed your reply entirely.

FWIW, Thread.pass is advisory, so I'm not sure it can really "fix" things:

Give the thread scheduler a hint to pass execution to another thread. A running thread may or may not switch, it depends on OS and processor.
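
A tiny plain-Ruby illustration (not from this PR) of why that matters: the waiting loop below may still spin a few times before the worker thread actually runs, because Thread.pass is only a hint.

```ruby
done = false

worker = Thread.new do
  sleep 0.1    # simulate some work
  done = true
end

Thread.pass until done # yield politely; no guarantee the worker runs next
worker.join
```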

I'm still somewhat concerned that a call to stop_worker might get stuck for a long time.
But yes, the previous behavior was chaotic, so I guess it's an advance. :+1:

And I'm certainly happy about CI looking reliably green now :tada: Merging.