Closed conneryn closed 1 year ago
Hi @conneryn! Thanks again for your detailed bug (& fix!). Gonna review it in a bit
PR #29 is good to go / merged. I'll prepare a new release after the build finishes
Sorry for the delay. I've just published the fix under 0.7.1 (helm chart tor-controller-0.1.7)
Describe the bug After running an OnionBalancedService for a period of time, eventually the onion address is no longer resolvable. Attempting to reach my onion service via the Tor Browser returns:
All "obb" pods appear to be working as expected, but the "daemon" pod potentially has deadlocked after a restart (see below for details). Deleting the daemon pod, and allowing it to be recreated/restarted resolves the issue.
To Reproduce I have not figured out specific steps to reproduce this yet, other than waiting long enough. Although, I have a suspicion it happens when the pod restarts itself (I will continue to try and narrow down more specific repro steps).
Expected behavior The onion service should always be available as long as the daemon and obb pods are running.
Additional information
Logs from the onionbalance container of the daemon pod:
NOTE: the actual time is now 8 hours later, so onionbalance has not logged any additional activity for quite some time (deadlock?).
On a successful launch, I see something along the lines of:
System (please complete the following information):
Additional context This does not happen often, but it has occurred 4 or 5 times over the past ~3 months. Anecdotally, I believe the last few occurrences were during or shortly after system upgrades on my cluster (e.g. upgrading Kubernetes, or restarting nodes), when lots of pods are bouncing around.
The remedy is simple (manually restart the daemon pod), but an automated fix would be preferred. If actually resolving the deadlock (if that's truly the issue...) is overly complex to diagnose at this time, I wonder if an easier fix might be to simply add a probe that can properly detect this condition? Any thoughts on how I could do this?
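One possible shape for such a probe, as a rough sketch only: since the symptom is that onionbalance stops logging, a liveness check could fail once a log or heartbeat file has not been modified for some time. This assumes the onionbalance output is also written to a file inside the container, which I have not verified. The path `/tmp/onionbalance.log`, the one-hour threshold, and the names `is_stale` and `main` are all hypothetical.

```python
import os
import sys
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as unhealthy.
        return True
    return (now - mtime) > max_age_seconds


def main(argv):
    # Hypothetical defaults: adjust the path and threshold to the real setup.
    path = argv[1] if len(argv) > 1 else "/tmp/onionbalance.log"
    max_age = float(argv[2]) if len(argv) > 2 else 3600.0
    # Exit code 0 means healthy, 1 means the file looks stale.
    return 1 if is_stale(path, max_age) else 0

# As a probe entry point, the script would end with: sys.exit(main(sys.argv))
```

Run from an exec-style liveness probe, a non-zero exit code would get the container restarted, which is the same remedy as deleting the pod by hand. The threshold would need to be longer than the longest quiet period of a healthy onionbalance, otherwise the probe would restart a working daemon.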