Open exarkun opened 6 years ago
Two read-throughs of the code that I think is relevant here didn't yield any enlightenment for me.
A mitigation strategy could be to teach Kubernetes to notice that at least one crawler has died so that the affected storageserver can be restarted automatically. This doesn't fix the fault but it does fix the failure.
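One way the "notice that a crawler has died" part might look is a heartbeat that each crawler updates whenever it makes progress, plus a check that a liveness endpoint could consult. This is only a sketch under assumptions: `CrawlerHeartbeat`, `beat`, and `stalled` are hypothetical names, not anything that exists in the storage server today, and the threshold would have to be longer than the longest legitimate pause between crawl steps.

```python
import time


class CrawlerHeartbeat:
    """Hypothetical helper: remembers when each crawler last made progress."""

    def __init__(self, names, clock=time.monotonic):
        self._clock = clock
        now = clock()
        # Treat process start as the first heartbeat for every crawler.
        self._last = {name: now for name in names}

    def beat(self, name):
        # A crawler would call this each time it finishes a unit of work.
        self._last[name] = self._clock()

    def stalled(self, max_silence):
        # Names of crawlers that have been silent for longer than max_silence seconds.
        now = self._clock()
        return sorted(
            name for name, last in self._last.items() if now - last > max_silence
        )
```

A liveness handler could then report failure whenever `stalled(...)` is non-empty. Kubernetes HTTP liveness probes treat any status code from 200 up to (but not including) 400 as success, so returning something like 500 in that case would cause the container to be restarted.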
There are two crawlers. One is the "bucket" crawler. The other is the "accounting" crawler. They loop forever, inspecting the state of the storage system and performing various bookkeeping. Sometimes, however, they stop looping and stop doing their jobs.
This seems to be accompanied by an error like this (one per crawler):
Apparently there's an errback missing somewhere. Once this happens, the crawlers won't crawl until the process is restarted.
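To illustrate the kind of failure I suspect (this is not the actual crawler code, and it uses no Twisted at all), here is a minimal stdlib-only sketch of a self-rescheduling loop. `Scheduler` and `Crawler` are made-up stand-ins. Without an error handler, one exception in a step means the next step is never scheduled, which looks a lot like "the crawlers won't crawl until the process is restarted".

```python
from collections import deque


class Scheduler:
    """Tiny stand-in for a reactor: runs queued callables one at a time."""

    def __init__(self):
        self.queue = deque()
        self.errors = []

    def call_later(self, fn):
        self.queue.append(fn)

    def run(self, max_steps):
        steps = 0
        while self.queue and steps < max_steps:
            fn = self.queue.popleft()
            try:
                fn()
            except Exception as e:
                # The error is recorded, but nothing reschedules the loop.
                self.errors.append(e)
            steps += 1


class Crawler:
    """Hypothetical crawler whose step reschedules itself after doing work."""

    def __init__(self, scheduler, fail_on, guarded):
        self.scheduler = scheduler
        self.fail_on = fail_on  # which attempt raises
        self.guarded = guarded  # whether step() handles errors itself
        self.attempts = 0
        self.completed = 0
        self.handled = []

    def start(self):
        self.scheduler.call_later(self.step)

    def _work(self):
        self.attempts += 1
        if self.attempts == self.fail_on:
            raise RuntimeError("simulated failure in one crawl step")
        self.completed += 1

    def step(self):
        if self.guarded:
            try:
                self._work()
            except Exception as e:
                # Plays the role of the errback: note the failure, keep going.
                self.handled.append(e)
        else:
            self._work()  # an exception here skips the reschedule below
        self.scheduler.call_later(self.step)
```

With `guarded=False` the loop ends at the first failing step and the scheduler's queue is left empty. With `guarded=True` the failure is recorded and the crawler keeps scheduling itself.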