LeastAuthority / leastauthority.com

Least Authority S4
https://leastauthority.com/

Sometimes the crawlers in the storage servers stop crawling #686

Open exarkun opened 6 years ago

exarkun commented 6 years ago

There are two crawlers: the "bucket" crawler and the "accounting" crawler. Each is supposed to loop indefinitely, inspecting the state of the storage system and performing various bookkeeping. Sometimes, however, they stop looping and stop doing their jobs.

This seems to be accompanied by an error like this (one per crawler):

2018-01-11T10:01:30+0000 [HTTP11ClientProtocol,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>,
 <twisted.python.failure.Failure twisted.web.http._DataLoss: >]

Apparently there's an errback missing somewhere. Once this happens, the crawlers won't crawl until the process is restarted.
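To illustrate the failure mode described above, here is a minimal sketch of a loop that dies on its first unhandled error next to one that records the error and keeps going. It uses the standard library's asyncio rather than Twisted so that it is self-contained. The names `crawl_once`, `fragile_crawler`, `resilient_crawler`, and the local `ResponseFailed` class are hypothetical stand-ins, not the real crawler code, and where the missing errback actually lives is still unknown.

```python
import asyncio


class ResponseFailed(Exception):
    """Hypothetical stand-in for twisted.web._newclient.ResponseFailed."""


async def crawl_once(cycle, fail_on):
    # Pretend to do one pass of bookkeeping; fail on one chosen cycle.
    if cycle == fail_on:
        raise ResponseFailed("connection lost")
    return cycle


async def fragile_crawler(cycles, fail_on, done):
    # No error handling: the first failure escapes and the loop never resumes.
    for cycle in range(cycles):
        done.append(await crawl_once(cycle, fail_on))


async def resilient_crawler(cycles, fail_on, done, errors):
    # The equivalent of adding an errback: record the failure, keep looping.
    for cycle in range(cycles):
        try:
            done.append(await crawl_once(cycle, fail_on))
        except ResponseFailed as exc:
            errors.append(exc)


def run_demo():
    fragile_done, resilient_done, errors = [], [], []

    async def main():
        # return_exceptions=True keeps the fragile loop's failure from
        # aborting the demo itself.
        await asyncio.gather(
            fragile_crawler(5, 2, fragile_done), return_exceptions=True
        )
        await resilient_crawler(5, 2, resilient_done, errors)

    asyncio.run(main())
    return fragile_done, resilient_done, len(errors)
```

In this sketch the fragile loop completes only the cycles before the failure, while the resilient loop skips the failed cycle and finishes the rest.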

exarkun commented 6 years ago

Two read-throughs of the code that I think is relevant here didn't yield any enlightenment.

A mitigation strategy could be to teach Kubernetes to notice when at least one crawler has died, so that the affected storage server can be restarted automatically. This doesn't fix the fault, but it does fix the failure.
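One way this mitigation might look, as a rough sketch only: each crawler records a heartbeat when it completes a cycle, and a health check reports failure once any heartbeat goes stale. A Kubernetes liveness probe (HTTP or exec) could then call that check, and the kubelet would restart the container when it fails. The `CrawlerHeartbeat` class, `all_crawlers_alive` function, and the 60-second threshold are all assumptions made for illustration; nothing like them exists in the storage server today as far as I know.

```python
import time


class CrawlerHeartbeat:
    """Hypothetical helper: tracks when a crawler last made progress."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the check can be tested
        self._last_beat = clock()

    def beat(self):
        # The crawler would call this at the end of every cycle.
        self._last_beat = self._clock()

    def is_alive(self):
        # Stale if the crawler has been silent longer than allowed.
        return (self._clock() - self._last_beat) <= self._max_silence


def all_crawlers_alive(heartbeats):
    # A liveness endpoint would return success only when this is True.
    return all(hb.is_alive() for hb in heartbeats.values())


# Usage with a fake clock, so no real waiting is needed.
now = [0.0]
heartbeats = {
    "bucket": CrawlerHeartbeat(60, clock=lambda: now[0]),
    "accounting": CrawlerHeartbeat(60, clock=lambda: now[0]),
}
now[0] = 100.0             # 100 seconds pass
heartbeats["bucket"].beat()  # only the bucket crawler is still running
healthy = all_crawlers_alive(heartbeats)  # False: accounting is stale
```

The threshold would need to be longer than the slowest legitimate crawl cycle, otherwise a healthy but busy storage server would be restarted.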