dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
54 stars 30 forks source link

Weird rpkid connection failures #82

Closed sraustein closed 13 years ago

sraustein commented 13 years ago

Child rpkid problems persisted after cleaning up (I think) the pseudo-RIR BPKI problems detailed in #66, #67, #68, #69. cron keepalive code is triggering, and tcpdump shows no attempt to contact parent.

Last mention of pseudo-ARIN in ISC rpkid is:

{{{ 2011-09-03 13:56:14 rpkid[1138]: : Timeout while in state request-sent 2011-09-03 13:56:14 rpkid[1138]: Unhandled exception from timer: 2011-09-03 13:56:14 rpkid[1138]: Exception caught in errback() at /u/sra/subvert-rpki.hactrn.net/rpkid/rpki/async.py:189 called from /u/sra/subvert-rpki.hactrn.net/rpkid/rpki/a\ sync.py:212 2011-09-03 13:56:14 rpkid[1138]: Traceback (most recent call last): 2011-09-03 13:56:14 rpkid[1138]: File "/u/sra/subvert-rpki.hactrn.net/rpkid/rpki/async.py", line 208, in runq 2011-09-03 13:56:14 rpkid[1138]: t.handler() 2011-09-03 13:56:14 rpkid[1138]: File "/u/sra/subvert-rpki.hactrn.net/rpkid/rpki/http.py", line 804, in handle_timeout 2011-09-03 13:56:14 rpkid[1138]: raise rpki.exceptions.HTTPTimeout 2011-09-03 13:56:14 rpkid[1138]: HTTPTimeout }}}

I don't see why this would block all future attempts to contact parent, but maybe it left an HTTP client queue in some odd state, or something. We lack visibility into the running daemon, which is a problem in its own right (and perhaps should get its own ticket).

cron keep-alive code was already active by this point.

BPKI-related failures had started happening hours previously (2011-09-03 06:17:20).

Trac ticket #70 component rpkid priority minor, owner sra, created by sra on 2011-09-05T21:04:58Z, last modified 2011-09-07T03:30:34Z

sraustein commented 13 years ago

Seems to be running properly after restarting child rpkid.

Trac comment by sra on 2011-09-05T21:11:48Z

sraustein commented 13 years ago

Keep-alive started triggering on work0 about ten seconds after it did on adrilankha (first keepalive log message on adrilankha at 2011-09-03 13:43:14, first on work0 at 2011-09-03 13:43:04). work0 runs its own pubd, so the only obvious common path here is the pseudo-RIR rpkid on srv0.

...and looking at the log snippet in #69, we see that the pseudo-RIR rpkid went bananas at 2011-09-03 13:41:07, at which point it was in a spin loop doing a bad select(), screaming about it, and not doing one heck of a lot else.

So from the child side, this is some kind of failure to handle connection failures to a parent who has gone comatose without having the grace to close the TCP listener, send error messages, or otherwise give us a clue. Not sure why child didn't see something eventually when parent's accept() quota was filled, but the fact that this condition persists after the parent has been fixed suggests that the problem somehow triggers the child into some weird state from which the child cannot find its way home.

Trac comment by sra on 2011-09-05T21:33:06Z

sraustein commented 13 years ago

Testing confirms that, at least in the Python API, socket.connect() on a non-blocking socket will not result in any kind of timeout if the corresponding server listener socket is open but not being serviced. This is probably limited to the first n connections where n is the argument the server gave socket.listen(), after that I would expect further connections to be rejected per the usual kernel behavior.

So we need to set a timeout around the .connect() call in rpki.http.http_client.gotaddrinfo().

Trac comment by sra on 2011-09-06T17:59:09Z

sraustein commented 13 years ago

Timeout added [3971]. Leaving ticket open for now, really should figure out a way to test the code with and without this patch to confirm that it solves the problem.

Trac comment by sra on 2011-09-06T18:12:44Z

sraustein commented 13 years ago

Ok, [3971] does solve the immediate hung connection problem. Specifically, it puts us back at the nasty HTTPTimeout traceback shown above, which is an improvement in that we then try again, but we really should handle this more gracefully.

Trac comment by sra on 2011-09-06T21:32:20Z

sraustein commented 13 years ago

Also: having the timeout interval be longer than the keepalive retry interval means that we will gradually accumulate a backlog of queued connection requests. This is probably bad.

Current client timeout is probably too long anyway. I can't think of a sane reason why it would take more than five minutes for the server to answer.

Trac comment by sra on 2011-09-06T21:35:14Z

sraustein commented 13 years ago

Fixed handling of HTTPTimeout so it now makes it back to the client. [3974]

With this fix rpkid does seem to recover from having its parent go catatonic; tested manually with scripts/fakerootd.py and yamltest.py -k smoketest.6.yaml.

Trac comment by sra on 2011-09-07T03:30:34Z

sraustein commented 13 years ago

Closed with resolution fixed