dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
54 stars 30 forks source link

Still more fun with BPKI failures #106

Closed sraustein closed 13 years ago

sraustein commented 13 years ago

Proximate cause of this is BPKI expiration on work0, but why these symptoms?

  1. Keep-alive has triggered. Why?
  2. IOError tracebacks. Why?

{{{ Date: Tue, 04 Oct 2011 11:26:54 +0900 From: Randy Bush randy@psg.com

Oct 4 00:33:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:35:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:37:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:39:10 work0 rpkid[41032]: cron keepalive threshold 2011-10-04T00:39:10Z has expired, breaking lock Oct 4 00:39:11 work0 rpkid[41032]: : Error on HTTP client connection arin.rpki.net:4404 <class 'rpki.exceptions.HTTPRequestFailed'> HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 rpkid[41032]: : Closing due to error Oct 4 00:39:11 work0 rpkid[41032]: : Returning exception HTTPRequestFailed('HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response ',) to caller: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 rpkid[41032]: Couldn't get resource class list from parent <rpki.left_right.parent_elt object at 0x804c6fdd0>, skipping: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 rpkid[41032]: : Error on HTTP client connection arin.rpki.net:4404 <class 'rpki.exceptions.HTTPRequestFailed'> HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 rpkid[41032]: : Closing due to error Oct 4 00:39:11 work0 rpkid[41032]: : Returning exception HTTPRequestFailed('HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response ',) to caller: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 rpkid[41032]: Couldn't get resource class list from parent <rpki.left_right.parent_elt object at 0x804c6fe10>, skipping: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response Oct 4 00:39:11 work0 irdbd[41031]: Exception caught in handler() at /usr/local/lib/python2.6/site-packages/rpki/irdbd.py:177 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:535 Oct 4 00:39:11 work0 irdbd[41031]: Traceback (most recent call last): Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/irdbd.py", line 150, in handler Oct 4 00:39:11 work0 irdbd[41031]: q_msg = rpki.left_right.cms_msg(DER = query).unwrap((self.bpki_ta, self.rpkid_cert)) Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 1236, in unwrap Oct 4 00:39:11 work0 irdbd[41031]: self.verify(ta) Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 962, in verify Oct 4 00:39:11 work0 irdbd[41031]: sys.stderr.write("CMS verification failed, dumping ASN.1 (%d octets):\n%s\n" % (len(self.get_DER()), dbg)) Oct 4 00:39:11 work0 irdbd[41031]: IOError: [Errno 6] Device not configured Oct 4 00:39:11 work0 rpkid[41032]: Exception caught in return_result() at /usr/local/lib/python2.6/site-packages/rpki/http.py:923 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:777 Oct 4 00:39:11 work0 rpkid[41032]: Exception caught in return_result() at /usr/local/lib/python2.6/site-packages/rpki/http.py:923 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:777 Oct 4 00:39:11 work0 rpkid[41032]: Traceback (most recent call last): Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/http.py", line 914, in return_result Oct 4 00:39:11 work0 rpkid[41032]: req.callback(result.body) Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/rpkid.py", line 159, in unwrap Oct 4 00:39:11 work0 rpkid[41032]: r_msg = r_cms.unwrap((self.bpki_ta, self.irdb_cert)) Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 1236, in unwrap Oct 4 00:39:11 work0 rpkid[41032]: self.verify(ta) Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 962, in verify Oct 4 00:39:11 work0 rpkid[41032]: sys.stderr.write("CMS verification failed, dumping ASN.1 (%d octets):\n%s\n" % (len(self.get_DER()), dbg)) Oct 4 00:39:11 work0 rpkid[41032]: IOError: [Errno 6] Device not configured }}}

The IOError exception might conceivably be due to attempting to write to sys.stderr while running as a detached daemon which has no stderr. Worth testing, anyway

Trac ticket #94 component rpkid priority minor, owner sra, created by sra on 2011-10-04T03:37:04Z, last modified 2011-10-05T00:11:21Z

sraustein commented 13 years ago

In [4008]: {{{

!CommitTicketReference repository="" revision="4008"

Catch IOError when dumping bad CMS (see #94). }}}

Trac comment by sra on 2011-10-04T13:51:58Z

sraustein commented 13 years ago

Presumably keep-alive is triggering due to dropped event.

I see two problems in rpki.http.http_queue.return_result():

  1. Exceptions raised in req.callback() will end up in "unhandled exception from callback" catch-all, which looks like it drops the event. Either exceptions from req.callback() should fail over to req.errback(), or code to which those callbacks refer is responsible for catching all errors.
  2. Logging level of "Unhandled exception from callback" message is too low to show up in Randy's log. I guess this is not so critical, as any traceback showing up in log ought to be considered abnormal, but it confused me when trying to confirm origin of the exception.

Trac comment by sra on 2011-10-04T14:11:31Z

sraustein commented 13 years ago

Er, except that the irdbd case is a traceback() call in rpki.irdbd.handler(). Hmm. Apparently I used to think putting tracebacks in the log at every opportunity was a good idea. Sigh.

Trac comment by sra on 2011-10-04T17:23:21Z

sraustein commented 13 years ago

In [4009]: {{{

!CommitTicketReference repository="" revision="4009"

Rework http_queue exception handling not to drop the event chain if callback handler throws an exception, eg, due to CMS validation failure. This closes #94. }}}

Trac comment by sra on 2011-10-04T18:06:48Z

sraustein commented 13 years ago

Spewing on stderr is not an improvement. Break dumpasn1 output up into chunks small enough to feed to syslog if we're going to use it at all, and perhaps use OpenSSL's icky dump format instead rather than relying on dumpasn1 being installed to get CMS failure logging.

Trac comment by sra on 2011-10-04T20:14:04Z

sraustein commented 13 years ago

In [4010]: {{{

!CommitTicketReference repository="" revision="4010"

Don't spew to stderr if dumpasn1 isn't available when we want to dump CMS that doesn't validate (see #94). Switch CMS-dumping code to use OpenSSL library code rather than dumpasn1 -- dumpasn1 is prettier, but not enough prettier to be worth making people install yet another freaking program that's only used to diagnose strange failures. }}}

Trac comment by sra on 2011-10-04T20:40:04Z

sraustein commented 13 years ago

Well, we probably nailed the problem that was causing event chain drop, anyway. Logs are inconclusive due to syslogd flakeout.

BPKI expiration seems like one of a handful of CMS validation failures that might be worth special diagnoisis rather than just dumping the offending CMS. In theory this is not difficult: if CMS fails, we attempt to extract the cert (if that fails, we know what the problem is!) and check its notAfter against current time.

Randy suggests that perhaps we should be warning about impending BPKI expiration. We'd probably want to integrate this with the CMS replay protection and keep state in SQL to avoid whining about it too much.

Trac comment by sra on 2011-10-04T21:25:28Z

sraustein commented 13 years ago

Doh, no, we did fix the event chain drop, logs do show it: dumpasn1 vs OpenSSL log of CMS failure wasn't the issue, it was failure to write whichever one of those we were using today, and that now works. Good.

Trac comment by sra on 2011-10-04T21:27:51Z

sraustein commented 13 years ago

In [4011]: {{{

!CommitTicketReference repository="" revision="4011"

Add explicit check for certificate expiration in CMS verify (see #94). }}}

Trac comment by sra on 2011-10-04T22:56:29Z

sraustein commented 13 years ago

In [4012]: {{{

!CommitTicketReference repository="" revision="4012"

More explicit certificate expiration checks in CMS verify (see #94). }}}

Trac comment by sra on 2011-10-04T23:33:29Z

sraustein commented 13 years ago

OK, other than some lingering silly tracebacks (issue moved to #95) and the dumb way we (fail to) deal with which should be in exception instances (issue moved to #96), the latest code (as of [4011]) seems to handle BPKI expiration relatively cleanly. It indicates the problem and gets out, without breaking the event chain.

This does not yet do anything about Randy's wish for advance warning of BPKI certificate expiration. I'm not sure (either way) that this functionality belongs in the daemons. Need to think about it more, and is somewhat separable from today's problem, so I've moved it to yet another separate ticket (#97) to keep this from turning into the ticket that ate Tokyo.

Trac comment by sra on 2011-10-05T00:11:21Z

sraustein commented 13 years ago

Closed with resolution fixed