Closed sraustein closed 13 years ago
In [4008]: {{{
Catch IOError when dumping bad CMS (see #94). }}}
Trac comment by sra on 2011-10-04T13:51:58Z
Presumably keep-alive is triggering due to dropped event.
I see two problems in rpki.http.http_queue.return_result():
Trac comment by sra on 2011-10-04T14:11:31Z
Er, except that the irdbd case is a traceback() call in rpki.irdbd.handler(). Hmm. Apparently I used to think putting tracebacks in the log at every opportunity was a good idea. Sigh.
Trac comment by sra on 2011-10-04T17:23:21Z
In [4009]: {{{
Rework http_queue exception handling not to drop the event chain if callback handler throws an exception, eg, due to CMS validation failure. This closes #94. }}}
Trac comment by sra on 2011-10-04T18:06:48Z
Spewing on stderr is not an improvement. Break dumpasn1 output up into chunks small enough to feed to syslog if we're going to use it at all, and perhaps use OpenSSL's icky dump format instead rather than relying on dumpasn1 being installed to get CMS failure logging.
Trac comment by sra on 2011-10-04T20:14:04Z
In [4010]: {{{
Don't spew to stderr if dumpasn1 isn't available when we want to dump CMS that doesn't validate (see #94). Switch CMS-dumping code to use OpenSSL library code rather than dumpasn1 -- dumpasn1 is prettier, but not enough prettier to be worth making people install yet another freaking program that's only used to diagnose strange failures. }}}
Trac comment by sra on 2011-10-04T20:40:04Z
Well, we probably nailed the problem that was causing event chain drop, anyway. Logs are inconclusive due to syslogd flakeout.
BPKI expiration seems like one of a handful of CMS validation failures that might be worth special diagnoisis rather than just dumping the offending CMS. In theory this is not difficult: if CMS fails, we attempt to extract the cert (if that fails, we know what the problem is!) and check its notAfter against current time.
Randy suggests that perhaps we should be warning about impending BPKI expiration. We'd probably want to integrate this with the CMS replay protection and keep state in SQL to avoid whining about it too much.
Trac comment by sra on 2011-10-04T21:25:28Z
Doh, no, we did fix the event chain drop, logs do show it: dumpasn1 vs OpenSSL log of CMS failure wasn't the issue, it was failure to write whichever one of those we were using today, and that now works. Good.
Trac comment by sra on 2011-10-04T21:27:51Z
In [4011]: {{{
Add explicit check for certificate expiration in CMS verify (see #94). }}}
Trac comment by sra on 2011-10-04T22:56:29Z
In [4012]: {{{
More explicit certificate expiration checks in CMS verify (see #94). }}}
Trac comment by sra on 2011-10-04T23:33:29Z
OK, other than some lingering silly tracebacks (issue moved to #95) and the dumb way we (fail to) deal with which should be in exception instances (issue moved to #96), the latest code (as of [4011]) seems to handle BPKI expiration relatively cleanly. It indicates the problem and gets out, without breaking the event chain.
This does not yet do anything about Randy's wish for advance warning of BPKI certificate expiration. I'm not sure (either way) that this functionality belongs in the daemons. Need to think about it more, and is somewhat separable from today's problem, so I've moved it to yet another separate ticket (#97) to keep this from turning into the ticket that ate Tokyo.
Trac comment by sra on 2011-10-05T00:11:21Z
Closed with resolution fixed
Proximate cause of this is BPKI expiration on work0, but why these symptoms?
{{{ Date: Tue, 04 Oct 2011 11:26:54 +0900 From: Randy Bush randy@psg.com
Oct 4 00:33:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:35:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:37:10 work0 rpkid[41032]: cron already running, keepalive will expire at 2011-10-04T00:39:10Z Oct 4 00:39:10 work0 rpkid[41032]: cron keepalive threshold 2011-10-04T00:39:10Z has expired, breaking lock Oct 4 00:39:11 work0 rpkid[41032]:: Error on HTTP client connection arin.rpki.net:4404 <class 'rpki.exceptions.HTTPRequestFailed'> HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 rpkid[41032]: : Closing due to error
Oct 4 00:39:11 work0 rpkid[41032]: : Returning exception HTTPRequestFailed('HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response ',) to caller: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 rpkid[41032]: Couldn't get resource class list from parent <rpki.left_right.parent_elt object at 0x804c6fdd0>, skipping: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 rpkid[41032]: : Error on HTTP client connection arin.rpki.net:4404 <class 'rpki.exceptions.HTTPRequestFailed'> HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 rpkid[41032]: : Closing due to error
Oct 4 00:39:11 work0 rpkid[41032]: : Returning exception HTTPRequestFailed('HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response ',) to caller: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 rpkid[41032]: Couldn't get resource class list from parent <rpki.left_right.parent_elt object at 0x804c6fe10>, skipping: HTTP request failed with status 400, reason Could not process PDU: CMS verification failed, response
Oct 4 00:39:11 work0 irdbd[41031]: Exception caught in handler() at /usr/local/lib/python2.6/site-packages/rpki/irdbd.py:177 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:535
Oct 4 00:39:11 work0 irdbd[41031]: Traceback (most recent call last):
Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/irdbd.py", line 150, in handler
Oct 4 00:39:11 work0 irdbd[41031]: q_msg = rpki.left_right.cms_msg(DER = query).unwrap((self.bpki_ta, self.rpkid_cert))
Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 1236, in unwrap
Oct 4 00:39:11 work0 irdbd[41031]: self.verify(ta)
Oct 4 00:39:11 work0 irdbd[41031]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 962, in verify
Oct 4 00:39:11 work0 irdbd[41031]: sys.stderr.write("CMS verification failed, dumping ASN.1 (%d octets):\n%s\n" % (len(self.get_DER()), dbg))
Oct 4 00:39:11 work0 irdbd[41031]: IOError: [Errno 6] Device not configured
Oct 4 00:39:11 work0 rpkid[41032]: Exception caught in return_result() at /usr/local/lib/python2.6/site-packages/rpki/http.py:923 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:777
Oct 4 00:39:11 work0 rpkid[41032]: Exception caught in return_result() at /usr/local/lib/python2.6/site-packages/rpki/http.py:923 called from /usr/local/lib/python2.6/site-packages/rpki/http.py:777
Oct 4 00:39:11 work0 rpkid[41032]: Traceback (most recent call last):
Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/http.py", line 914, in return_result
Oct 4 00:39:11 work0 rpkid[41032]: req.callback(result.body)
Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/rpkid.py", line 159, in unwrap
Oct 4 00:39:11 work0 rpkid[41032]: r_msg = r_cms.unwrap((self.bpki_ta, self.irdb_cert))
Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 1236, in unwrap
Oct 4 00:39:11 work0 rpkid[41032]: self.verify(ta)
Oct 4 00:39:11 work0 rpkid[41032]: File "/usr/local/lib/python2.6/site-packages/rpki/x509.py", line 962, in verify
Oct 4 00:39:11 work0 rpkid[41032]: sys.stderr.write("CMS verification failed, dumping ASN.1 (%d octets):\n%s\n" % (len(self.get_DER()), dbg))
Oct 4 00:39:11 work0 rpkid[41032]: IOError: [Errno 6] Device not configured
}}}
The IOError exception might conceivably be due to attempting to write to sys.stderr while running as a detached daemon which has no stderr. Worth testing, anyway
Trac ticket #94 component rpkid priority minor, owner sra, created by sra on 2011-10-04T03:37:04Z, last modified 2011-10-05T00:11:21Z