dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
54 stars 30 forks source link

Bad behavior with empty RFC 3779 extensions #421

Closed sraustein closed 11 years ago

sraustein commented 11 years ago

im says (via Alex)

It seems that our parser decides that the manifest EE certificate lacks any resources (neither inherited resources, nor explicitly specified resources found). I am having a little trouble parsing RFC3779 but I think that this is not allowed by the standard. Which I why we get the strong reject in "X509ResourceCertificate.parseResourceExtensions".

And for manifests, the EE certificate MUST use inherit: http://tools.ietf.org/html/rfc6486#section-5.1

So once we have fixed our error reporting as mentioned above a more proper error message to the user would be that the manifest is rejected because it does not use inherited resources.

There is of course an off chance that the error is actually in our parser, but I have no indication of that. I would expect a more dramatic, lower level ASN.1 parsing failure in that case. So can you please check this with Randy first and if they are sure that are using inherit we can do lower level debugging.

Trac ticket #406 component rpkid priority blocker, owner sra, created by randy on 2013-01-29T15:43:58Z, last modified 2013-03-15T16:12:11Z

sraustein commented 11 years ago

It seems that our parser decides that the manifest EE certificate lacks any resources (neither inherited resources, nor explicitly specified resources found).

Whoops!

{{{ $ openssl cms -verify -noverify -inform DER -out /dev/null -certsout /dev/stdout -in root.mft 2>/dev/null | openssl x509 -noout -text -certopt no_sigdump,no_pubkey Certificate: Data: Version: 3 (0x2) Serial Number: 10 (0xa) Signature Algorithm: sha256WithRSAEncryption Issuer: CN=altCA RPKI root certificate Validity Not Before: Jan 24 03:14:06 2013 GMT Not After : Feb 23 03:14:06 2013 GMT Subject: CN=B16FCBD36852CCBFC9944B89B5C01395435021BE X509v3 extensions: X509v3 Subject Key Identifier: B1:6F:CB:D3:68:52:CC:BF:C9:94:4B:89:B5:C0:13:95:43:50:21:BE X509v3 Authority Key Identifier: keyid:62:1A:08:9F:EB:49:26:CF:54:54:E9:80:F4:CF:59:EE:1A:AA:A6:65

        X509v3 Certificate Policies: critical
            Policy: 1.3.6.1.5.5.7.14.2

        X509v3 CRL Distribution Points: 

            Full Name:
              URI:rsync://ca0.rpki.net/rpki/root.crl

        Authority Information Access: 
            CA Issuers - URI:rsync://ca0.rpki.net/rpki/root.cer

        X509v3 Key Usage: critical
            Digital Signature
        Subject Information Access: 
            1.3.6.1.5.5.7.48.11 - URI:rsync://ca0.rpki.net/rpki/root.mft

        sbgp-ipAddrBlock: critical

}}}

So, the obvious two questions are:

  1. How did Randy manage to generate this, and
  2. Why didn't rcynic reject it?

Investigating.

I am having a little trouble parsing RFC3779 but I think that this is not allowed by the standard.

Correct.

And for manifests, the EE certificate MUST use inherit: http://tools.ietf.org/html/rfc6486#section-5.1

Also correct.

So once we have fixed our error reporting as mentioned above a more proper error message to the user would be that the manifest is rejected because it does not use inherited resources.

Yep.

There is of course an off chance that the error is actually in our parser, but I have no indication of that. I would expect a more dramatic, lower level ASN.1 parsing failure in that case. So can you please check this with Randy first and if they are sure that are using inherit we can do lower level debugging.

No, I'm seeing the problem too, it's just rcynic that's not.

Trac comment by sra on 2013-01-29T19:04:20Z

sraustein commented 11 years ago

rsync://repo0.rpki.net/rpki/root/root.mft is OK, so generation side of this is not as simple as rootd always generating garbage.

I suspect Randy found some previously unknown way to break rootd that never would have occurred to me (ie, still a bug, just not one I ever would have triggered).

Taking a wild guess: Randy generated root.cer without resources, ran rootd and rpkid, then went back and fixed root.cer. This could conceivably have left a PKCS10 without resources sitting around, which would still be considered valid if none of the keys had changed.

rootd bug here would be not rejecting malformed RFC 3779 (option present but empty is not legal).

Trac comment by sra on 2013-01-29T19:18:55Z

sraustein commented 11 years ago

[Thoughts while I was driving carpool and couldn't debug anything...]

I suspect taht part of the problem here is that the OpenSSL RFC 3779 code was written against, well, RFC 3779, not the subsequent RPKI RFCs. If RFC 3779 states anywhere that a null resource set is illegal in an IPAddrBlocks or ASIdentifiers extension, I can't find it. Subsequent discussion with Steve Kent suggests that this is an omission from RFC 3779 and that the authors intended such extensions to be illegal, but that boots us nothing.

If, as I suspect, the problem is that rcynic is trusting the underlying OpenSSL library to catch this while the OpenSSL library is implementing RFC 3779, we probably want to fix it in both places. Fixing rcynic should be straightforward, and we can do that quickly. Fixing the OpenSSL library code will take a while, the patch is likely to be very small but it won't make it into distributions for a while.

So long as rcynic catches the problem, we're probably alright.

OK, back to tracing through the validation code.

Trac comment by sra on 2013-01-29T21:42:17Z

sraustein commented 11 years ago

In [changeset:4981]: {{{

!CommitTicketReference repository="" revision="4981"

Check for empty RFC 3779 extensions. See #406. }}}

Trac comment by sra on 2013-01-29T23:23:11Z

sraustein commented 11 years ago

so how do i clean up the cert?

Trac comment by randy on 2013-01-29T23:27:00Z

sraustein commented 11 years ago

In [changeset:4983]: {{{

!CommitTicketReference repository="" revision="4983"

rpki.x509.X509._issue() wasn't passing inheritance flags into rpki.POW.X509.setRFC3779(), which was resulting in empty IPAddrBlock extensions for manifests and Ghostbusters records. See #406. }}}

Trac comment by sra on 2013-01-30T00:29:48Z

sraustein commented 11 years ago

openssl-empty-rfc3779-extension-checks.diff Proposed patch to OpenSSL if RFC 3779 authors agree Trac attachment by sra on 2013-03-15T16:12:11Z

sraustein commented 11 years ago

OK, this was a bug introduced when we cut over to the "new" C-based ASN.1 code. rcynic didn't detect it, and all code installed since then has been issuing manifests and ghostbusters records with this problem.

In theory, upgrading to the fixed code [4983] and restarting daemons should fix the problem, as the objects in question will be regenerated automatically. Haven't tested this, nor have I tested whether the code that checks for changed RFC 3779 resources will automatically force immediate regeneration (rather than waiting for the current objects to expire).

Remaining tasks, which are why I am not yet closing this ticket:

  1. Schedule another interop event with Tim and Andrew, probably in Orlando, since Tim's code would have detected this, suggesting that it's been too long since the last interop event.
  2. Talk to Steve Kent and Karen Seo about whether the absence of any explicit prohibition on empty resource extensions is an error in RFC 3779, and, if so, submit an erratum report to the RFC Editor and a patch (draft attached to this ticket) to the OpenSSL project.

Trac comment by sra on 2013-01-30T00:46:07Z

sraustein commented 11 years ago

svn updated configured made installed stopped daemons started daemons

jay's copy of tim's validator, which is not the one that gives a better message, still thinks altCA is bad.

http://12.0.1.223:8080/trust-anchors altCA2

Trac comment by randy on 2013-01-30T00:54:18Z

sraustein commented 11 years ago

jay's copy of tim's validator, which is not the one that gives a better message, still thinks altCA is bad.

I did not expect fix to be instantaneous. Question is whether it's hours or days or requires manual intervention; I think not the latter but would have to poke at it more to be sure.

FYI, manual intervention, should it prove necessary, probably looks like:

  1. Stop daemons.
  2. Manually remove the manifest, CRL, subject certificate, and PKCS#10 (not published) generated by rootd. Be careful not to delete the root certificate or root key, those are //not// generated by rootd (they're also OK, as far as I know).
  3. Restart daemons.
  4. For each entity hosted by that rpkid, issue a command like the following on the command line: {{{ $ rpkic -i entity-name force_reissue }}} where "entity-name" is the rpkid "self handle" for that entity (probably the same as the name you see in the GUI, unless you've done something weird).

But it may just cure itself given a few hours. Your call.

Trac comment by sra on 2013-01-30T01:02:53Z

sraustein commented 11 years ago

i will pretend to be patient

Trac comment by randy on 2013-01-30T01:05:07Z

sraustein commented 11 years ago

well, over twelve hours later and jay's copy of tim's validator still says altCA is invalid.

Trac comment by randy on 2013-01-30T13:47:29Z

sraustein commented 11 years ago
  1. Manually remove the manifest, CRL, subject certificate, and PKCS

    10 (not published) generated by rootd. Be careful not to delete the

    root certificate or root key, those are //not// generated by rootd

these are the content of ../publication/altCA/1 ?

/usr/local/share/rpki/publication/altCA/1# l total 20 -rw-r--r-- 1 root wheel 1345 Nov 13 22:20 3T4fihnQdxX8Q1IGzosxEoQYMjU.cer -rw-r--r-- 1 root wheel 1280 Nov 13 22:14 QtQuBTaAD9bUfoS49ML7WYblA-c.cer -rw-r--r-- 1 root wheel 1788 Jan 8 23:49 jmuTD-ppv4jaYNs1KRWgcQ8sRdc.gbr -rw-r--r-- 1 root wheel 434 Jan 30 13:50 praQSFw521ZTbV9xdb8--UIzAA8.crl -rw-r--r-- 1 root wheel 1953 Jan 30 13:50 praQSFw521ZTbV9xdb8--UIzAA8.mft

Trac comment by randy on 2013-01-30T14:00:16Z

sraustein commented 11 years ago
  1. Manually remove the manifest, CRL, subject certificate, and PKCS

    10 (not published) generated by rootd. Be careful not to delete the

    root certificate or root key, those are //not// generated by rootd

these are the content of ../publication/altCA/1 ?

No, that's a directory maintained by rpkid.

rootd's outputs are in a directory specified deep in your rpki.conf. It's the same as the directory named by the SIA URI in your root certificate:

In fact, all of the files you need to delete are named in the rootd section of rpki.conf. I did mention that rootd is not very smart.

{{{ $ rsync rsync://ca0.rpki.net/rpki/ drwxr-xr-x 512 2013/01/30 08:54:15 . -rw-r--r-- 4878 2013/01/30 08:54:16 1.tgz -rw-r--r-- 1195 2013/01/23 22:14:06 altCA.cer -rw-r--r-- 420 2013/01/23 22:14:06 root.crl -rw-r--r-- 1685 2013/01/23 22:14:06 root.mft drwxr-xr-x 512 2012/11/13 17:14:53 altCA }}}

The files you need to delete are altCA.cer, root.crl, and root.mft. Most likely just deleting altCA.cer would force regeneration of all of them, but as long as you're doing this you might as well delete all three.

No idea what 1.tgz is, that's got to be something you did manually. If it's a saved copy of what used to be the 1/ subdirectory, please put the directory back.

Note that altCA.cer is not the same thing as root.cer, the latter is the one you generated manually with openssl.

Trac comment by sra on 2013-01-30T14:10:18Z

sraustein commented 11 years ago

the 1.tgz was a just in case

i killed the daemons /usr/local/share/rpki/publication# rm altCA.cer root.crl root.mft restarted daemons

rpkic -i IETF force_reissue

rpkic -i rgnet force_reissue

jay takes longer to process, gets four valid objects instead of one, but then gets an internal error on

publication/altCA/1/praQSFw521ZTbV9xdb8--UIzAA8.mft

Trac comment by randy on 2013-01-30T14:20:48Z

sraustein commented 11 years ago

rpkic -i IETF force_reissue

rpkic -i rgnet force_reissue

Entity names have to match what you have loaded. s/IETF/ietf/, since you appear to have entered it in lowercase originally, and you skipped altCA itself.

Updated rcynic detects the problem, at least it does for me.

Trac comment by sra on 2013-01-30T14:49:39Z

sraustein commented 11 years ago

the gooey tells me the child is IETF. but i did ietf

jay still whines

Trac comment by randy on 2013-01-30T15:03:46Z

sraustein commented 11 years ago

Sigh. I know what the problem is, or, rather, what the problems are.

  1. rootd is stupid, as I may have mentioned. It gets its idea of the current CRL number by reading the current root.crl; when it needs to generate a new one, it increments that by one. When you deleted the current root.crl, it reset the root.crl CRL number to one. Feh. My bad for telling you to delete root.crl as part of the cleanup for the earlier bug. There's a way to fix this too, but it's sick, and not worth bothering with until we fix the other problem. For the record, though, before I forget what it is, the kludge for this is something like this: {{{

    !sh

    for i in . . . . . do rm altCA.cer rpkic -i altCA force_reissue sleep 30 done }}} That is: force rootd to regenerate its child certificate five times, thus forcing new CRL and manifest each time, to push the CRL number up past what it was before you deleted it. Yes, this is sick. Did I mention that rootd is stupid?

  2. rpkid preserves the manifest certificate for reuse, because key generation is expensive. In retrospect, perhaps I should have just preserved the key and regenerated the manifest certificate every time. In any case, "rpkic force_reissue" doesn't force regeneration of the manifest certificate. It probably should. This will require a minor code fix.

Trac comment by sra on 2013-01-30T15:55:31Z

sraustein commented 11 years ago

just guessing, but i suspected you planned to use the $i for something. so how about the following:

{{{

!sh

for i in rgnet IETF altCA; do rm altCA.cer rpkic -i $i force_reissue sleep 30 done }}}

Trac comment by randy on 2013-01-31T00:54:53Z

sraustein commented 11 years ago

sigh. my hack to your hack did not work. tim still pukes on altCA's manifest, altCA/1/praQSFw521ZTbV9xdb8--UIzAA8.mft

Trac comment by randy on 2013-01-31T00:57:12Z

sraustein commented 11 years ago

just guessing, but i suspected you planned to use the $i for something.

Nope. Syntax requires it, not used otherwise.

Hint: Count the dots.

Trac comment by sra on 2013-01-31T01:29:59Z

sraustein commented 11 years ago

ok, i assumed the dots were elipses for the childrens' handles

i used yours literally. got a lot of {{{ rm: altCA.cer: No such file or directory rm: altCA.cer: No such file or directory rm: altCA.cer: No such file or directory }}} and tim still does not like the result

wanna increase the sleep?

Trac comment by randy on 2013-01-31T02:04:03Z

sraustein commented 11 years ago

Yeah, 240 or 300 seconds sleep might do it.

Trac comment by sra on 2013-01-31T02:47:17Z

sraustein commented 11 years ago

at 300s, no whining messages but tim still whines

Trac comment by randy on 2013-01-31T03:36:00Z

sraustein commented 11 years ago

this is still broken!!!!

Trac comment by randy on 2013-01-31T23:33:59Z

sraustein commented 11 years ago

this is still broken!!!!

I said it requires a patch. Packaging is higher priority, right?

Trac comment by sra on 2013-01-31T23:36:39Z

sraustein commented 11 years ago

sorry, missed that

Trac comment by randy on 2013-01-31T23:37:21Z

sraustein commented 11 years ago

In [changeset:5029]: {{{

!CommitTicketReference repository="" revision="5029"

Reissue CA's own manifest certificate when doing forced reissue of everything else. See #406. }}}

Trac comment by sra on 2013-02-05T17:44:52Z

sraustein commented 11 years ago

So part of the reason why the silly hack to force rootd to regenerate the CRL and manifest didn't work is that rpkic force_reissue only affects rpkid's relationship with its children, not with its parent(s). It turns out that we have no left-right protocol operation for "please force an immediate up-down list query to the parent", because there's usually no sane reason why we would want to do that. The 300 second sleep variation works because by that time the normal internal cron cycle has woken up and done an up-down list query, triggering the update.

There may be some way to force the required behavior directly with the testpoke up-down test client, but I suspect it would take me longer to work this out than it would just to let the 300 second timer expire in a loop.

Anyway: the reason the problem is not fixed yet is that we didn't bump the CRL number high enough. There were copies out there already with CRL and manifest numbers up at least as high as 12, and we only bumped the regenerated number up to 6. So I'm running the loop kludge to bump it up some more, which should, in theory, fix the immediate problem.

Remains the question of what rcynic should do in case where results are inconsistent (crl/mft dates indicate different ordering from crl/mft numbers, all signatures check). That one probably needs to be a discussion at least among the validator cabal if not also the WG.

Trac comment by sra on 2013-02-18T00:49:41Z

sraustein commented 11 years ago

OK, ca0.rpki.net/rpki/root.{crl,mft} numbers are now at 19. Does this solve the original problem?

I'm now seeing "tainted because not in manifest" warnings at https://rescert.psg.com/ca0.rpki.net.html, haven't tracked those down yet. I suspect that something in the set of commands we used before was generating new objects but not withdrawing old ones. Will figure out what this is and fix it, but as long as it's only affecting things like .gbr objects, it's not critical, as they use inheritance, thus can't have stale resources.

Trac comment by sra on 2013-02-18T01:45:16Z

sraustein commented 11 years ago

OK, ca0.rpki.net/rpki/root.{crl,mft} numbers are now at 19. Does this solve the original problem?

no. jay's install of tim's validator still says bad

randy

Trac comment by randy on 2013-02-19T18:11:26Z

sraustein commented 11 years ago

I am having difficulty reproducing this with rcynic. There was some lingering garbage from my inept attempts to invent a process for cleaning up after the original bug, but at this point I think I've dealt with all the resulting garbage, and rcynic appears to be running cleanly. Wait an hour for https://rescert.psg.com/ca0.rpki.net.html to catch up, for confirmation.

Generated manifest EE certificates look right to me. I might of course be missing something.

We may need to talk to Tim about what his code sees now.

Trac comment by sra on 2013-02-20T01:19:45Z

sraustein commented 11 years ago

jay, rob thinks the problem may be that your cache has retained some sick history. do you know how to clear it?

Trac comment by randy on 2013-02-20T01:26:23Z

sraustein commented 11 years ago

I haven't read the whole ticket (looks interesting). There isn't much to the set-up of the RIPE validator. Tim/Alex might know some surgical method to clear bad cached data. I don't, but I would be willing to stop my running instance, throw away the all subdirectories & files the validator has created, then start it up again fresh. I guess rather than actually throwing anything away, I should mv them aside some place where they won't be used but could be inspected later if desired.

If that's what you would like, let me know and I'll do it.

Trac comment by jayb on 2013-02-20T03:02:57Z

sraustein commented 11 years ago

would you mind gicing it a try? thanks

Trac comment by randy on 2013-02-20T03:19:02Z

sraustein commented 11 years ago

DONE. To confirm: I stopped my long-running RIPE RPKI Validator v2.7 on sunfire.cbbtier3.att.net, mv'ed the entire directory including the 2.7 software executables, configuration, repo, etc to "TOXIC", did a fresh 'unzip' of the bundle, copied over to the new install my arin.tal, altCA2.tal and altCA3.tal and restarted it. The restart occurred almost exactly at:

Wed Feb 20 03:30:00 UTC 2013

http://12.0.1.223:8080/trust-anchors still doesn't appear happy with either altCA instance (where they vary only in some formatting)

Trac comment by jayb on 2013-02-20T03:42:11Z

sraustein commented 11 years ago

Testing confirmed that RIPE's validator chokes on any object type it doesn't understand, and that the result of that choking is not just to reject the object (which would be reasonable and proper) but also to reject the manifest in which it was found and all objects under the CA that issued the manifest.

While we do not know for sure that this is the only problem remaining in Jay's validator, we do know that a single Ghostbusters record is enough to trigger this behavior, and altCA itself has a Ghostbusters record, so we expect RIPE's validator to reject almost the entire altCA repository until RIPE fixes this bug.

I'm going to close this ticket, since we've fixed the empty RFC 3779 extension problem that was the original subject of this ticket. Please reopen if you feel this was in error or if new data shows up suggesting that the remaining problem is not just the known RIPE bug.

Trac comment by sra on 2013-03-15T16:12:11Z

sraustein commented 11 years ago

Closed with resolution fixed