dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit

New ROAs have EE cert. revoked #855

Closed reuteran closed 7 years ago

reuteran commented 7 years ago

I run a cronjob that requests one set of ROAs at 04:00 UTC and another set at 12:00 UTC, all for the same prefixes. A number of these ROAs are invalid because their EE certificate has been revoked; I confirmed this by inspecting the CRL. I did some experimenting and wrote down the resulting ROAs. First I withdrew all existing ROAs by running 'rpkic load_roa_requests' with an empty file.

I then requested these ROAs at around 13:16 UTC:

147.28.240.0/24-24 47065
147.28.241.0/24-24 47065
147.28.242.0/24-24 47065
147.28.243.0/24-24 47065
147.28.244.0/24-24 47065
147.28.245.0/24-24 47065
147.28.246.0/24-24 47065
147.28.247.0/24-24 47065
147.28.248.0/24-24 47065
147.28.249.0/24-24 47065

The ROAs got published, with 2 of them being invalid because of a revoked EE cert. This table shows the prefix, serial number of the EE cert and the signing time in the ROA:

Prefix Serial num. Revoked Signing time
147.28.240.0/24 2054 No 2017-01-26T13:16:07.000Z
147.28.241.0/24 2056 No 2017-01-26T13:16:08.000Z
147.28.242.0/24 2047 Yes 2017-01-26T13:16:04.000Z
147.28.243.0/24 2048 Yes 2017-01-26T13:16:05.000Z
147.28.244.0/24 2050 No 2017-01-26T13:16:06.000Z
147.28.245.0/24 2051 No 2017-01-26T13:16:06.000Z
147.28.246.0/24 2049 No 2017-01-26T13:16:05.000Z
147.28.247.0/24 2055 No 2017-01-26T13:16:08.000Z
147.28.248.0/24 2052 No 2017-01-26T13:16:06.000Z
147.28.249.0/24 2053 No 2017-01-26T13:16:07.000Z

Since this has been happening for a few days, earlier ROAs with serial numbers 2047 and 2048 already existed. These were withdrawn at some point yesterday, but the CA still issued their serial numbers to the new ROAs.

I then requested these ROAs:

147.28.240.0/24-24 47065
147.28.241.0/24-24 51224
147.28.242.0/24-24 47065
147.28.243.0/24-24 47065
147.28.244.0/24-24 51224
147.28.245.0/24-24 51224
147.28.246.0/24-24 47065
147.28.247.0/24-24 51224
147.28.248.0/24-24 47065
147.28.249.0/24-24 51224

These are for the same prefixes; some have a different AS number and some are unchanged. The resulting ROAs were:

Prefix Serial num. Revoked Signing time
147.28.240.0/24 2054 No 2017-01-26T13:16:07.000Z
147.28.241.0/24 2053 Yes 2017-01-26T13:54:23.000Z
147.28.242.0/24 2047 Yes 2017-01-26T13:16:04.000Z
147.28.243.0/24 2048 Yes 2017-01-26T13:16:05.000Z
147.28.244.0/24 2049 No 2017-01-26T13:54:22.000Z
147.28.245.0/24 2050 Yes 2017-01-26T13:54:22.000Z
147.28.246.0/24 2049 No 2017-01-26T13:16:05.000Z
147.28.247.0/24 2052 No 2017-01-26T13:54:23.000Z
147.28.248.0/24 2052 No 2017-01-26T13:16:06.000Z
147.28.249.0/24 2051 Yes 2017-01-26T13:54:23.000Z
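To make it easier to see which requests actually changed between the two runs, here is a minimal Python sketch that diffs the two request sets. It assumes each request line has the form "prefix/len-maxlen asn" exactly as written above; `parse_requests` is a hypothetical helper for this illustration, not part of rpkic.

```python
def parse_requests(text):
    """Map each prefix to its origin AS from 'prefix/len-maxlen asn' lines."""
    requests = {}
    for line in text.strip().splitlines():
        prefix_range, asn = line.split()
        # Split off the max length: '147.28.240.0/24-24' -> '147.28.240.0/24'
        prefix, _, _max_len = prefix_range.rpartition("-")
        requests[prefix] = int(asn)
    return requests


# First request set (around 13:16 UTC): every prefix with AS 47065.
first = parse_requests("\n".join(
    f"147.28.{octet}.0/24-24 47065" for octet in range(240, 250)
))

# Second request set, copied from the list above.
second = parse_requests("""
147.28.240.0/24-24 47065
147.28.241.0/24-24 51224
147.28.242.0/24-24 47065
147.28.243.0/24-24 47065
147.28.244.0/24-24 51224
147.28.245.0/24-24 51224
147.28.246.0/24-24 47065
147.28.247.0/24-24 51224
147.28.248.0/24-24 47065
147.28.249.0/24-24 51224
""")

# Prefixes whose origin AS differs between the two runs.
changed = sorted(p for p in second if first.get(p) != second[p])
for prefix in changed:
    print(prefix, first[prefix], "->", second[prefix])
```

The five prefixes this reports (241, 244, 245, 247, 249) are the ones with a 13:54 signing time in the table above; the unchanged prefixes kept their 13:16 ROAs.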

There seems to be some kind of mix-up with the serial numbers: a serial number is given to a new ROA even though it was already used in a previous ROA that is now revoked. There is also a case where the same serial number was given to two different ROAs (serial 2052).
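A quick way to check a listing like this for reused serial numbers is sketched below in Python. The rows are copied from the second table; `duplicated_serials` is a hypothetical helper written for this illustration and is not part of the toolkit.

```python
from collections import defaultdict

# (prefix, EE certificate serial, revoked) rows from the second table.
rows = [
    ("147.28.240.0/24", 2054, False),
    ("147.28.241.0/24", 2053, True),
    ("147.28.242.0/24", 2047, True),
    ("147.28.243.0/24", 2048, True),
    ("147.28.244.0/24", 2049, False),
    ("147.28.245.0/24", 2050, True),
    ("147.28.246.0/24", 2049, False),
    ("147.28.247.0/24", 2052, False),
    ("147.28.248.0/24", 2052, False),
    ("147.28.249.0/24", 2051, True),
]


def duplicated_serials(rows):
    """Return {serial: [prefixes]} for serials that appear on more than one ROA."""
    by_serial = defaultdict(list)
    for prefix, serial, _revoked in rows:
        by_serial[serial].append(prefix)
    return {s: prefixes for s, prefixes in by_serial.items() if len(prefixes) > 1}


dups = duplicated_serials(rows)
for serial, prefixes in sorted(dups.items()):
    print(serial, prefixes)
```

Run against the second table, this flags serial 2052 (on 147.28.247.0/24 and 147.28.248.0/24) and also serial 2049 (on 147.28.244.0/24 and 147.28.246.0/24).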

I am running version 'buildbot-1.0.1484492702'. I've attached the truncated rpkid logs, starting with the withdrawal of all existing ROAs:

tr_rpkid.txt

sraustein commented 7 years ago

Can confirm that this is reproducible. Don't know why yet.

sraustein commented 7 years ago

Results so far look disturbingly like a Django caching problem. That is: instead of writing through to SQL, serial number updates to the database are stopping at the Django caching layer, after which things get confused, with old cached values being used instead of the updated value from SQL. Explicitly bypassing the Django caching layer helps some but does not completely fix the problem. Not sure why.

If I had to guess, this somehow ties back to the abominations we had to perform in the tasking layer to allow us to update tens of thousands of ROAs at once while trying to survive another of Randy's test scenarios: the control structure needed to survive that is somewhat contorted, and, to the extent that I understand what's happening here at all, the contorted control structure may be interacting badly with the Django caching code. Feh.
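For readers unfamiliar with this class of bug, the following is a minimal sketch of how a stale in-memory copy of a counter can hand out the same serial number twice. It is not rpkid's code and may not match the actual mechanism here: it uses plain sqlite3 rather than Django, and the `ca` table, the `CachedCA` class, and the starting value 2046 are all hypothetical.

```python
import sqlite3


def setup():
    """Create an in-memory database with one hypothetical CA row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ca (id INTEGER PRIMARY KEY, last_serial INTEGER)")
    db.execute("INSERT INTO ca VALUES (1, 2046)")  # assumed starting value
    return db


class CachedCA:
    """Hypothetical stand-in for an ORM object holding a cached copy of a row."""

    def __init__(self, db):
        self.db = db
        (self.last_serial,) = db.execute(
            "SELECT last_serial FROM ca WHERE id = 1"
        ).fetchone()

    def next_serial(self):
        # Increments the cached copy and writes it back without re-reading SQL,
        # so another copy of the same row never sees the update.
        self.last_serial += 1
        self.db.execute(
            "UPDATE ca SET last_serial = ? WHERE id = 1", (self.last_serial,)
        )
        return self.last_serial


def next_serial_atomic(db):
    """Let the database do the increment so no stale copy is involved."""
    db.execute("UPDATE ca SET last_serial = last_serial + 1 WHERE id = 1")
    (serial,) = db.execute("SELECT last_serial FROM ca WHERE id = 1").fetchone()
    return serial


# Two cached copies of the same row hand out overlapping serials.
db = setup()
a, b = CachedCA(db), CachedCA(db)
stale = [a.next_serial(), b.next_serial(), a.next_serial(), b.next_serial()]
print(stale)   # [2047, 2047, 2048, 2048]

# Incrementing inside the database yields distinct serials.
db = setup()
atomic = [next_serial_atomic(db) for _ in range(4)]
print(atomic)  # [2047, 2048, 2049, 2050]
```

In Django the comparable idiom for the second approach is an update using an `F()` expression, which pushes the increment into SQL instead of computing it from a value held in Python.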

sraustein commented 7 years ago

OK, new commit which may solve the problem. Kind of scary if it does. Give it an hour or so to run through the build robot.

Underlying problem seems to have been Django's caching code being entirely too clever.

waehlisch commented 7 years ago

Thanks! @reuteran will test. But I'm also a bit puzzled: why do you need Django at all?

sraustein commented 7 years ago

All of the database code is written using the Django ORM, among other reasons because doing so gives us SQL engine portability and migrations. Other tools would have solved the same problem, but we were already into Django for the GUI, so just using Django for all of it seemed simpler.

reuteran commented 7 years ago

The new version has been running for a good while now, and so far it seems like the problem is fixed! No more invalid ROAs. Thanks!

sraustein commented 7 years ago

Glad that fixed it, even if the fix was scary.