Closed reuteran closed 7 years ago
Can confirm that this is reproducible. Don't know why yet.
Results so far look disturbingly like a Django caching problem. That is: instead of writing through to SQL, serial number updates to the database are stopping at the Django caching layer, after which things get confused, with old cached values being used instead of the updated value from SQL. Explicitly bypassing the Django caching layer helps some but does not completely fix the problem. Not sure why.
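A toy model of the failure mode described above may help. This is not Django's code or rpkid's: `SerialStore`, `CachedCA`, and `issue` are hypothetical names, and the starting serial of 100 is arbitrary. It only assumes that each task loads its own in-memory copy of the serial counter and that updates may or may not be written through to the backing store.

```python
class SerialStore:
    """Stands in for the SQL row holding the CA's last-issued serial (hypothetical)."""

    def __init__(self, last_serial):
        self.last_serial = last_serial


class CachedCA:
    """Hypothetical CA object that keeps its own copy of the serial counter."""

    def __init__(self, store, write_through):
        self.store = store
        self.write_through = write_through
        # Snapshot of the counter taken when the object is loaded.
        self.cached_serial = store.last_serial

    def next_serial(self):
        self.cached_serial += 1
        if self.write_through:
            # Persist the new value so later loads see it.
            self.store.last_serial = self.cached_serial
        return self.cached_serial


def issue(write_through):
    """Run two tasks, each loading a fresh CA object and issuing two serials."""
    store = SerialStore(last_serial=100)  # arbitrary starting value
    serials = []
    for _ in range(2):
        ca = CachedCA(store, write_through)
        serials.append(ca.next_serial())
        serials.append(ca.next_serial())
    return serials


if __name__ == "__main__":
    print("no write-through:", issue(False))
    print("write-through:   ", issue(True))
```

Without write-through, the second task starts from the stale stored value and hands out the same serials again, which is the kind of reuse seen in the report.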
If I had to guess, this somehow ties back to the abominations we had to perform in the tasking layer to allow us to update tens of thousands of ROAs at once while trying to survive another of Randy's test scenarios: the control structure needed to survive that is somewhat contorted, and, to the extent that I understand what's happening here at all, the contorted control structure may be interacting badly with the Django caching code. Feh.
OK, new commit which may solve the problem. Kind of scary if it does. Give it an hour or so to run through the build robot.
Underlying problem seems to have been Django's caching code being entirely too clever.
thanks! @reuteran will test. ... but I'm also a bit puzzled. ... why do you need Django at all?
All of the database code is written using the Django ORM, among other reasons because doing so gives us SQL engine portability and migrations. Other tools would have solved the same problem, but we were already into Django for the GUI, so just using Django for all of it seemed simpler.
So the new version has been running for a good while now and so far it seems like the problem is fixed! No invalid ROAs anymore. Thanks
Glad that fixed it, even if the fix was scary.
I run a cronjob that requests a set of ROAs at 04:00 UTC and another set at 12:00 UTC, all for the same prefixes. A number of these ROAs are invalid, the problem being that the EE certificate was revoked. I confirmed this by looking into the .crl. I did some experimenting and wrote down the resulting ROAs. First I withdrew all existing ROAs by using 'rpkic load_roa_requests' with an empty file.
I then requested these ROAs at around 13:16 UTC:
147.28.240.0/24-24 47065
147.28.241.0/24-24 47065
147.28.242.0/24-24 47065
147.28.243.0/24-24 47065
147.28.244.0/24-24 47065
147.28.245.0/24-24 47065
147.28.246.0/24-24 47065
147.28.247.0/24-24 47065
147.28.248.0/24-24 47065
147.28.249.0/24-24 47065
The ROAs got published, with 2 of them being invalid because of a revoked EE cert. This table shows the prefix, serial number of the EE cert and the signing time in the ROA:
Since this has been happening for a few days, earlier ROAs with serial numbers 2047 and 2048 already existed. These were withdrawn at some point yesterday, but the CA still issued their serial numbers to the new ones.
I then requested these ROAs for the same prefixes, with some of them having a different AS number and some of them staying the same:
147.28.240.0/24-24 47065
147.28.241.0/24-24 51224
147.28.242.0/24-24 47065
147.28.243.0/24-24 47065
147.28.244.0/24-24 51224
147.28.245.0/24-24 51224
147.28.246.0/24-24 47065
147.28.247.0/24-24 51224
147.28.248.0/24-24 47065
147.28.249.0/24-24 51224
This was in the resulting ROAs:
There seems to be some kind of mix-up with the serial numbers: they are given to a new ROA even though they were already used in a previous ROA that is now revoked. There is also a case where the same serial number (2052) was given to two different ROAs.
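The two symptoms can be checked mechanically once the EE serials and the CRL contents have been extracted. The following is a minimal sketch, assuming the serials are already available as plain integers. `find_serial_problems` is a hypothetical helper, and the sample labels and the pairing of serials to ROAs are made up for illustration. Only the serial values 2047, 2048, and 2052 come from the observations above.

```python
from collections import Counter


def find_serial_problems(issued, revoked):
    """Return (duplicated, reused_revoked) serial numbers.

    issued  -- list of (label, serial) pairs, one per currently published ROA
    revoked -- set of serials listed in the CRL
    """
    counts = Counter(serial for _, serial in issued)
    # Serials that appear on more than one published ROA.
    duplicated = sorted(s for s, n in counts.items() if n > 1)
    # Serials on published ROAs that the CRL already lists as revoked.
    reused_revoked = sorted({s for _, s in issued if s in revoked})
    return duplicated, reused_revoked


if __name__ == "__main__":
    # Hypothetical sample; the labels do not correspond to real prefixes.
    issued = [("roa-a", 2047), ("roa-b", 2052), ("roa-c", 2052), ("roa-d", 2053)]
    revoked = {2047, 2048}
    print(find_serial_problems(issued, revoked))
```

Either list being non-empty would indicate the same class of problem reported here.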
I am running version 'buildbot-1.0.1484492702'. I've attached the truncated rpkid logs, starting with the withdrawal of all existing ROAs:
tr_rpkid.txt