lamehost closed this issue 1 year ago
I'll investigate how hard this will be and update you!
I assume this is handled here: https://github.com/bgp/stayrtr/blob/13659dd27e1b792dd2a7b9f439ef0a4159d862d9/lib/server.go#L81-L100
Would have to follow the logic to see if this path is actually hit though.
If the router sends a Serial Query with a serial that's higher than what the cache currently has, the cache sends back a Cache Reset message.
If the router (via Serial Query) asks for a serial that's too far in the past (too low), the cache also sends back a Cache Reset message.
From the CSCvp82287 bug report it's not clear what the cache can do to help the router recover its state, other than perhaps gratuitously sending a Cache Reset message every X hours?
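The two Cache Reset cases described above can be sketched as a small decision function. This is only an illustration, not stayrtr's actual code: the names `decide`, `Action`, and `maxLag` are hypothetical, the choice to treat a lag exactly equal to `maxLag` as acceptable is an assumption, and serial number wraparound is ignored for simplicity.

```go
package main

import "fmt"

// Action is what this hypothetical cache does in reply to a Serial Query.
type Action int

const (
	SendDiff       Action = iota // answer with the changes since the router's serial
	SendCacheReset               // tell the router to start over with a Reset Query
)

// decide is an illustrative helper, not a stayrtr function.
// cacheSerial is the cache's current serial, routerSerial is the serial
// carried in the Serial Query, and maxLag (an assumed parameter) is how
// many serials back the cache can still serve incrementally.
// Serial wraparound is deliberately not handled in this sketch.
func decide(cacheSerial, routerSerial, maxLag uint32) Action {
	// Case 1: the router claims a serial the cache has not reached yet.
	if routerSerial > cacheSerial {
		return SendCacheReset
	}
	// Case 2: the router's serial is too far in the past.
	if cacheSerial-routerSerial > maxLag {
		return SendCacheReset
	}
	return SendDiff
}

func main() {
	fmt.Println(decide(10, 12, 3) == SendCacheReset) // serial too high
	fmt.Println(decide(10, 5, 3) == SendCacheReset)  // serial too old
	fmt.Println(decide(10, 8, 3) == SendDiff)        // recent enough
}
```

In this sketch all three checks in `main` print `true`. A router that reconnects after a TCP flap with a stale serial would land in the second case and be told to resynchronize from scratch.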
@lamehost and I have long-running PCAP dumps to collect more data on this issue. Might take another week to get sufficient data.
A better description of CSCvp82287 would also help, maybe. The current one does not make sense to me; I don't see why routinator would trigger the cache reset.
A pcap from routinator might help? Why does that reset, if stayrtr already sends Cache Reset correctly (as per @job's description above)?
Symptom: The Route Origin Authorization (ROA) database on Cisco routers is removed when TCP connectivity to the RPKI server goes down. After TCP connectivity recovers, the RPKI session gets re-established, but the RPKI-RTR protocol does not always go into RESET state; in some cases it goes into REFRESH state, and with REFRESH it may not download all the ROAs from the RPKI server.
If, after the RPKI session re-establishes, the RPKI-RTR protocol goes into RESET state, then the ROAs are all successfully downloaded.
Conditions: Issue happens when the RPKI session goes down, due to TCP connectivity to RPKI server being lost.
Workaround: None
Further Problem Description:
clear bgp rpki server <> can be used to fix the issue if the ROA download has not happened.
The problem is exacerbated by the fact that you can't specify the source interface or address of the RTR session on XR boxes, so the operator can't assign a stable interface such as a loopback interface to serve as source. Instead, the topologically 'closest-by' interface IP is picked, often a router2router linknet, which during the course of normal operations might flap once in a while due to long-haul fiber maintenance.
FWIW: I worked around this by moving the session to SSH-based transport and then specifying a source interface for the SSH client. This has the added benefit of cryptographic protection between the RTR server and the client/router.
See: https://beufa.net/blog/rpki-use-routinator-rtr-cache-validator-cisco-ios-xr/
I believe this is solved by d5be6983b58172b01e077988df3cc6f2e86e3cd8
I'll prep a minor release after I fix a handful of other small things
Another consideration: the KeepDifference value is currently set to a (very modest) value of 3 (https://github.com/bgp/stayrtr/blob/f8b0c87ec8a3f5e57415b0edbcd8d23ebc2e3a96/cmd/stayrtr/stayrtr.go#L622). This means that if a router requests serial 5 while the cache is at serial 10, the difference is 'too big' and the cache will instruct the router to refresh completely. Assuming the cache is refreshed once an hour, clients that connect primed with information older than 3 hours will fully synchronize.
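The "3 hours" figure follows from multiplying the number of retained serials by the refresh interval. A minimal sketch of that arithmetic, assuming the serial increases by exactly one per refresh (the helper name `maxStaleness` is hypothetical and not part of stayrtr):

```go
package main

import (
	"fmt"
	"time"
)

// maxStaleness estimates how old a router's data may be before the cache
// can no longer answer incrementally. It assumes the serial advances by
// one on every refresh, which is an assumption of this sketch.
func maxStaleness(keepDifference int, refreshInterval time.Duration) time.Duration {
	return time.Duration(keepDifference) * refreshInterval
}

func main() {
	// KeepDifference of 3 with an hourly refresh, as discussed above.
	fmt.Println(maxStaleness(3, time.Hour)) // prints 3h0m0s
}
```

If the cache refreshes more often, or if refreshes that produce no changes do not bump the serial, the real window differs from this estimate.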
Do real routers support/do real world configurations enable the automatic update after the Serial Notify PDU? If so, I would not expect desynchronisation to happen often.
> Do real routers support/do real world configurations enable the automatic update after the Serial Notify PDU?
Yes
Dear developers,
We're running stayrtr to send ROAs to routers running IOS-XR versions affected by CSCvp82287.
Along with stayrtr, we also run routinator, and we have noticed that the effects of CSCvp82287 are much smaller for the sessions established with the latter. After long debugging, we discovered that routinator sends Cache Reset when the Serial Number requested by the client is too old, and that this mitigates the inconsistent states described by CSCvp82287.
Here's an example from a controlled environment with rtr_client and routinator running in a container.
Even though I understand this is IOS-XR's fault rather than stayrtr's, I believe that sending Cache Reset when the serial is too old is a much better behavior for the server. It is also in line with RFC 8210.
Thank you
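For readers unfamiliar with the PDUs mentioned in this issue, here is a rough sketch of how a Serial Query and a Cache Reset look on the wire according to the RFC 8210 PDU layouts (protocol version 1, PDU types 1 and 8). The function names and the session ID value are made up for illustration, and this is not stayrtr's or routinator's encoder.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	protocolVersion = 1 // RPKI-RTR version defined by RFC 8210
	pduSerialQuery  = 1 // PDU type: Serial Query (router to cache)
	pduCacheReset   = 8 // PDU type: Cache Reset (cache to router)
)

// encodeSerialQuery builds the 12-byte Serial Query PDU:
// version, type, session ID, total length, serial number.
func encodeSerialQuery(sessionID uint16, serial uint32) []byte {
	b := make([]byte, 12)
	b[0] = protocolVersion
	b[1] = pduSerialQuery
	binary.BigEndian.PutUint16(b[2:4], sessionID)
	binary.BigEndian.PutUint32(b[4:8], 12)
	binary.BigEndian.PutUint32(b[8:12], serial)
	return b
}

// encodeCacheReset builds the 8-byte Cache Reset PDU:
// version, type, two zero bytes, total length.
func encodeCacheReset() []byte {
	b := make([]byte, 8)
	b[0] = protocolVersion
	b[1] = pduCacheReset
	// b[2:4] stay zero, as the field is unused in this PDU.
	binary.BigEndian.PutUint32(b[4:8], 8)
	return b
}

func main() {
	// 0x1234 is an arbitrary example session ID.
	fmt.Printf("% x\n", encodeSerialQuery(0x1234, 5)) // 01 01 12 34 00 00 00 0c 00 00 00 05
	fmt.Printf("% x\n", encodeCacheReset())           // 01 08 00 00 00 00 00 08
}
```

Byte sequences like these are what one would look for in a packet capture when checking whether the cache answered a stale Serial Query with a Cache Reset.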