bgp / stayrtr

RPKI-To-Router server implementation in Go
BSD 3-Clause "New" or "Revised" License

Please consider sending "Cache Reset" when client Serial Number is too old (CSCvp82287) #82

Closed lamehost closed 1 year ago

lamehost commented 1 year ago

Dear developers,

We're running stayrtr to send ROAs to routers running IOS-XR versions affected by CSCvp82287.

Along with stayrtr, we also run routinator, and we have noticed that the effects of CSCvp82287 are much smaller for sessions established with the latter. After long debugging, we discovered that routinator sends Cache Reset when the Serial Number requested by the client is too old, and that this mitigates the inconsistent states described by CSCvp82287.

Here's an example from a controlled environment with rtr_client and routinator running in a container.

(venv) marco@lilith:~/CSCvp8228713/rpki-rtr-client$ rtr_client -h 127.0.0.1 -p 323 -S 58407 -s 3
2023-01-27-171415: CONNECT localhost.323
+
2023-01-27-171415: NEW SESSION ID 58407
....withdraw(84.32.25.0/24, 207279, None) - failed
withdraw(45.166.128.0/22, 267957, 24) - failed
withdraw(2602:fb26:900::/48, 23470, None) - failed
withdraw(191.7.72.0/21, 53130, 24) - failed
withdraw(2804:141c::/32, 53130, None) - failed
withdraw(187.120.240.0/20, 53130, 24) - failed
withdraw(181.214.170.0/23, 205474, 24) - failed
withdraw(88.216.18.0/24, 50225, None) - failed
withdraw(2804:55cc::/32, 267957, 48) - failed
withdraw(5.105.131.0/24, 204384, None) - failed

2023-01-27-171415: SESSION 58407 NEW SERIAL 3->12     <--- New serial is 12 (routinator sends Cache Response + Data)

(venv) marco@lilith:~/CSCvp8228713/rpki-rtr-client$ rtr_client -h 127.0.0.1 -p 323 -S 58407 -s 2
2023-01-27-171416: CONNECT localhost.323
+
2023-01-27-171416: NEW SESSION ID 58407
.
2023-01-27-171416: SESSION 58407 NEW SERIAL 2->0       <--- New serial is 0 (routinator sends Cache Reset)

Even though I understand this is IOS-XR's fault rather than stayrtr's, I believe that sending Cache Reset when the serial is too old is much better behavior for the server. It is also in line with RFC 8210:

     If the Serial Numbers in the old
      and new sessions are different enough, the cache will respond to
      the router's Serial Query with a Cache Reset, which will solve the
      problem.
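
For reference, here is a minimal Go sketch of the two PDUs involved in the second exchange above: the router's Serial Query and the cache's Cache Reset, laid out as in RFC 8210. The function names are hypothetical and this is not stayrtr's or routinator's code; protocol version 1 is assumed.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// PDU type codes from RFC 8210.
const (
	pduSerialQuery = 1
	pduCacheReset  = 8
)

// serialQuery (hypothetical helper) encodes a Serial Query PDU:
// version, type, session ID, length (always 12), serial number.
func serialQuery(version uint8, sessionID uint16, serial uint32) []byte {
	b := make([]byte, 12)
	b[0] = version
	b[1] = pduSerialQuery
	binary.BigEndian.PutUint16(b[2:4], sessionID)
	binary.BigEndian.PutUint32(b[4:8], 12)
	binary.BigEndian.PutUint32(b[8:12], serial)
	return b
}

// cacheReset (hypothetical helper) encodes a Cache Reset PDU:
// version, type, two zero bytes, length (always 8). It carries no payload.
func cacheReset(version uint8) []byte {
	b := make([]byte, 8)
	b[0] = version
	b[1] = pduCacheReset
	binary.BigEndian.PutUint32(b[4:8], 8)
	return b
}

func main() {
	// Same session ID and serial as the second rtr_client run above.
	fmt.Printf("Serial Query: % x\n", serialQuery(1, 58407, 2))
	fmt.Printf("Cache Reset:  % x\n", cacheReset(1))
}
```

On receiving Cache Reset, the router is expected to drop its incremental state and issue a Reset Query to fetch the full data set.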

Thank you

benjojo commented 1 year ago

I'll investigate how hard this will be and update you!

ties commented 1 year ago

I assume this is handled here: https://github.com/bgp/stayrtr/blob/13659dd27e1b792dd2a7b9f439ef0a4159d862d9/lib/server.go#L81-L100

Would have to follow the logic to see if this path is actually hit though.

job commented 1 year ago

From the CSCvp82287 bug report it's not clear what the cache can do to help the router recover its state, other than perhaps gratuitously sending a Cache Reset message every X hours?

@lamehost and I have long-running PCAP dumps to collect more data on this issue. Might take another week to get sufficient data.

ties commented 1 year ago

A better description of CSCvp82287 would also help, maybe. The current one does not make sense to me; I don't see why routinator would trigger the Cache Reset.

A pcap from routinator might help? Why does that session reset, if stayrtr sends Cache Reset correctly (as per @job's description above)?

job commented 1 year ago

Symptom: The Route Origin Authorization (ROA) database on Cisco routers is removed when TCP connectivity to the RPKI server goes down. After TCP connectivity recovers, the RPKI session gets re-established, but the RPKI-RTR protocol does not always go into RESET state; in some cases it goes into REFRESH state. With REFRESH it may not download all the ROAs from the RPKI server.

If, after the RPKI session re-establishes, the RPKI-RTR protocol goes into RESET state, then the ROAs are all successfully downloaded.

Conditions: Issue happens when the RPKI session goes down, due to TCP connectivity to RPKI server being lost.

Workaround: None

Further Problem Description: clear bgp rpki server <> can be used to fix the issue if the ROA download has not happened.

The problem is exacerbated by the fact that you can't specify the source interface or address of the RTR session on XR boxes, so the operator can't assign a stable interface such as a loopback interface to serve as source. Instead, the topologically 'closest-by' interface IP is picked, often a router2router linknet, which during the course of normal operations might flap once in a while due to long-haul fiber maintenance.

lukastribus commented 1 year ago

FWIW: I worked around this by moving the session to SSH based transport and then specifying a source interface for the SSH client. This has the added benefit of cryptographic protection between RTR server and client/router.

See: https://beufa.net/blog/rpki-use-routinator-rtr-cache-validator-cisco-ios-xr/

benjojo commented 1 year ago

I believe this is solved by d5be6983b58172b01e077988df3cc6f2e86e3cd8

I'll prep a minor release after I fix a handful of other small things

job commented 1 year ago

Another consideration: the KeepDifference value currently is set to a (very modest) value of 3 https://github.com/bgp/stayrtr/blob/f8b0c87ec8a3f5e57415b0edbcd8d23ebc2e3a96/cmd/stayrtr/stayrtr.go#L622 - this means that if a router requests Serial 5 - while the cache is at serial 10, the difference is 'too big' and the cache will instruct the router to refresh completely. Assuming the cache is refreshed once an hour, this means that clients that connect primed with information older than 3 hours will fully synchronize.
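
A rough Go sketch of the decision described above, under the assumption that the cache keeps incremental diffs for only the last `keep` serials. The `decide` function and the `Response` type are hypothetical names for illustration and are not taken from stayrtr's source.

```go
package main

import "fmt"

// Response is the kind of reply the cache gives to a Serial Query.
type Response int

const (
	SendDiff   Response = iota // Cache Response followed by incremental data
	CacheReset                 // router must resynchronize from scratch
)

// decide is a hypothetical helper, not stayrtr's actual code path.
// keep plays the role of KeepDifference: how many past serials
// the cache can still produce a diff for.
func decide(cacheSerial, clientSerial, keep uint32) Response {
	// Unsigned subtraction wraps modulo 2^32, so the distance stays
	// meaningful when the serial number rolls over. A client that is
	// "ahead" of the cache yields a huge distance and also gets a reset.
	behind := cacheSerial - clientSerial
	if behind > keep {
		return CacheReset
	}
	return SendDiff
}

func main() {
	// The example from the comment above: router asks for serial 5,
	// cache is at serial 10, KeepDifference is 3.
	fmt.Println(decide(10, 5, 3) == CacheReset)
	// A router only two serials behind can still be served a diff.
	fmt.Println(decide(10, 8, 3) == SendDiff)
}
```

Raising `keep` trades memory on the cache for fewer full resynchronizations; with hourly refreshes, a value of 3 covers clients that are at most about three hours behind.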

ties commented 1 year ago

Do real routers support/do real world configurations enable the automatic update after the Serial Notify PDU? If so, I would not expect desynchronisation to happen often.

job commented 1 year ago

     Do real routers support/do real world configurations enable the automatic update after the Serial Notify PDU?

Yes