cloudflare / gortr

The RPKI-to-Router server used at Cloudflare
https://rpki.cloudflare.com
BSD 3-Clause "New" or "Revised" License
309 stars 39 forks source link

checktime does not invalidate obsolete VRPs #81

Open lukastribus opened 4 years ago

lukastribus commented 4 years ago

checktime only verifies the valid timestamp in the json metadata when a json file is actually read.

However existing stale VRP's are never invalided.

level=info msg="New update (167437 uniques, 167437 total prefixes). 0 bytes. Updating sha256 hash  -> 03d0470277dacbc625855408b54f8de72e1cd6867c8660d2791034de09daa09d"
level=info msg="Updated added, new serial 0"
level=info msg="GoRTR Server started (sessionID:53928, refresh:3600, retry:600, expire:7200)"
level=info msg="File /var/lib/rpki-client/json is identical to the previous version"
level=info msg="File /var/lib/rpki-client/json is identical to the previous version"
level=error msg="Error updating: File is expired: 2020-08-15 18:26:39 +0000 UTC"
level=error msg="Error updating: File is expired: 2020-08-15 18:26:39 +0000 UTC"
level=error msg="Error updating: File is expired: 2020-08-15 18:26:39 +0000 UTC"
level=error msg="Error updating: open /var/lib/rpki-client/json: no such file or directory"

None of those events invalided the obsolete VRPs that gortr still caches:

I assume the same issues also applies to json files hosted on a HTTP server.

This is concerning for me because a number of realistic issues will lead to a situation where gortr keeps sending bogus VRP's to the routers.

I think that checking valid when reading the json is not enough, but the timestamp needs to be saved and if expired, needs to invalidate the entire cache.

lspgn commented 4 years ago

Hi @lukastribus, Yes this behavior was implemented on purpose to provide resilience in case of a temporary validator issue. Few things to mention:

The choice was made to ensure continuity in Origin Validation rather than bouncing routes/VRPs (+ added complexity on RTR serial management in some cases).

The attack scenario I took into account while writing the software: it would be to break the link between GoRTR and the validator and then create a hijack on a previously signed resource. In the optimal security model, with distributed GoRTR close to the network device, the RTR sessions are maintained and the previously valid data is still used until the validator comes back up, leaving time for an engineer to figure it out and "freezing the state". Prometheus metrics should be used to alert on an unexpected behavior (eg: more than 1hr since last valid update).

The JSON only contains VRPs valid at a point in time. We are assuming that one hour later will likely be the same. This is due to the whole asynchronicity of the framework. The resulting file does not include all the expiration corresponding to the certificates per VRP. An organization/user can control the generation of the JSON and its distribution. If the distribution process not correspond to the generation process (eg: not meant to be refreshed periodically), the timers/checks can be disabled inside GoRTR.

A possibility would be to fail-close (GoRTR stops upon an invalid file) and let the network devices figure it out: eg: using their RTR cache. I do not think a fail-open would be ideal (withdrawing all the VRPs upon invalid file). I want to keep the current behavior (fail-safe) as default, but I don't mind adding a CLI argument to switch between the three options. Unfortunately, I do not have time to implement this but I would accept PRs.

lukastribus commented 4 years ago

Yeah I can see how that design choice makes a lot of sense for Cloudflare (when you have multiple senior *nix people in different timezones available 24/7), but I'm concerned about the people setting up a validator and RTR server without proper monitoring and alerting, and then when a failure occurs updates stop and we have stale VRP's on the routers. This could then remain undiscovered, and lead to big problems on production networks.

I like your fail-close proposal, that makes a lot of sense in my mind.

I'm not sure I'm able to implement the changes myself, but I will definitely take a look and try.

Thanks!

job commented 4 years ago

@lspgn can you in more detail describe the (current default?) 'fail-safe' behavior? It is not clear to me how exactly it is fail safe.

fail-close on invalid file makes sense to me, a daemon killing itself when the config file is corrupt is not weird

lspgn commented 4 years ago

@job I based my explanation off the definition of fail-safe. Causing minimal harm to other equipments in my opinion means that it prefers missing a recent update and keep everything running in the same conditions as before until repaired.

job commented 4 years ago

So the current behavior is to keep waiting until a valid file can be fetched again, and until then keep the current state? (I guess we can call that fail-stale? :-)

job commented 4 years ago

making it configurable will be a nice addition!

lukastribus commented 3 years ago

rpki-client 7.1 emits a new per VRP attribute: expires, indicating when the VRP should be dropped as a unix timestamp.

Effort is underway to support this in stayrtr (a gortr fork): https://github.com/bgp/stayrtr/pull/6