bgp / stayrtr

RPKI-To-Router server implementation in Go
BSD 3-Clause "New" or "Revised" License
91 stars 13 forks source link

Evict stale VRPs if (buildtime+24h < now() || expired) #15

Closed job closed 1 year ago

job commented 3 years ago

The following snippet prevents us from loading a stale VRP file: https://github.com/bgp/stayrtr/blob/master/cmd/stayrtr/stayrtr.go#L269-L278

However, this check does not clear the previous state. The check helps prevent loading new stale data, but does not guard against VRPs in the current cache which became stale.

Two checks should probably be added to help prevent routing based on stale data:

ties commented 3 years ago

I have one question about this approach: Do we have an estimate about how close to each other VRPs from the same CA expire?

In the following situation:

The VRPs from the second ROA expiring could cause downtime. I think with routinator this is called "unsafe VRPs".

Do we want these VRPs to expire at their 'natural' time? I'm torn.

job commented 3 years ago

It is up to the CA to decide when products expire. If a CA chooses to sign things in such a way that one ROA expires at moment X, and another ROA expires at moment Y, all we can do is honor their wishes precisely.

ties commented 3 years ago

True. It is what was published at some point in time - so it is CA intent.

randomthingsandstuff commented 2 years ago

Expiry is not required in VRPJson and is not parsed on receipt. Do we expect that a VRP will be checked on receipt for expiration?

Also, our update loop is timer-based, so we'd need a time to scan for the expired VRPs, nuke them, etc. How often should they get looked at?

ties commented 2 years ago

Some RPs (at least the ripe validator did) scheduled a scan right after the first expiry (so no periodic scans, but calculate the timeout and rescan at that point in time if that is earlier than the next scheduled scan) - and rate limited that because it becomes a trivial DOS vector.

I think a similar approach would apply here

job commented 2 years ago

In OpenBGPd we scan the VRP table every 3 minutes to see if anything expired. I think that’s sufficiently fine grained for practical purposes.

ties commented 2 years ago

Agreed that a few minutes is a sensible rate limit.

And since you can easily calculate when the next object expires I would calculate it instead of scanning at a set interval.

randomthingsandstuff commented 2 years ago

I've had lots of time to think about the timing..

If the interval is based on calculation, it needs to be an offset thing. Otherwise, we will end up with absurd result like:

a) VRPs {x} expires at time t0, removed at t0 b) VRPs {y} expire at t1 = t0 + 1, removed at t1 + min_interval

Only VRPs {x} will get nuked, then minimum interval has to pass to get around for VRPs {y}.

Because I don't want to expire early, I'm proposing the initial expiry time at

t0+min_interval such that everything is delayed by min_interval.

ties commented 2 years ago

Only VRPs {x} will get nuked, then minimum interval has to pass to get around for VRPs {y}.

Because I don't want to expire early, I'm proposing the initial expiry time at

t0+min_interval such that everything is delayed by min_interval.

This may be a cause for very interesting outages. In practice objects are generated with timestamps that are really close to each other.

If there are multiple ASNs for a specific prefix they will be different ROA files with expire within seconds (system created) or longer (user creates the individual objects).

The json input does not contain the information needed to batch objects per CA and grouping their expiry that way (only implicitly assuming one prefix has one CA).

Ideally you would expire all VRPs for one prefix and the less specifics of that prefix at once if "close enough to each other". If they are a meaningful time apart (let's say bigger than half the interval between input updates) I would expire them at separate moments (and accept the potential loss of visibility).

"if a VRP for the same prefix or a less specific expires within [time threshold] of this VRP, expire them all at the same time".

"but it is the CA intent" implies "extremely tricky to debug outages that last a very short interval".

job commented 2 years ago

It still is CA intent. An RP should not attempt to "fix up" data. The CA is in full control to avoid "tricky to debug outages", by for example re-issuing manifests and CRLs in a timely manner, or by automatically renewing ROAs.

benjojo commented 1 year ago

A basic version of this is done, where entries that have expired will be removed every refresh time, even if the backend file/URI data has not changed.

It could be improved later on to check expiry more often, but for now I figure this works good enough