NLnetLabs / routinator

An RPKI Validator and RTR server written in Rust
https://nlnetlabs.nl/projects/routing/routinator/
BSD 3-Clause "New" or "Revised" License
462 stars 70 forks

Ensure data completeness #115

Closed AlexanderBand closed 5 years ago

AlexanderBand commented 5 years ago

Users of Routinator should be able to get confirmation through the monitoring endpoint that all data was retrieved and processed correctly.

ebais commented 5 years ago

As you can see in the following example, the second session was 'stale' and didn't update the (Juniper) router with the latest state of the DB.

Inconsistencies can lead to incorrect rejection of (now) valid ROAs.

    show validation session
    Session        State  Flaps  Uptime         #IPv4/IPv6 records
    178.249.a.b    Up     12     1w6d 20:09:07  71872/13121
    178.249.c.d    Up     7      1w6d 20:16:23  69882/12806

ximon18 commented 5 years ago

Hi @ebais,

Forgive me if the following is incorrect, as I have limited experience with Routinator and with Juniper routers. Does this actually show a problem with Routinator (presumably the RPKI source both sessions connect to), or could it just be showing a connectivity problem between the router and one of the two Routinator instances? That is, could the c.d session have last connected to its Routinator instance less recently than the a.b session? Or do you have output from the two Routinator instances showing that their internal data differs?

I work for NLnet Labs, and as part of my work on the NLnet Labs Gantry project (which aims to test Routinator with virtual service routers) I did some limited testing with the Juniper vMX 18.2R1.9 VSR and Routinator. In that testing I used the command show validation session detail to retrieve the serial number of the update (e.g. see this example).

For ease, here is the command output from the Juniper router as quoted in that example:

    Session 157.230.84.67, State: up, Session index: 2
      Group: routinator, Preference: 100
      Port: 3323
      Refresh time: 300s
      Hold time: 600s
      Record Life time: 3600s
      Serial (Full Update): 0
      Serial (Incremental Update): 0
        Session flaps: 0
        Session uptime: 00:00:32
        Last PDU received: 00:00:29
        IPv4 prefix count: 71596
        IPv6 prefix count: 12976

Does your Juniper router support the command show validation session detail? If so, what does it show? I'm wondering whether, like the Juniper vMX example here, you can see the serial number of the session (though I'm not clear whether that serial number is something captured from Routinator or just a counter incremented on the router).

I'd be interested to know exactly which Juniper model you are using. We could then consider whether it makes sense to test against that specific version, or whether we should update the Gantry test to run with two connected Routinators for a longer period to replicate this kind of session variance.

Thanks,

Ximon

ebais commented 5 years ago

There was a difference in the serial in this particular case, yes.

The correct session had a serial of 254, while the stale session was stuck at 34:

    show validation session detail
    Session 178.249.a.b, State: up, Session index: 2
      Group: rpki-validator, Preference: 100
      Local IPv4 address: 5.10.a.b, Port: 3323
      Refresh time: 120s
      Hold time: 240s
      Record Life time: 3600s
      Serial (Full Update): 254
      Serial (Incremental Update): 254
        Session flaps: 5
        Session uptime: 1w6d 20:08:41
        Last PDU received: 00:00:38
        IPv4 prefix count: 71872
        IPv6 prefix count: 13121
    Session 178.249.c.d, State: up, Session index: 3
      Group: rpki-validator, Preference: 100
      Local IPv4 address: 5.10.c.d, Port: 3323
      Refresh time: 120s
      Hold time: 240s
      Record Life time: 3600s
      Serial (Full Update): 34
      Serial (Incremental Update): 34
        Session flaps: 5
        Session uptime: 1w6d 20:15:58
        Last PDU received: 00:01:01
        IPv4 prefix count: 69882
        IPv6 prefix count: 12806
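A lagging serial like this can be spotted mechanically from the router output. A rough sketch in Python, assuming output in the shape quoted above (note that RTR serial numbers are only meaningful within a single cache session, so comparing them across two independent Routinator instances is a heuristic at best; the `max_lag` threshold is an arbitrary example):

```python
import re

# Sample trimmed to the fields the check needs.
SAMPLE = """
Session 178.249.a.b, State: up, Session index: 2
  Serial (Full Update): 254
Session 178.249.c.d, State: up, Session index: 3
  Serial (Full Update): 34
"""

def stale_sessions(text, max_lag=10):
    """Return session addresses whose full-update serial lags the newest
    serial by more than max_lag."""
    sessions = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Session (\S+),", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"Serial \(Full Update\): (\d+)", line)
        if m and current:
            sessions.append((current, int(m.group(1))))
    if not sessions:
        return []
    newest = max(serial for _, serial in sessions)
    return [addr for addr, serial in sessions if newest - serial > max_lag]

print(stale_sessions(SAMPLE))  # → ['178.249.c.d']
```

A more robust variant would track whether each session's own serial keeps advancing over time rather than comparing sessions against each other.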

We are using vMX as well.

Model: vmx Junos: 16.1R3.10

ximon18 commented 5 years ago

Thanks @ebais!

partim commented 5 years ago

#122 adds three new gauges to the monitoring data that show how long ago an update was last started, how long ago an update last finished successfully, and how long the last update took:

    # HELP last_update_start seconds since last update started
    # TYPE last_update_start gauge
    last_update_start 20

    # HELP last_update_duration duration in seconds of last update
    # TYPE last_update_duration gauge
    last_update_duration 17

    # HELP last_update_done seconds since last update finished
    # TYPE last_update_done gauge
    last_update_done 2

Using this data, an alert can be set up to fire when the last update started more than --refresh seconds ago.
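As a sketch, such an alert could be expressed as a Prometheus alerting rule along these lines (the group name, alert name, and durations are placeholders; the 600-second threshold should match whatever --refresh is set to):

```yaml
groups:
  - name: routinator
    rules:
      - alert: RoutinatorUpdateStale
        # Fires when the last validation run started longer ago than the
        # refresh interval; 600 is a placeholder for your --refresh value.
        expr: last_update_start > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Routinator has not started a validation run within the refresh interval"
```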

That’s what I had in mind for this ticket. Hope that’s enough?

AlexanderBand commented 5 years ago

These new gauges work well in Prometheus/Grafana and allow setting up alerts when validation takes overly long. Thanks!
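Where Prometheus is not available, the same gauges can be checked with a small script. A minimal sketch in Python, assuming the exposition text has been fetched from the monitoring endpoint (the metric name follows the quoted output above; the 600-second default threshold is an arbitrary example, not a Routinator default):

```python
def parse_gauges(text):
    """Parse simple 'name value' sample lines from Prometheus exposition text."""
    gauges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        try:
            gauges[name] = float(value)
        except ValueError:
            pass  # skip anything that is not a simple sample line
    return gauges

def update_is_fresh(text, refresh=600):
    """True when the last update started no more than `refresh` seconds ago."""
    return parse_gauges(text).get("last_update_start", float("inf")) <= refresh

sample = (
    "# HELP last_update_start seconds since last update started\n"
    "# TYPE last_update_start gauge\n"
    "last_update_start 20\n"
)
print(update_is_fresh(sample))  # → True
```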

Screenshot 2019-05-24 at 19 50 10