RIPE-NCC / rpki-validator-3

RIPE NCC RPKI Validator 3

Stale Data with RPKI Validator 3 #275

Open alkhos opened 3 years ago

alkhos commented 3 years ago

Hello, I have RIPE RPKI Validator 3 deployed on a number of VMs running Ubuntu 18.04, installed using the Debian instructions in the Wiki. We have had it running there for a while now, and we seem to be having a couple of issues with the validator:

  1. Once in a while, the servers stop getting new data. I can see this by monitoring http:///api/trust-anchors/statuses and noticing that "lastUpdated" is lagging behind the current time by a matter of days. The situation goes away after restarting the RPKI validator ( systemctl restart rpki-validator-3 ). Has anybody had a similar issue and, if so, what was the cause of it?

  2. Our servers also deviate in the number of errors, warnings, and even successful counts in the same "trust-anchors/statuses" when compared to RIPE's public server (https://rpki-validator.ripe.net/). I can see that a lot of these are errors in these categories ( as seen in the validation runs API )

Also, almost all of these errors can be traced to RRDP repositories ( and not the rsync ones ). We run the 8/20 build, for reference. Is there any reason for such deviation? Or are there specific things we have to note in the configuration to avoid these situations?
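For reference, the staleness check described in item 1 could be scripted roughly as below. This is only a sketch under assumptions: it takes already-parsed status entries, and it assumes each entry carries a name under a hypothetical `taName` key and that "lastUpdated" is an ISO 8601 timestamp. Neither detail is confirmed in this thread, so adjust both to whatever the API actually returns.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    # Assumption: "lastUpdated" is an ISO 8601 string. A trailing "Z" is
    # rewritten so that older Python versions can parse it as well.
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat timestamps without an offset as UTC.
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def stale_anchors(statuses, now, max_age):
    """Return (name, lag) pairs for entries older than max_age."""
    stale = []
    for entry in statuses:
        lag = now - parse_ts(entry["lastUpdated"])
        if lag > max_age:
            # "taName" is a hypothetical key used for illustration only.
            stale.append((entry["taName"], lag))
    return stale


# Made-up sample data standing in for the parsed API response.
sample = [
    {"taName": "ta-a", "lastUpdated": "2020-10-01T12:00:00Z"},
    {"taName": "ta-b", "lastUpdated": "2020-09-28T12:00:00Z"},
]
now = datetime(2020, 10, 1, 12, 30, tzinfo=timezone.utc)

for name, lag in stale_anchors(sample, now, timedelta(hours=2)):
    print(f"{name} has not updated for {lag}")
```

Passing `now` and `max_age` explicitly keeps the check easy to test and makes the alert threshold a visible choice rather than a buried constant.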

JvGinkel commented 3 years ago

I had the same; for some reason my validator's last update was almost two months ago. Today I updated to the latest version and did some OS updates, and after the restart it's up to date again. I added a monitoring check that will alert me again if the cache is older than 5 days.

wibisono commented 3 years ago

Hi,

After the 8/20 build we made releases that fix a potential deadlock, e.g. release 9/18. The stale updates might be related to this issue, which we occasionally encounter; hopefully upgrading to the latest version will solve it.

Please let us know if the problem persists after updating to the latest release.

lukastribus commented 3 years ago

> I added a monitoring check that will alert me again if the cache is older than 5 days.

5-day-old VRPs on a ROV-enabled production network is way too much. Please consider dropping the alert threshold to something like 2 hours instead. What if your validator is buggy and a validation run takes 4 hours instead of 5 minutes? You would never know.

ROV is supposed to converge fairly quickly, "days" is not a term we should ever have to use ...

ties commented 3 years ago

> 5-day-old VRPs on a ROV-enabled production network is way too much. Please consider dropping the alert threshold to something like 2 hours instead. What if your validator is buggy and a validation run takes 4 hours instead of 5 minutes? You would never know.
>
> ROV is supposed to converge fairly quickly, "days" is not a term we should ever have to use ...

In practice a TA may not update for an extended period, causing spurious alerts (in my experience only on APNIC's AS0 TA). This Prometheus alert works pretty well for me (no false positives and no alerts since the last release):

alert: ValidatorDown
expr: time() - rpkivalidator_last_validation_run{trust_anchor!="rsync://rpki-as0.apnic.net/repository/APNIC-AS0-AP/apnic-rpki-root-as0-origin.cer"} > 3600
for: 5m
labels:
  severity: critical
annotations:
  summary: Trust anchor {{ $labels.trust_anchor }} has not updated for 60 minutes.
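
To spell out what the expression in this rule checks, here is a rough Python equivalent. It is a sketch only. It assumes the metric value is the Unix timestamp of the last validation run, which the `time() - ...` subtraction suggests. The `overdue_anchors` helper and the sample label values are made up for illustration, and the `for: 5m` hold time is not modelled.

```python
AS0_TA = (
    "rsync://rpki-as0.apnic.net/repository/APNIC-AS0-AP/"
    "apnic-rpki-root-as0-origin.cer"
)


def overdue_anchors(last_runs, now, threshold=3600, excluded=(AS0_TA,)):
    """Return trust anchors whose last validation run is older than threshold.

    last_runs maps the trust_anchor label to a Unix timestamp in seconds
    (an assumption about what the metric value represents).
    """
    return sorted(
        ta
        for ta, last_run in last_runs.items()
        if ta not in excluded and now - last_run > threshold
    )


# Hypothetical samples: one fresh anchor, one overdue, and the excluded AS0 TA.
now = 1_600_000_000
samples = {
    "rsync://example.net/fresh.cer": now - 600,
    "rsync://example.net/overdue.cer": now - 7200,
    AS0_TA: now - 86400,
}

print(overdue_anchors(samples, now))
```

The AS0 trust anchor is skipped even though it is the oldest entry, mirroring the `trust_anchor!=` matcher in the rule.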

We are discussing some changes for improved scheduling of validation and quicker convergence (and bootstrapping), which should also improve reliability. This may end up in one of our next sprints.