RIPE-NCC / rpki-validator-3

RIPE NCC RPKI Validator 3

Differences in processed items from trust anchors from one validator to the other #128

Open racompton opened 4 years ago

racompton commented 4 years ago

Hello, I have two RPKI validators set up on the same subnet with the same access to the Internet. They have the same OS/software/config. The only difference between them is the IPs (both servers are dual stacked). One of my validators looks very similar to what is showing on https://rpki-validator.ripe.net/trust-anchors, but the other shows a status of "Failed" when trying to connect to https://rrdp.apnic.net/notification.xml. On the server that is showing "Failed", I am able to manually run "wget https://rrdp.apnic.net/notification.xml", so it doesn't seem to be a connectivity issue. Is there a way to manually force an update, or anything else I can try? I'm also getting a warning, "Manifest next update time is in the past, local clock may be off", on both boxes, but they are both set to UTC. I see the same warnings on https://rpki-validator.ripe.net, so I'm assuming that it's an issue with the dates on the RIRs' manifests and not the validators.

racompton commented 4 years ago

So my procedure for fixing this is:
Stop the rpki-validator service: "sudo systemctl stop rpki-validator-3.service"
Delete all the .xd files in /var/lib/rpki-validator-3/db: "sudo rm /var/lib/rpki-validator-3/db/*.xd"
Start the rpki-validator service: "sudo systemctl start rpki-validator-3.service"
Re-install the ARIN TAL: "upload-tal.sh arin-rfc7730.tal http://localhost:8080/"
Check the web UI after 30 mins or so to see if things are fixed.
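For anyone who wants to try the steps before running them for real, here is a minimal sketch of the same procedure as a shell function. The service name, database path, TAL file and URL are copied from the steps above; `DRY_RUN`, `run` and `reset_validator` are hypothetical names made up for this sketch, and it is not the validator's own tooling. With `DRY_RUN=1` (the default here) it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the manual reset procedure from this comment.
# DRY_RUN, run() and reset_validator() are illustrative names, not part of
# rpki-validator-3.
SERVICE="rpki-validator-3.service"
DB_DIR="/var/lib/rpki-validator-3/db"
TAL_FILE="arin-rfc7730.tal"
VALIDATOR_URL="http://localhost:8080/"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_validator() {
    run sudo systemctl stop "$SERVICE"
    # sh -c lets the *.xd glob expand with root privileges, in case the
    # calling user cannot list the database directory.
    run sudo sh -c "rm -f $DB_DIR/*.xd"
    run sudo systemctl start "$SERVICE"
    # upload-tal.sh must be on PATH and the TAL file in the current directory.
    run upload-tal.sh "$TAL_FILE" "$VALIDATOR_URL"
}
```

Calling `reset_validator` as-is lists the four commands; setting `DRY_RUN=0` first makes it execute them. The web UI still has to be checked afterwards, as in the last step.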

I'd like to put in a feature request for issues like this to be resolved in a more automated way, rather than having to manually determine that there is an issue and then manually perform this procedure.

lolepezy commented 4 years ago

Could you please clarify: did the validator without connectivity start to connect after you removed the database and restarted it? Do you use a proxy?

lolepezy commented 4 years ago

The reason for this behaviour is, I believe, that downloading the repository snapshot from APNIC takes 15 minutes:

$ wget https://rrdp.apnic.net/4ea5d894-c6fc-4892-8494-cfd580a414e3/128129/snapshot.xml
snapshot.xml 22%[===============> ] 5.40M 24.2KB/s eta 14m 18s
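As a rough cross-check of the numbers in that progress line, the snapshot size and remaining time can be estimated from the percentage, the amount received and the transfer rate. This is only a sketch: `estimate_remaining` is a hypothetical helper, it assumes wget's "M" and "KB/s" are 1024-based, and wget rounds the percentage, so the result is approximate.

```shell
#!/bin/sh
# estimate_remaining PERCENT_DONE RECEIVED_MB RATE_KB_PER_S
# Hypothetical helper: extrapolates the total size and the remaining time
# from the three figures shown in a wget progress line.
estimate_remaining() {
    LC_ALL=C awk -v pct="$1" -v got="$2" -v rate="$3" 'BEGIN {
        total = got * 100 / pct                 # estimated total size in MB
        left_s = (total - got) * 1024 / rate    # remaining MB -> KB -> seconds
        printf "total ~%.1f MB, remaining ~%.1f min\n", total, left_s / 60
    }'
}

# Figures from the wget line above: 22% done, 5.40M received, 24.2KB/s.
estimate_remaining 22 5.40 24.2
```

With those figures the estimate comes out at a snapshot of roughly 24 to 25 MB with 13 to 14 minutes left, which is in line with the ETA wget reports.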

We will have a look at what we can do about it in the validator.

racompton commented 4 years ago

> Could you please clarify: did the validator without connectivity start to connect after you removed the database and restarted it? Do you use a proxy?

Yes, it started to connect after I removed the database and restarted it. No, I don't use a proxy.

racompton commented 4 years ago

FYI, I created a script to fix the validator when it gets out of whack (https://github.com/racompton/restart-validator/blob/master/restart-validator.sh). It stops the validator, deletes all the database files, starts the validator and then loads in the ARIN TAL.

zzaflemi commented 4 years ago

I know this all depends on how you have the software deployed, but I think you can just put a copy of the ARIN TAL (or any other TAL) in the preconfigured-tals directory. Then, when the service starts up and creates a new database, it will load it along with the other included TALs. At least that has worked for me the many times I have deleted the DB in the past.
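A minimal sketch of that approach, assuming you already know where the preconfigured-tals directory lives in your deployment (its location is not given in this thread, so it is passed in as an argument). `install_tal` is a hypothetical helper name, not something shipped with the validator.

```shell
#!/bin/sh
# install_tal TAL_FILE TALS_DIR
# Hypothetical helper: copies a TAL into the preconfigured-tals directory so
# it is picked up when the service next starts with a fresh database.
install_tal() {
    tal_file="$1"
    tals_dir="$2"
    [ -f "$tal_file" ] || { echo "no such TAL file: $tal_file" >&2; return 1; }
    [ -d "$tals_dir" ] || { echo "no such directory: $tals_dir" >&2; return 1; }
    cp -- "$tal_file" "$tals_dir/"
}
```

For example, `install_tal arin-rfc7730.tal /path/to/preconfigured-tals`, run as a user who can write to that directory. After that, the upload-tal.sh step should no longer be needed when the database is deleted.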