cloudflare / gortr

The RPKI-to-Router server used at Cloudflare
https://rpki.cloudflare.com
BSD 3-Clause "New" or "Revised" License

Mismatch between output.json from octorpki and gortr #82

Open · skakee opened this issue 3 years ago

skakee commented 3 years ago

This morning we noticed that one of the ROAs was wrong. When I pulled output.json from octorpki (/output.json), it had the correct value. However, when I ran rtrdump, it showed a different ASN value for the prefix. Restarting the gortr process did fix it; sending SIGHUP did not. The issue was noticed at 8:00 AM and persisted until 10:45, when I restarted the gortr process. Please note that "-refresh" is set to 120.

Can you comment on what the problem might have been and how it could have been addressed short of restarting the process?

Thanks!

/opt/cloudflare/go/bin/gortr -metrics.addr :8281 -ssh.bind :8282 -ssh.key /opt/cloudflare/private.pem -ssh.method.password=true -ssh.auth.user REDACTED -ssh.auth.password REDACTED -verify=false -cache http://localhost:8080/output.json -refresh=120 -bind

lspgn commented 3 years ago

Did you save the JSON? What was the affected prefix and both ASNs? How long was it running? Was it present on GoRTR's JSON endpoint? Was it also wrong inside the router? Another suggestion could be to also verify using rpki-rtr-client.

skakee commented 3 years ago

I was hoping the updated version of gortr would fix this issue, but it did not. We are now running v0.14.6 and it happened again. I did save output.json from both octorpki and gortr. There were roughly 5,000 differences. I am not sure how long gortr had not been serving updated data (next time I will know; I added a check for that). Below is an example of the diff for one prefix.


< octorpki
> gortr

< {"prefix":"103.109.71.0/24","maxLength":24,"asn":"AS51928"}
---
> {"prefix":"103.109.71.0/24","maxLength":24,"asn":"AS137458"}
> {"prefix":"103.109.71.0/24","maxLength":24,"asn":"AS138570"}
skakee commented 3 years ago

An update: it happened again on one of the 8 RPKI servers. From what I see in the syslog, gortr stopped requesting updates from octorpki. Please see below; searching for "Updated added, new serial" shows the updates stopped after 08:09:49 and did not resume until after the restart at 13:51:55.


Oct 22 07:26:34 rpki01-hhc gortr[1018077]: time="2020-10-22T07:26:34Z" level=info msg="Updated added, new serial 43"
Oct 22 07:47:31 rpki01-hhc gortr[1018077]: time="2020-10-22T07:47:31Z" level=info msg="Updated added, new serial 44"
Oct 22 08:09:49 rpki01-hhc gortr[1018077]: time="2020-10-22T08:09:49Z" level=info msg="Updated added, new serial 45"
Oct 22 13:51:55 rpki01-hhc gortr[1084175]: time="2020-10-22T13:51:55Z" level=info msg="Updated added, new serial 0"
Oct 22 14:02:17 rpki01-hhc gortr[1084175]: time="2020-10-22T14:02:17Z" level=info msg="Updated added, new serial 1"
Oct 22 14:25:17 rpki01-hhc gortr[1084175]: time="2020-10-22T14:25:17Z" level=info msg="Updated added, new serial 2"
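
A check for this can be scripted along these lines. A rough Python sketch (the log path and the two-hour threshold are only examples) that flags gortr as stale when no "Updated added, new serial" line has been logged recently:

#!/usr/bin/env python3
# Rough sketch: warn when gortr has not logged a new serial for a while.
# The log path and the two-hour threshold are placeholders; adjust to taste.
import re
import sys
from datetime import datetime, timedelta, timezone
LOG = "/var/log/syslog"        # example path; use wherever gortr logs go
MAX_AGE = timedelta(hours=2)   # example threshold
pattern = re.compile(r'time="([^"]+)".*Updated added, new serial (\d+)')
last = None
with open(LOG) as f:
    for line in f:
        m = pattern.search(line)
        if m:
            ts = datetime.fromisoformat(m.group(1).replace("Z", "+00:00"))
            last = (ts, int(m.group(2)))
if last is None:
    sys.exit("CRITICAL: no serial updates found in the log")
age = datetime.now(timezone.utc) - last[0]
if age > MAX_AGE:
    sys.exit(f"CRITICAL: serial {last[1]} last updated {age} ago")
print(f"OK: serial {last[1]} updated {age} ago")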
skakee commented 3 years ago

It happened once again. Interestingly, it happened on the same server as previously, which defies probability. This server has more router connections than the others. Is that a clue?

lspgn commented 3 years ago

Are you setting the -useserial parameter? It almost looks like it restarted, according to the logs. It might indeed be related to the serial calculation and diffs. How many sessions are we talking about? I will see if I can reproduce. What's the version of OctoRPKI?

skakee commented 3 years ago

Q: Are you setting the -useserial parameter? A: gortr is running with these options: /opt/cloudflare/go/bin/gortr -metrics.addr :8281 -ssh.bind :8282 -ssh.key /opt/cloudflare/private.pem -ssh.method.password=true -ssh.auth.user rpki -ssh.auth.password REDACTED -verify=false -cache http://localhost:8080/output.json -refresh=120 -bind

C: It almost looks like it restarted according to the logs. A: I restarted gortr. This issue (gortr serving stale info) has happened several times and I have never seen it "recover". Restarting the process does address the issue.

Q: How many sessions are we talking about? A: Shame on me for not specifying it... 2,600 ssh sessions.

Q: What's the version of OctoRPKI? A: A binary downloaded in February 2020 (so it does not report a version with the --version option).

lspgn commented 3 years ago

Thank you for the feedback. Another question: are you tracking metrics into Prometheus?

skakee commented 3 years ago

We are not tracking the metrics in Prometheus. I was planning to start tracking them yesterday, but did not get to it. I wish I had, because it happened again. Curiously, again on the same server.

Tracking the metrics is the next step and I will get on that.

lukastribus commented 3 years ago

I'm currently testing a nagios-compatible monitoring script (checking for RTR serial updates), I will release it in a few days once testing is complete.

lspgn commented 3 years ago

I have been building a small tool that can monitor two servers (JSON, RTR, or both) and output a diff and Prometheus metrics. Once the Travis build is finished, it should appear on: https://github.com/cloudflare/gortr/releases/tag/v0.14.7-pre

You can run it like this:

$ docker run -ti -p 8080:8080 cloudflare/rtrmon -primary.host tcp://rtr.rpki.cloudflare.com:8282 -secondary.host https://rpki.cloudflare.com/rpki.json
$ curl --silent http://localhost:8080/diff.json | jq '.'
{
  "metadata-primary": {
    "last-fetch": 1603930794,
    "url": "tcp://rtr.rpki.cloudflare.com:8282",
    "serial": 635,
    "session-id": 40872,
    "count": 191018,
    "rtr-refresh": 3600,
    "rtr-retry": 600,
    "rtr-expire": 7200
  },
  "metadata-secondary": {
    "last-fetch": 1603930793,
    "url": "https://rpki.cloudflare.com/rpki.json",
    "serial": 0,
    "session-id": 0,
    "count": 191018,
    "rtr-refresh": 0,
    "rtr-retry": 0,
    "rtr-expire": 0
  },
  "only-primary": [],
  "only-secondary": []
}
$ curl http://localhost:8080/metrics
[...]
# TYPE rpki_roas gauge
rpki_roas{server="primary",type="diff",url="tcp://rtr.rpki.cloudflare.com:8282"} 0
rpki_roas{server="primary",type="total",url="tcp://rtr.rpki.cloudflare.com:8282"} 191018
rpki_roas{server="secondary",type="diff",url="tcp://rtr.rpki.cloudflare.com:8282"} 0
rpki_roas{server="secondary",type="total",url="tcp://rtr.rpki.cloudflare.com:8282"} 191018
# HELP rtr_serial Serial of the RTR session.
# TYPE rtr_serial gauge
rtr_serial{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 635
rtr_serial{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 0
# HELP rtr_session ID of the RTR session.
# TYPE rtr_session gauge
rtr_session{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 40872
rtr_session{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 0
# HELP rtr_state State of the RTR session (up/down).
# TYPE rtr_state gauge
rtr_state{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 1
# HELP update Timestamp of last update.
# TYPE update gauge
update{server="primary",url="tcp://rtr.rpki.cloudflare.com:8282"} 1.603930794e+09
update{server="secondary",url="https://rpki.cloudflare.com/rpki.json"} 1.603930793e+09

@skakee would you be able to give it a try?

skakee commented 3 years ago

So I am all for trying out rtrmon to troubleshoot the issue, but I would like to use it against a production setup running the recommended versions of the binaries. Here's why...

So yes, I can definitely give it a try. But before that, I'd like to get on the same page as to what version of octorpki and gortr we should be running. We don't use docker, we use go binaries.

Now... by default, octorpki used to bind to port 8080 and gortr also bound to port 8080 for HTTP, but we changed that at initial installation with the "-metrics.addr :8281" option. Then a change was made so that octorpki would bind to 8081 by default and not interfere with gortr's default port. This was no issue for us, since we were not using the default port 8080 for gortr. So, in our installation, octorpki uses port 8080 and gortr uses 8281 for HTTP.

I see that the current binary release of octorpki, v1.1.4, still binds by default to port 8080. Only the pre-release version, v1.2.0-pre, binds to port 8081. If v1.1.4 is "legacy", and only that version is causing the issue (maybe?), there is little value in troubleshooting "legacy" code.

Is 1.2.0 stable, and should it be made available as a general release?

Similarly, there is now a pre-release version of gortr available, v0.14.7-pre. Again, the same argument: if v0.14.6 is soon to be "legacy", is there value in troubleshooting it for this issue?

Which versions of the Linux binaries of gortr and octorpki are recommended for running in production?

Also, this issue is very intermittent. Since I first reported it, it has occurred 4 times on one of the 8 hosts and once on another. The two hosts are "sisters" in that they are at the same location and the same set of routers connects to them. That set of servers had 2,611 connections. Another three sites have sets of servers with 2,660, 774 and 825 connections. This issue has not occurred in the last two weeks at those locations.

The reason I am mentioning this is that a test with rtrmon may not reveal anything if I run it in dev with only a handful of connections.

So I am all for trying out rtrmon to troubleshoot the issue, but I would like to use it against a production setup running the recommended versions of the binaries.

lspgn commented 3 years ago

> So I am all for trying out rtrmon to troubleshoot the issue, but I would like to use it against a production setup running the recommended versions of the binaries. Here's why...

> So yes, I can definitely give it a try. But before that, I'd like to get on the same page as to what version of octorpki and gortr we should be running. We don't use docker, we use go binaries.

Understood. FWIW: packages and containers should use the same code, but they are compiled in different environments, which can have an influence in some cases.

> Now... by default, octorpki used to bind to port 8080 and gortr also bound to port 8080 for HTTP, but we changed that at initial installation with the "-metrics.addr :8281" option. Then a change was made so that octorpki would bind to 8081 by default and not interfere with gortr's default port. This was no issue for us, since we were not using the default port 8080 for gortr. So, in our installation, octorpki uses port 8080 and gortr uses 8281 for HTTP.

> I see that the current binary release of octorpki, v1.1.4, still binds by default to port 8080. Only the pre-release version, v1.2.0-pre, binds to port 8081. If v1.1.4 is "legacy", and only that version is causing the issue (maybe?), there is little value in troubleshooting "legacy" code.

Ah yes, I ended up messing up the default ports; I had received feedback from a few people wanting to run both pieces of software on the same machine and decided to fix it as part of v1.2.0. A few things changed with exploration and validation between 1.1.x and 1.2.x. We should consider v1.1.4 legacy and focus on 1.2.x. Although if the issue is on the RTR side, I doubt octorpki would be the cause.

> Is 1.2.0 stable, and should it be made available as a general release?

There have been a few additions since 1.2.0-pre. It took a while to roll out due to testing. I believe it is stable and I just added the tag for the release. It's in the queue for the Travis builds.

> Similarly, there is now a pre-release version of gortr available, v0.14.7-pre. Again, the same argument: if v0.14.6 is soon to be "legacy", is there value in troubleshooting it for this issue?

gortr, rtrdump and rtrmon are part of the same repository. The v0.14.7 version only introduces rtrmon and a library structure change. There haven't been any new fixes or features for gortr since v0.14.6. If this is affecting your current setup, I would keep troubleshooting v0.14.6.

> Which versions of the Linux binaries of gortr and octorpki are recommended for running in production?

I recommend using gortr v0.14.6 and octorpki v1.2.0. We tested the new version of the validator in our staging environment and it was recently rolled out in production (behind rpki.cloudflare.com/rpki.json).

> Also, this issue is very intermittent. Since I first reported it, it has occurred 4 times on one of the 8 hosts and once on another. The two hosts are "sisters" in that they are at the same location and the same set of routers connects to them. That set of servers had 2,611 connections. Another three sites have sets of servers with 2,660, 774 and 825 connections. This issue has not occurred in the last two weeks at those locations.

> The reason I am mentioning this is that a test with rtrmon may not reveal anything if I run it in dev with only a handful of connections.

> So I am all for trying out rtrmon to troubleshoot the issue, but I would like to use it against a production setup running the recommended versions of the binaries.

Thank you! Metrics collected regularly should definitely help the investigation.

skakee commented 3 years ago

This is a fantastic response. Thank you! I hope you understand I wanted to clear this up, as testing legacy stuff would serve neither of us. :)

I'll deploy octorpki v1.2.0 as soon as GA is available and start trying out rtrmon.

Thanks again.

skakee commented 3 years ago

I see that octorpki is available as of 16 minutes ago. Starting on the updates!

skakee commented 3 years ago

The plan was to update octorpki to v1.2.0 on all the RPKI servers and then gather stats with rtrmon. I never got to the second part. Updating octorpki to v1.2.0 caused gortr to keep disconnecting all the routers every couple of hours on all the servers. I don't know why. I have the logs from /var/log/messages. I had to roll back octorpki to v1.1.4 on all the servers. I will keep monitoring the situation.

lspgn commented 3 years ago

Did the routers show an error? It could be a wrong PDU. How is the JSON distributed? After the routers are disconnected: can they reconnect immediately? Could you dump the JSON periodically?

skakee commented 3 years ago

I'm having trouble with rtrmon using ssh. To verify my parameters were OK, I used the same parameters with rtrdump and it works fine. Please see below:

./rtrdump-v0.14.7-pre-linux-x86_64 -type ssh -file /tmp/rtrdump.$$ -connect $host:8282 -ssh.auth.user $user -ssh.auth.password $pass -ssh.method password
./rtrmon-v0.14.7-pre-linux-x86_64 -primary.host ssh://$host:8282 -primary.ssh.auth.user $user -primary.ssh.auth.password $pass --primary.ssh.method password

rtrmon returns:


INFO[0000] 1: Connected to server x.x.x.x:8282 via ssh. Fingerprint: SHA256:qxxxx
FATA[0000] ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain
lspgn commented 3 years ago

Could the --primary.ssh.method instead of -primary.ssh.method be causing the issue?

skakee commented 3 years ago

A good catch. But, alas, that did not fix it. The line is now:


./rtrmon-v0.14.7-pre-linux-x86_64 -primary.host ssh://$host:8282 -primary.ssh.auth.user $user -primary.ssh.auth.password $pass -primary.ssh.method password
lspgn commented 3 years ago

I just had a look at the code, I missed some parts that handle SSH. Will come up with a fix in a few. Thanks!

lspgn commented 3 years ago

Can you try v0.14.7-pre2? (It may take some time for Travis to build; I can provide you with a binary though.) It should be fixed in https://github.com/cloudflare/gortr/commit/2aa2e5e44cb2978d543d8f9b04851c124d2f01ce.

rtrmon-v0.14.7-pre2-linux-x86_64.zip

skakee commented 3 years ago

Thank you! rtrmon-v0.14.7-pre2-linux-x86_64 connected fine using ssh/password. I'll update octorpki on one of our 8 RPKI servers and set up periodic (10 min) monitoring of the current and updated servers for comparison. I have a good feeling we will nail the issue. :)

skakee commented 3 years ago

Yesterday, I upgraded octorpki to v1.2.0 on each server, one by one. There have been no issues at all, which is puzzling. Last Friday, if you recall, I was getting disconnects on all the servers a few hours after the upgrade (the same version), which caused me to roll back the upgrades.

It leads me to believe the issue last Friday was something unrelated and coincidental to the upgrade. I will keep monitoring the situation.

Also, octorpki v1.2.0 seems to have improved memory management over the previous version, judging by the early results.

lspgn commented 3 years ago

> Yesterday, I upgraded octorpki to v1.2.0 on each server, one by one. There have been no issues at all, which is puzzling. Last Friday, if you recall, I was getting disconnects on all the servers a few hours after the upgrade (the same version), which caused me to roll back the upgrades.

> It leads me to believe the issue last Friday was something unrelated and coincidental to the upgrade. I will keep monitoring the situation.

I have a preference for restarting the validator on a clean cache directory. That could be a cause.

> Also, octorpki v1.2.0 seems to have improved memory management over the previous version, judging by the early results.

I made a change to how it handles file paths, and it definitely improved the usage. Our internal validator shows around 2GB of RAM during validation and <300MB the rest of the time.

lspgn commented 3 years ago

@skakee are you still experiencing the issue or should we close this ticket?

lukastribus commented 3 years ago

> I'm currently testing a nagios-compatible monitoring script (checking for RTR serial updates), I will release it in a few days once testing is complete.

FYI this is here: https://github.com/lukastribus/rtrcheck