NLnetLabs / routinator

An RPKI Validator and RTR server written in Rust
https://nlnetlabs.nl/projects/routing/routinator/
BSD 3-Clause "New" or "Revised" License
470 stars 71 forks source link

rsync ignoring timeout setting, or something else? #367

Closed gkeller2 closed 3 years ago

gkeller2 commented 4 years ago

I've been noticing some "slow" runs of routinator, which I've blamed on rsync. As you can see in the following log capture, we start running rsync for rpkica.twnic.tw at 11:41:51, see a 7-min pause in the output, and see the rsync run complete 10 min later at 11:51:12 -- even though I've set rsync to timeout after 60 seconds:

routinator -vv --rsync-timeout=60 vrps --format csvext --complete --output /root/vrps.csv

Is rsync really taking 10 min, thus ignoring the timeout, or is there something else going on behind the curtain that I'm missing?

Let me know if you need any more information and I'll see what I can provide.

Log capture:

2020-08-14T11:41:51.675-04:00   rsync://rpkica.twnic.tw/rpki: Running command "rsync" "--contimeout=10" "--timeout=60" "-rltz" "--delete" "rsync://rpkica.twnic.tw/rpki/" "/root/.rpki-cache/repository/rsync/rpkica.twnic.tw/rpki/"

2020-08-14T11:41:58.436-04:00   RRDP https://rpki.cnnic.cn/rrdp/notify.xml: updating from snapshot.

2020-08-14T11:43:06.179-04:00   https://rpki.cnnic.cn/rrdp/d4427960-abb3-49e4-902c-d815a518b867/snapshot/739262.xml: I/O error: error decoding response body: operation timed out

2020-08-14T11:43:06.183-04:00   Marking RRDP server 'https://rpki.cnnic.cn/rrdp/notify.xml' as unusable

2020-08-14T11:43:06.183-04:00   rsyncing from rsync://rpki.cnnic.cn/rpki/.

2020-08-14T11:43:06.187-04:00   rsync://rpki.cnnic.cn/rpki: Running command "rsync" "--contimeout=10" "--timeout=60" "-rltz" "--delete" "rsync://rpki.cnnic.cn/rpki/" "/root/.rpki-cache/repository/rsync/rpki.cnnic.cn/rpki/"

2020-08-14T11:43:53.840-04:00   rsync://rpki.cnnic.cn/rpki: successfully completed.

2020-08-14T11:43:53.954-04:00   rsync://rpki.cnnic.cn/rpki/A9162E3D0000/577/Bnno7GYZdNImrE2NgOtA7VopH8Q.roa: processing failed.

2020-08-14T11:43:53.958-04:00   rsync://rpki.cnnic.cn/rpki/A9162E3D0000/577/HYSK6j9Pm4k6HObs3ehVIuOsemE.roa: processing failed.

2020-08-14T11:43:53.960-04:00   rsync://rpki.cnnic.cn/rpki/A9162E3D0000/577/JBqQNzyD7FX-A-kfxLTc29DLKUk.roa: processing failed.

2020-08-14T11:43:53.977-04:00   rsync://rpki.cnnic.cn/rpki/A9162E3D0000/2698/i6aybu3ysSm7Y9huPb4WO2Ck_nY.roa: processing failed.

2020-08-14T11:43:53.991-04:00   rsyncing from rsync://rpki-repository.nic.ad.jp/ap/.

2020-08-14T11:43:53.992-04:00   rsync://rpki-repository.nic.ad.jp/ap: Running command "rsync" "--contimeout=10" "--timeout=60" "-rltz" "--delete" "rsync://rpki-repository.nic.ad.jp/ap/" "/root/.rpki-cache/repository/rsync/rpki-repository.nic.ad.jp/ap/"

2020-08-14T11:43:58.090-04:00   rsync://rpki-repository.nic.ad.jp/ap: successfully completed.

 >>>>>> 7-min FREEZE

2020-08-14T11:51:12.012-04:00   rsync://rpkica.twnic.tw/rpki: successfully completed.
ties commented 4 years ago

We've seen (from the repository side) rsyncd jgnoring the timeout because data was still trickling in (even though they were NOOP packets). I guess rsync may do the same in client mode.

If you want rsync to stop after a fixed amount of time you need to kill the process. From your description it seems like routinator is not killing the rsync process after timeout seconds. Instead, it likely provides the timeout argument to rsync which works differently.

Can you make a pcap of what is happening?

partim commented 4 years ago

Routinator indeed just sets the --timeout option.

Interestingly, the same happens when TWNIC is fetched via RRDP – we have a timeout on the HTTP client, too, and it gets ignored.

partim commented 4 years ago

Looking at the rsync manpage, it states that --timeout does ‘set I/O timeout in seconds.’ This indeed sound a bit like a timeout on each I/O call rather than on the whole process.

I’ll wrap the process in a timeout in the next release.

gkeller2 commented 4 years ago

Let me know if you want me to give that release some extended testing.

I wish we could stop using rsync soon...

partim commented 4 years ago

Not sure why your Routinator uses rsync for TWNIC. They support RRDP. But maybe that actually does timeout and Routinator falls back to rsync. It does that if (and only if) RRDP never succeeded.

Just for posterity: Does the log say something about TWNIC and RRDP before trying rsync?

gkeller2 commented 4 years ago

Yes. Often, the server fails with RRDP:

2020-08-14T11:41:47.766-04:00   RRDP https://rrdp.twnic.tw/rrdp/notify.xml: Updating server

2020-08-14T11:41:51.478-04:00   RRDP https://rrdp.twnic.tw/rrdp/notify.xml: updating from snapshot.

2020-08-14T11:41:51.675-04:00   https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/snapshot/1179586.xml: HTTP status client error (404 Not Found) for url (https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/snapshot/1179586.xml)

2020-08-14T11:41:51.675-04:00   Marking RRDP server 'https://rrdp.twnic.tw/rrdp/notify.xml' as unusable

And another server that often fails is rpki.cnnic.cn:

2020-08-14T11:43:06.179-04:00   https://rpki.cnnic.cn/rrdp/d4427960-abb3-49e4-902c-d815a518b867/snapshot/739262.xml:  I/O error: error decoding response body: operation timed out

2020-08-14T11:43:06.183-04:00   Marking RRDP server 'https://rpki.cnnic.cn/rrdp/notify.xml' as unusable
AlexanderBand commented 4 years ago

FWIW, I have contacted APNIC at least a dozen times about the intermittent connectivity issues with TWNIC. They get in touch with their local contacts there, which usually leads to a temporary improvement. Still hoping for a more permanent solution to this situation.

mahtin commented 4 years ago

Adding to the thread ... TWNIC has two IP addresses for rrdp.twnic.tw (210.201.80.190 and 203.119.3.5). If you measure the transfer speed on these two IP's you clearly see an issue - see script below.

The server at 203.119.3.5 is clearly the fast one. Or better said, the 210.201.80.190 server is painfully and consistently slow.

This is clearly a TWNIC issue and I've sent email to them today.

However, if you want a quick fix for today (and it will come back to bite you tomorrow), you can place 203.119.3.5 rrdp.twnic.tw in your /etc/hosts file.

$ mkdir /tmp/twnic
$ cd /tmp/twnic
$ ip_list=`host rrdp.twnic.tw | awk '{print $4}'` ; for count in 1 2 3 4 5; do for ip in $ip_list ; do echo $ip ; curl -O --resolve "rrdp.twnic.tw:443:$ip" "https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/deltas/1179613.xml" ; ls -l 1179613.xml; done ; done
210.201.80.190
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   7184      0  0:01:38  0:01:38 --:--:-- 13652
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:43 1179613.xml
203.119.3.5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   280k      0  0:00:02  0:00:02 --:--:--  280k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:43 1179613.xml
210.201.80.190
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   6110      0  0:01:55  0:01:55 --:--:-- 16568
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:45 1179613.xml
203.119.3.5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   323k      0  0:00:02  0:00:02 --:--:--  323k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:45 1179613.xml
210.201.80.190
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   6807      0  0:01:43  0:01:43 --:--:--  8603
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:46 1179613.xml
203.119.3.5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   144k      0  0:00:04  0:00:04 --:--:--  166k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:46 1179613.xml
210.201.80.190
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   7167      0  0:01:38  0:01:38 --:--:-- 11857
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:48 1179613.xml
203.119.3.5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   159k      0  0:00:04  0:00:04 --:--:--  170k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:48 1179613.xml
210.201.80.190
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   5449      0  0:02:09  0:02:09 --:--:--  4615
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:50 1179613.xml
203.119.3.5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  690k  100  690k    0     0   210k      0  0:00:03  0:00:03 --:--:--  210k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:50 1179613.xml
$
AlexanderBand commented 4 years ago

I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshot and delta files. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1.

I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute.

If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 delta files on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here.

(For reference, the RIPE NCC repository, the largest and busiest by far, has 188 deltas in the same time frame.)

gkeller2 commented 4 years ago

In the past 7 to 10 days, both servers have "recovered" from the issues they were having. AFAICT, RRDP sync ups have stopped failing, therefore rsync sync ups weren't necessary -- the latter were the ones introducing the delays.

On Wed, Sep 2, 2020, 8:03 AM Alex Band notifications@github.com wrote:

I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshots and deltas. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1.

I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute.

If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 deltas, in order, on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NLnetLabs/routinator/issues/367#issuecomment-685685923, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWWHCZZOATQPDLFALVXLUDSDYX7PANCNFSM4P7VMHDQ .

Ching-Heng commented 4 years ago

Thanks for Gaston Keller's testing.

In the past 7 to 10 days, both servers have "recovered" from the issues they were having. AFAICT, RRDP sync ups have stopped failing, therefore rsync sync ups weren't necessary -- the latter were the ones introducing the delays. On Wed, Sep 2, 2020, 8:03 AM Alex Band @.**> wrote: I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshots and deltas. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1. I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute*. If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 deltas, in order, on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#367 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWWHCZZOATQPDLFALVXLUDSDYX7PANCNFSM4P7VMHDQ .

partim commented 4 years ago

Here is a question I am pondering: Would we rather have single synchronisation runs take forever but get all data or do we prefer to stop early but loose data?