Closed gkeller2 closed 3 years ago
We've seen (from the repository side) rsyncd jgnoring the timeout because data was still trickling in (even though they were NOOP packets). I guess rsync may do the same in client mode.
If you want rsync to stop after a fixed amount of time you need to kill the process. From your description it seems like routinator is not killing the rsync process after timeout seconds. Instead, it likely provides the timeout argument to rsync which works differently.
Can you make a pcap of what is happening?
Routinator indeed just sets the --timeout
option.
Interestingly, the same happens when TWNIC is fetched via RRDP – we have a timeout on the HTTP client, too, and it gets ignored.
Looking at the rsync manpage, it states that --timeout
does ‘set I/O timeout in seconds.’ This indeed sound a bit like a timeout on each I/O call rather than on the whole process.
I’ll wrap the process in a timeout in the next release.
Let me know if you want me to give that release some extended testing.
I wish we could stop using rsync soon...
Not sure why your Routinator uses rsync for TWNIC. They support RRDP. But maybe that actually does timeout and Routinator falls back to rsync. It does that if (and only if) RRDP never succeeded.
Just for posterity: Does the log say something about TWNIC and RRDP before trying rsync?
Yes. Often, the server fails with RRDP:
2020-08-14T11:41:47.766-04:00 RRDP https://rrdp.twnic.tw/rrdp/notify.xml: Updating server
2020-08-14T11:41:51.478-04:00 RRDP https://rrdp.twnic.tw/rrdp/notify.xml: updating from snapshot.
2020-08-14T11:41:51.675-04:00 https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/snapshot/1179586.xml: HTTP status client error (404 Not Found) for url (https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/snapshot/1179586.xml)
2020-08-14T11:41:51.675-04:00 Marking RRDP server 'https://rrdp.twnic.tw/rrdp/notify.xml' as unusable
And another server that often fails is rpki.cnnic.cn
:
2020-08-14T11:43:06.179-04:00 https://rpki.cnnic.cn/rrdp/d4427960-abb3-49e4-902c-d815a518b867/snapshot/739262.xml: I/O error: error decoding response body: operation timed out
2020-08-14T11:43:06.183-04:00 Marking RRDP server 'https://rpki.cnnic.cn/rrdp/notify.xml' as unusable
FWIW, I have contacted APNIC at least a dozen times about the intermittent connectivity issues with TWNIC. They get in touch with their local contacts there, which usually leads to a temporary improvement. Still hoping for a more permanent solution to this situation.
Adding to the thread ... TWNIC has two IP addresses for rrdp.twnic.tw
(210.201.80.190
and 203.119.3.5
). If you measure the transfer speed on these two IP's you clearly see an issue - see script below.
The server at 203.119.3.5
is clearly the fast one. Or better said, the 210.201.80.190
server is painfully and consistently slow.
This is clearly a TWNIC issue and I've sent email to them today.
However, if you want a quick fix for today (and it will come back to bite you tomorrow), you can place 203.119.3.5 rrdp.twnic.tw
in your /etc/hosts
file.
$ mkdir /tmp/twnic
$ cd /tmp/twnic
$ ip_list=`host rrdp.twnic.tw | awk '{print $4}'` ; for count in 1 2 3 4 5; do for ip in $ip_list ; do echo $ip ; curl -O --resolve "rrdp.twnic.tw:443:$ip" "https://rrdp.twnic.tw/rrdp/71d39673-7062-4036-b080-957aa1b854b3/deltas/1179613.xml" ; ls -l 1179613.xml; done ; done
210.201.80.190
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 7184 0 0:01:38 0:01:38 --:--:-- 13652
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:43 1179613.xml
203.119.3.5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 280k 0 0:00:02 0:00:02 --:--:-- 280k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:43 1179613.xml
210.201.80.190
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 6110 0 0:01:55 0:01:55 --:--:-- 16568
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:45 1179613.xml
203.119.3.5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 323k 0 0:00:02 0:00:02 --:--:-- 323k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:45 1179613.xml
210.201.80.190
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 6807 0 0:01:43 0:01:43 --:--:-- 8603
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:46 1179613.xml
203.119.3.5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 144k 0 0:00:04 0:00:04 --:--:-- 166k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:46 1179613.xml
210.201.80.190
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 7167 0 0:01:38 0:01:38 --:--:-- 11857
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:48 1179613.xml
203.119.3.5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 159k 0 0:00:04 0:00:04 --:--:-- 170k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:48 1179613.xml
210.201.80.190
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 5449 0 0:02:09 0:02:09 --:--:-- 4615
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:50 1179613.xml
203.119.3.5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690k 100 690k 0 0 210k 0 0:00:03 0:00:03 --:--:-- 210k
-rw-rw-r-- 1 martin martin 706991 Aug 14 18:50 1179613.xml
$
I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshot and delta files. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1.
I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute.
If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 delta files on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here.
(For reference, the RIPE NCC repository, the largest and busiest by far, has 188 deltas in the same time frame.)
In the past 7 to 10 days, both servers have "recovered" from the issues they were having. AFAICT, RRDP sync ups have stopped failing, therefore rsync sync ups weren't necessary -- the latter were the ones introducing the delays.
On Wed, Sep 2, 2020, 8:03 AM Alex Band notifications@github.com wrote:
I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshots and deltas. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1.
I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute.
If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 deltas, in order, on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NLnetLabs/routinator/issues/367#issuecomment-685685923, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWWHCZZOATQPDLFALVXLUDSDYX7PANCNFSM4P7VMHDQ .
Thanks for Gaston Keller's testing.
In the past 7 to 10 days, both servers have "recovered" from the issues they were having. AFAICT, RRDP sync ups have stopped failing, therefore rsync sync ups weren't necessary -- the latter were the ones introducing the delays. … On Wed, Sep 2, 2020, 8:03 AM Alex Band @.**> wrote: I investigated the TWNIC connection issues some more after noticing their RRDP serial number is incredibly high. Publishing RPKI data with RRDP uses snapshots and deltas. Every time an object is updated or resigned, a new delta is created and a serial number is increased by 1. I did two validation runs: at 07:32:36 hrs the TWNIC serial was 1210467 and at 10:47:00 hrs it was 1211975. That's 1508 new delta files in 195 minutes, an average of 7.7 new delta files every minute*. If Relying Party Software is set to a 15 minute interval, it needs to fetch about a 100 deltas, in order, on a slow connection. This also explains why the timeout on RRDP doesn’t seem to work as it applies to individual HTTP requests. This may explain some of the issues we're seeing here. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#367 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWWHCZZOATQPDLFALVXLUDSDYX7PANCNFSM4P7VMHDQ .
Here is a question I am pondering: Would we rather have single synchronisation runs take forever but get all data or do we prefer to stop early but loose data?
I've been noticing some "slow" runs of routinator, which I've blamed on rsync. As you can see in the following log capture, we start running rsync for
rpkica.twnic.tw
at11:41:51
, see a 7-min pause in the output, and see the rsync run complete 10 min later at11:51:12
-- even though I've set rsync to timeout after 60 seconds:routinator -vv --rsync-timeout=60 vrps --format csvext --complete --output /root/vrps.csv
Is rsync really taking 10 min, thus ignoring the timeout, or is there something else going on behind the curtain that I'm missing?
Let me know if you need any more information and I'll see what I can provide.
Log capture: