alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

rsync error when synchronising mirrors #286

Closed by jemrobinson 5 years ago

jemrobinson commented 5 years ago

rsync failed twice when pushing updates from ExternalPyPI to InternalPyPI.

Is it possible to catch and recover from these types of error?

NB. the PyPI mirror is much bigger than the CRAN one, so errors are more likely to affect it
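One way to catch and recover from a failed push is to wrap the rsync call in a retry loop that checks its exit status. The following is a minimal sketch, not the project's actual mirror script: the `retry` function and the `MAX_ATTEMPTS` and `RETRY_DELAY` variables are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper (not part of data-safe-haven): run a command and
# retry it if it exits with a non-zero status.
# MAX_ATTEMPTS and RETRY_DELAY are illustrative variables, assumed here.
retry() {
    local attempt=1
    local max_attempts="${MAX_ATTEMPTS:-5}"
    local delay="${RETRY_DELAY:-60}"
    local status=0
    while :; do
        status=0
        # Capture the exit status without aborting under "set -e"
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts (exit status $status)" >&2
            return "$status"
        fi
        echo "retry: attempt $attempt failed (exit status $status); retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example invocation (commented out; substitute the real options and paths):
# retry rsync <options> <source> <destination>
```

Because rsync only transfers files that differ, a repeated attempt should mostly pick up where the failed one stopped rather than starting from scratch.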

jemrobinson commented 5 years ago

Aside - testing rsync transfer times

Results

sending incremental file list
10M.txt
     10,485,760 100%   26.10MB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 10,491,816 bytes  received 35 bytes  1,907,609.27 bytes/sec
total size is 10,485,760  speedup is 1.00

real    0m5.234s
user    0m0.354s
sys     0m0.027s

Summary

Size     Time
10MB     5.2s
100MB    6.5s
1GB      1m
10GB     10m
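The summary can be turned into effective throughput figures to make the pattern easier to see. This is a rough sketch that assumes decimal units (1GB = 1000MB) and that "1m" and "10m" mean 60s and 600s; the `throughput` function is a name made up for this example.

```shell
# Effective throughput in MB/s for a transfer.
# Assumption: sizes are in decimal MB and times are wall-clock seconds.
throughput() {
    # $1 = size in MB, $2 = elapsed time in seconds
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

# One line per row of the summary table
while read -r size secs; do
    echo "${size}MB in ${secs}s: $(throughput "$size" "$secs") MB/s"
done <<'EOF'
10 5.2
100 6.5
1000 60
10000 600
EOF
```

The 10MB transfer took 5.2s of wall-clock time even though rsync itself reported 26.10MB/s, which suggests that a fixed start-up cost of a few seconds dominates small transfers, while the larger transfers settle at roughly the same rate.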

What about lots of small files?

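# Create 1000 files of 1MB each, filled with random data
for i in $(seq 1 1000); do
    dd if=/dev/urandom of="$i.txt" bs=1M count=1
done
# Time the transfer of all the test files to the remote host
time rsync -prtlzv --delete --progress /datadrive/mirrordaemon/transfer_tests/* mirrordaemon@10.1.0.20:/datadrive/mirrordaemon/transfer_tests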

Result

sent 1,049,224,326 bytes  received 633,452 bytes  29,573,458.54 bytes/sec
total size is 1,048,576,000  speedup is 1.00

real    0m35.480s
user    0m35.187s
sys     0m3.733s

Are we network-bound or CPU/disk bound?

Tasks: 140 total,   3 running,  70 sleeping,   0 stopped,   0 zombie
%Cpu(s): 21.7 us,  0.8 sy,  0.0 ni, 77.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  8166660 total,   179232 free,   367624 used,  7619804 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7484420 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  866 mirrord+  20   0   14428   2848   2132 R  76.1  0.0   0:18.98 rsync
  867 mirrord+  20   0   47060   5460   4788 R  15.3  0.1   0:03.03 ssh

Tasks: 141 total,   2 running,  71 sleeping,   0 stopped,   0 zombie
%Cpu(s): 28.5 us,  1.1 sy,  0.0 ni, 69.8 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem :  8166660 total,   166624 free,   371784 used,  7628252 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7480260 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  756 mirrord+  20   0   14428   2936   2236 R 100.0  0.0   0:10.04 rsync
  757 mirrord+  20   0   49320   8260   5196 S  18.1  0.1   0:01.92 ssh
  762 mirrord+  20   0   44532   3968   3356 R   2.9  0.0   0:00.19 top

Switch to a larger machine

Tasks: 697 total,   1 running, 346 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  0.0 sy,  0.0 ni, 99.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 14852920+total, 69836416 free,  2755740 used, 75937040 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14461584+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3499 mirrord+  20   0   14428   3020   2312 S  37.4  0.0   0:10.77 rsync
 3500 mirrord+  20   0   47200   6184   5288 S   5.0  0.0   0:01.46 ssh

Don't compress files 🥇
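In terms of the test command used earlier in this thread, not compressing presumably means dropping the `-z` flag. The sketch below wraps that variant in a function; `sync_uncompressed` is a hypothetical name, and the remaining flags are copied from the earlier command.

```shell
# Hypothetical wrapper: the same flags as the test command in this thread,
# but without -z, so rsync does not spend CPU compressing data in transit.
sync_uncompressed() {
    # All arguments are passed through: one or more sources, then the destination
    rsync -prtlv --delete --progress "$@"
}

# Example (commented out; paths and host are those used in the test above):
# sync_uncompressed /datadrive/mirrordaemon/transfer_tests/* mirrordaemon@10.1.0.20:/datadrive/mirrordaemon/transfer_tests
```

In the test above the bytes sent slightly exceeded the total file size, so compression gained nothing on the random test data while the `top` output shows rsync using most of a CPU core. Package archives are typically already compressed, so a similar effect seems likely for the real mirror, though that is an assumption rather than something measured here.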

rsync limitations

From here

rsync will only copy one chunk of data at a time.

martintoreilly commented 5 years ago

@jemrobinson How confident are we that the bandersnatch synchronisation from the public PyPI site to the external mirror is complete?

If it is more reliable than rsync, then could we also use it for our external->internal mirror sync? I note that bandersnatch also supports a blacklist and whitelist, which could be great for our Tier 3 mirrors.

However, bandersnatch looks like it is designed to be run from the internal mirror in our arrangement, which we would not want.

jemrobinson commented 5 years ago

I think one way to incorporate bandersnatch would be to populate the Tier 3 external mirror from the Tier 2 mirror using bandersnatch's whitelist, while keeping the push to the internal mirror the same. This would require a Tier 2 mirror (or perhaps we could call this a "full mirror") to be available whenever there's Tier 3 data, which might be overkill.

martintoreilly commented 5 years ago

I like this idea a lot 👍

martintoreilly commented 5 years ago

I think it works for the likely majority use case where one SHM supports a range of DSGs.