Thomas-Moore-Creative / CSIRO-NCI-data-best-practice

open repo to help us formulate best practice techniques and codes for data management between CSIRO and NCI systems
GNU General Public License v3.0

Flagging and recovering from `parallel rsync` failures #3

Open Thomas-Moore-Creative opened 4 years ago

Thomas-Moore-Creative commented 4 years ago

Flagging and recovering from parallel rsync failures?

Overview

The approach of using parallel rsync to transfer large datasets from NCI to CSIRO has yielded speeds at least an order of magnitude greater than we had previously experienced.

However, the method relies on numerous streams, each with its own rsync process that can fail.

How can we reliably alert the user to these failures and then recover from them, keeping in mind that, for this use case, the transferred files are also being moved on to tape?

Example: a 97 file, 11TB transfer with failures

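command: time cat /datastore/d/dcfp/NCI_file_lists/cut_f6_2012_filelist.txt | parallel -j 10 --results /datastore/d/dcfp/logs/ 'rsync -ailPW --log-file="/datastore/d/dcfp/logs/f6_2012_rsync.log.$(date +%Y%m%d%H%m%S)" -e "ssh -T -c aes128-ctr" $USER@gadi-dm.nci.org.au:/scratch/v14/$USER/tar_tmp/f6.WIP.c5-d60-pX-f6-20121101.20200831_153624/{} /datastore/d/dcfp/CAFE/forecasts/f6/'

NB: in the --log-file name above, the second %m expands to the month, not the minute (that would be %M). This is why the rsync log names further down read ...160905 and ...160948 (hour 16, month 09, seconds) rather than carrying a full timestamp. Using date +%Y%m%d%H%M%S would fix this.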

--results /datastore/d/dcfp/logs/ saves each job's output in a directory tree of log files (one directory per input argument, holding seq, stderr and stdout), as described in the GNU parallel docs: https://www.gnu.org/software/parallel/

cd /datastore/d/dcfp/logs/1
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20111101.top_level.20200831_165650.tar> ls
seq  stderr  stdout
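
Given this layout, one way to surface suspect jobs is to list every argument whose stderr file is non-empty. The following is a minimal sketch, not something run as part of this transfer. The function name jobs_with_stderr is hypothetical, and it assumes a successful rsync writes nothing to stderr, which may not hold if rsync or ssh emit warnings.

```shell
#!/bin/sh
# Sketch: list job arguments whose stderr is non-empty in a GNU parallel
# --results tree laid out as <results_dir>/1/<argument>/{seq,stderr,stdout}.
# jobs_with_stderr is a hypothetical helper name, not part of GNU parallel.
jobs_with_stderr() {
    results_dir=$1
    # -size +0c matches files larger than zero bytes, i.e. non-empty stderr files.
    find "$results_dir" -type f -name stderr -size +0c |
        while IFS= read -r f; do
            # The parent directory name is the argument (file name) passed to the job.
            basename "$(dirname "$f")"
        done | sort
}
```

Usage would be something like jobs_with_stderr /datastore/d/dcfp/logs/ once the transfer has finished. A non-empty stderr is only a hint that a job needs a closer look, not proof of failure.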

How do we know there's been a failure?

We happen to see it in the command line output (this is not robust):
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem079.20200831_153624.tar
120,233,226,240 100%   41.75MB/s    0:45:46 (xfr#1, to-chk=0/1)
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
receiving incremental file list
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem082.20200831_153624.tar
120,198,225,920 100%   30.29MB/s    1:03:04 (xfr#1, to-chk=0/1)
receiving incremental file list
After the fact, we compare the destination against the source file list and find differences:
/datastore/d/dcfp/checks> find /datastore/d/dcfp/CAFE/forecasts/f6/*2012*.tar -type f > check_2012.txt
cut -c 37- check_2012.txt > cut_check_2012.txt
sed -i -e 's#^#./#' cut_check_2012.txt
diff cut_check_2012.txt ../NCI_file_lists/f6_2012_filelist.txt

51a52
> ./f6.WIP.c5-d60-pX-f6-20121101.mem052.20200831_153624.tar
61a63
> ./f6.WIP.c5-d60-pX-f6-20121101.mem063.20200831_153624.tar
82a85,86
> ./f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar
> ./f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar
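
The comparison above depends on a hard-coded column offset (cut -c 37-) and on both lists being in the same order. A sketch of a check that avoids both by testing each entry of the file list directly is given here. The function name missing_at_destination and the output file retry_2012.txt are hypothetical. Note that it only tests for existence: because the transfer uses -P (which implies --partial), an interrupted file may be present but incomplete, so a size or checksum comparison would still be needed to be confident.

```shell
#!/bin/sh
# Sketch: print every entry of FILELIST that is not a regular file in DEST_DIR.
# FILELIST holds one file name per line, optionally prefixed with "./".
# missing_at_destination is a hypothetical helper name.
missing_at_destination() {
    filelist=$1
    dest_dir=$2
    while IFS= read -r name; do
        name=${name#./}              # drop a leading "./" if present
        [ -n "$name" ] || continue   # skip blank lines
        # Existence check only; this does not verify size or content.
        [ -f "$dest_dir/$name" ] || printf '%s\n' "$name"
    done < "$filelist"
}
```

For example, missing_at_destination ../NCI_file_lists/f6_2012_filelist.txt /datastore/d/dcfp/CAFE/forecasts/f6 > retry_2012.txt would produce a list that could be fed back into the same parallel rsync command.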
We also capture it in some of the many stderr and rsync log files:
/datastore/d/dcfp/logs> grep -rnw '/datastore/d/dcfp/logs/' -e 'error'
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160905:2:2020/09/14 16:07:51 [3334036] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160948:2:2020/09/14 16:07:51 [3339499] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
NB: mem052 and mem063 are missing at the destination, yet neither appears in the output of grep -rnw '/datastore/d/dcfp/logs/' -e 'error' above. Why?
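
Since grepping for 'error' evidently misses some failures, a possibly more robust signal is each job's exit status. GNU parallel can record this with --joblog, which writes a tab-separated file with the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal and Command. The sketch below has not been tested against this transfer. It assumes the job log is written to a path such as /datastore/d/dcfp/logs/f6_2012.joblog (a hypothetical name) by adding --joblog to the parallel command, and failed_jobs is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: report failed jobs from a GNU parallel job log.
# Assumes the transfer was started with an extra option such as
#   parallel -j 10 --joblog /datastore/d/dcfp/logs/f6_2012.joblog --results ...
# where the joblog path is a hypothetical example.
# failed_jobs is a hypothetical helper name.
failed_jobs() {
    joblog=$1
    # Skip the header line; column 7 is Exitval, column 8 is Signal,
    # column 9 is the command that was run.
    awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$joblog"
}
```

Any output from failed_jobs means at least one rsync exited non-zero or was killed by a signal, which should include cases that leave nothing matching 'error' in the logs. For recovery, the GNU parallel manual also describes --retries (retry a failing job a given number of times) and --retry-failed (re-run the failed jobs recorded in a joblog). Both look worth testing here, given that rsync can resume partial files.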

ToDo: