Thomas-Moore-Creative opened this issue 4 years ago
Suggestions of things to try (noting that if you solve the network issue then the `/datastore` disk quota limit -> tape migration will be a huge bottleneck):

```
$ time bbcp -z -Z 10050:10100 -P 4 -r -V -w 1024m -s 20 -S 'ssh -x -a -oFallBackToRsh=no %I -l %U %H /apps/bbcp/15.02.03.01.1/bin/bbcp' ${NCI_USER}@${RAIJIN}:/path/to/remote/data/ /path/to/local/data/
```

(`bbcp` is available as a module on Gadi but looks to be a slightly different version)

```
$ globus-url-copy -tcp-bs 16M -bs 16M -p 4 -vb -r sshftp://${PAWSEY_USER}@${PAWSEY_DATA}/path/to/data/ file:///path/to/local/destination/
```
Also I'd suggest putting a `dmput` somewhere in your transfer script to try to get it off disk and onto tape ASAP, and be nice to other tape store users, probably?
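A minimal sketch of what that might look like, assuming a pull-style `rsync` from gadi-dm into `/datastore` (the paths, username, and `.tar.xz` glob here are placeholders, not from the actual transfer script):

```bash
#!/bin/bash
# Hypothetical transfer step: pull one collection from NCI, then
# immediately queue the new files for tape migration with dmput so
# they stop occupying the limited front-end disk.
set -euo pipefail

SRC='<user>@gadi-dm.nci.org.au:/scratch/v14/<user>/tar_tmp/some_collection/'  # placeholder
DEST='/datastore/d/dcfp/CAFE/forecasts/f6/some_collection/'                   # placeholder

rsync -avPS "${SRC}" "${DEST}"

# Only reached if rsync exited 0 (set -e): ask DMF to migrate the files.
# dmput -r would additionally free the disk blocks once they are on tape.
dmput "${DEST}"*.tar.xz
```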
@hot007 : I assume above you are talking about running a `bbcp` from the NCI side? If so, how have you triggered a `dmput` over on Ruby on the CSIRO side once a specific copy is finished?
Well, you have to trigger it from the CSIRO side, as you have to pull to CSIRO, not push from NCI. So I suppose you could break your copies up and `; dmput *` at the end of each. Alternatively, I suppose a cron job to check for new data and `dmput` it, which you then disable once the transfers are done (so you can in due course pull the data back to disk again!).
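One way to read the "break your copies up" suggestion, sketched with a hypothetical collection list (only the directory name from the example `rsync` further down is real; everything else is a placeholder):

```bash
#!/bin/bash
# Hypothetical chunked pull: copy one collection at a time and migrate
# it to tape before starting the next, so the 15TB front-end disk on
# /datastore never has to hold more than one chunk.
set -euo pipefail

DEST=/datastore/d/dcfp/CAFE/forecasts/f6   # placeholder destination

for collection in f6.WIP.c5-d60-pX-f6-20181101.20200820_174610; do   # list each chunk here
    rsync -avPS "<user>@gadi-dm.nci.org.au:/scratch/v14/<user>/tar_tmp/${collection}" "${DEST}/"
    ( cd "${DEST}/${collection}" && dmput * )   # the "; dmput *" at the end of each copy
done
```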
OK! I misread your `bbcp` code above - you are running this command from the CSIRO side? I assume this also means one can't use the power of `#PBS -q copyq` (outlined here: https://opus.nci.org.au/display/Help/bbcp) to move data from Gadi to CSIRO machines? I don't get why the `#PBS -q copyq` option would be shown in the documents as

```
bbcp -z -P 2 -s 16 -w 4m -S "bbcp" -T "ssh -x -a -oFallBackToRsh=no %I -l %U %H /some/other/place/bin/bbcp" somefiles remoteuser@remotehost.edu:someplace/
```

if you couldn't "push from NCI"? But maybe I just need moar coffeeee?
Well, `copyq` just gives you a 10hr job limit on your access to gadi-dm ;-) But yes, that code was run FROM Pearcey to (then) raijin-dm. In general you can push from NCI, but you can't push to CSIRO - our machines aren't visible from NCI, so like `rsync` this has to be originated on our side. So in our case we have to specify the path to NCI's `bbcp` but not ours.
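For pushes from Gadi to a host that *is* visible from NCI, a `copyq` job wrapping the documented `bbcp` command might look roughly like this sketch (the project code, resource requests, and destination are placeholder assumptions, not from the thread):

```bash
#!/bin/bash
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l walltime=10:00:00        # copyq's 10hr limit mentioned above
#PBS -l mem=4GB                  # placeholder resource request
#PBS -P v14                      # placeholder project code
#PBS -l storage=scratch/v14      # placeholder storage declaration

# Push from Gadi; this cannot reach CSIRO machines, since they are not
# visible from NCI - hence the DCFP transfers must be pulled instead.
bbcp -z -P 2 -s 16 -w 4m -S "bbcp" \
  -T "ssh -x -a -oFallBackToRsh=no %I -l %U %H /some/other/place/bin/bbcp" \
  somefiles remoteuser@remotehost.edu:someplace/
```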
**Initial use case**

Archive Decadal Climate Forecasting Project (DCFP) NCI Australasian Leadership Computing Grants (ALCG) raw netcdf to tape at CSIRO.

The CSIRO DCFP is currently working under a recent NCI ALCG merit allocation - https://research.csiro.au/dfp/dcfp-awarded-key-computation-by-the-nci/ . Current storage limitations mean that as the effort proceeds, large collections of files will need to be archived constantly back to tape at CSIRO. If data transfer is not fast enough, work will stop at NCI due to lack of storage resources.
The current task is to move 8 x 11TB collections of tarfiles, where each 11TB collection is a directory of ~112GB tarfiles. 112GB is close to the recommended 100GB file size for the CSIRO tape system and is also the size of each model "member", keeping the essential structure of each model run.
**Example**

`rsync` command used:

```
ruby:/datastore/d/dcfp/CAFE/forecasts/f6> rsync -avPSW <user>@gadi-dm.nci.org.au:/scratch/v14/<user>/tar_tmp/f6.WIP.c5-d60-pX-f6-20181101.20200820_174610 /datastore/d/dcfp/CAFE/forecasts/f6/.
```

NB: more recent advice from NCI recommends:

```
rsync -avPS -e "ssh -T -c arcfour -o Compression=no -x" <username>@gadi-dm.nci.org.au:</path/to/source> <dest>
```
**Issues:**

- `rsync` transfer speeds, ordinarily under 20MB/s, mean that 90TB of data would take over 50 days to transfer in serial.
- Parallelising `rsync` is possible, but the "front-end" spinning disk on CSIRO `/datastore` is only 15TB.
- Using `screen` to run `rsync` commands invariably results in regular `broken pipe` disconnects, and a potential problem arises when the `rsync` is restarted and some of the previously transferred files have already been moved to tape: `rsync` may start the transfer over again for these files!!!
- Need a way to confirm that each `rsync` is DONE SUCCESSFULLY and has been `dmput` to tape on `/datastore` (see the verification sketch after this list).
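One possible shape for that confirmation step, assuming the DMF tools (`dmls`) are available on the `/datastore` host; the exact `dmls -l` output format (and therefore the field parsing below) is an assumption that may differ between DMF versions:

```bash
#!/bin/bash
# Hypothetical post-transfer check: after rsync reports success, list
# any tarfile DMF has not yet written to tape. A file is safe once its
# state is DUL (dual-state: disk + tape) or OFL (offline: tape only).
DEST=/datastore/d/dcfp/CAFE/forecasts/f6/some_collection   # placeholder

cd "${DEST}" || exit 1
for f in *.tar.xz; do
    # dmls -l prints the DMF state in parentheses next to each filename
    state=$(dmls -l "$f" | awk '{print $(NF-1)}' | tr -d '()')
    if [[ "$state" != "DUL" && "$state" != "OFL" ]]; then
        echo "NOT YET ON TAPE: $f (state: ${state})"
    fi
done
```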
**Solutions:**

Ond has already suggested something like the following as an example:

```
ls *.tar.xz | parallel --lb -j10 "until rsync -ailP --log-file=rsync.log /scratch1/temp/{} pearcey:/scratch1/ ; do echo rsync failed - resyncing {}; sleep 1; done"
```

but I still need to get my head around how this would work.
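For what it's worth, an annotated restatement of that one-liner (the hosts and paths are Ond's; the comments are an interpretation):

```bash
# ls *.tar.xz         -> emit one tarfile name per line
# parallel --lb -j10  -> run up to 10 of the quoted commands at once,
#                        line-buffering their output; {} is replaced by
#                        each input line (one filename per job)
# until ...; do ...; done
#                     -> keep retrying the rsync for that single file,
#                        sleeping 1s between attempts, until it exits 0;
#                        a broken pipe then restarts only that file
#                        instead of killing the whole transfer
ls *.tar.xz | parallel --lb -j10 \
  "until rsync -ailP --log-file=rsync.log /scratch1/temp/{} pearcey:/scratch1/ ; do echo rsync failed - resyncing {}; sleep 1; done"
```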