Thomas-Moore-Creative / CSIRO-NCI-data-best-practice

open repo to help us formulate best practice techniques and codes for data management between CSIRO and NCI systems
GNU General Public License v3.0

Archive DCFP's NCI ALCG data to tape at CSIRO #1

Open Thomas-Moore-Creative opened 4 years ago

Thomas-Moore-Creative commented 4 years ago

Initial use case

Archive Decadal Climate Forecasting Project (DCFP) NCI Australasian Leadership Computing Grants (ALCG) raw netCDF to tape at CSIRO

The CSIRO DCFP is currently working under a recent NCI ALCG merit allocation - https://research.csiro.au/dfp/dcfp-awarded-key-computation-by-the-nci/ . Current storage limitations mean that, as the effort proceeds, large collections of files will need to be archived continually back to tape at CSIRO. If data transfer is not fast enough, work will stop at NCI due to lack of storage resources.

The current task is to move 8 x 11TB collections of tarfiles where each 11TB collection is a directory with:

112GB is close to the recommended 100GB file size for the CSIRO tape system and is also the size of each model "member", which keeps the essential structure of each model run.

Example rsync command used:

ruby:/datastore/d/dcfp/CAFE/forecasts/f6> rsync -avPSW <user>@gadi-dm.nci.org.au:/scratch/v14/<user>/tar_tmp/f6.WIP.c5-d60-pX-f6-20181101.20200820_174610 /datastore/d/dcfp/CAFE/forecasts/f6/.

NB: more recent advice from NCI recommends:

rsync -avPS -e "ssh -T -c arcfour -o Compression=no -x" <username>@gadi-dm.nci.org.au:</path/to/source> <dest>

Issues:

Solutions:

Ond has already suggested something like the following as an example:

ls *.tar.xz | parallel --lb -j10 "until rsync -ailP --log-file=rsync.log /scratch1/temp/{} pearcey:/scratch1/ ; do echo rsync failed - resyncing {}; sleep 1; done"

but I still need to get my head around how this would work.
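As a rough reading of that one-liner: `parallel -j10` runs up to 10 jobs at once, `{}` is replaced by each filename coming from `ls`, `--lb` (line-buffer) prints output line by line as each job produces it, and the `until rsync ...; do ...; done` loop keeps re-running rsync for that file until it exits with status 0. The sketch below shows the same retry logic without parallel, one file at a time. The helper names `retry_until_ok` and `transfer_all` and the `RETRY_DELAY` variable are hypothetical, not part of Ond's suggestion; the paths and host are copied from the example above.

```shell
#!/bin/sh
# Retry a command until it exits 0, mirroring the
# `until ...; do ...; done` loop in the parallel one-liner.
# retry_until_ok and RETRY_DELAY are hypothetical names for illustration.
retry_until_ok() {
    attempts=0
    until "$@"; do
        attempts=$((attempts + 1))
        echo "command failed (attempt $attempts) - retrying: $*" >&2
        sleep "${RETRY_DELAY:-1}"
    done
}

# Sequential equivalent of a single parallel job slot:
# copy each tarball in the current directory, retrying on failure.
# Assumes it is run from /scratch1/temp, as the original command implies.
transfer_all() {
    for f in *.tar.xz; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        retry_until_ok rsync -ailP --log-file=rsync.log "/scratch1/temp/$f" pearcey:/scratch1/
    done
}
```

Note that the loop has no upper limit on retries, just like the original; a permanently failing transfer would retry forever, so a maximum attempt count may be worth adding.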

hot007 commented 4 years ago

Suggestions of things to try (noting that if you solve the network issue then /datastore disk quota limit -> tape migration will be a huge bottleneck):

hot007 commented 4 years ago

Also I'd suggest putting a dmput somewhere in your transfer script to try to get it off disk and onto tape ASAP, and be nice to other tape store users, probably?

Thomas-Moore-Creative commented 4 years ago

> Also I'd suggest putting a dmput somewhere in your transfer script to try to get it off disk and onto tape ASAP, and be nice to other tape store users, probably?

@hot007 : I assume above you are talking about running a bbcp from the NCI-side? If so how have you triggered a dmput over on Ruby on the CSIRO-side once a specific copy is finished?

hot007 commented 4 years ago

Well, you have to trigger it from the CSIRO side, as you have to pull to CSIRO, not push from NCI. So I suppose you could break your copies up and add ; dmput * at the end of each. Alternatively, I suppose you could set up a cron job that checks for new data and dmputs it, which you then disable once the transfers are done (so you can in due course pull the data back to disk again!).
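A minimal sketch of the "copy, then dmput" idea, run from the CSIRO side. The function name `pull_and_archive` and the `RSYNC_CMD` / `DMPUT_CMD` override variables are hypothetical, added only so the flow can be dry-run without touching the tape system; the rsync flags are taken from the commands earlier in this thread, and it assumes the destination directory is dedicated to the data being archived.

```shell
#!/bin/sh
# Pull one collection from NCI, then ask DMF to migrate it to tape.
# pull_and_archive, RSYNC_CMD and DMPUT_CMD are hypothetical names;
# by default the real rsync and dmput are used.
pull_and_archive() {
    src=$1    # e.g. <user>@gadi-dm.nci.org.au:</path/to/source>
    dest=$2   # destination directory on the CSIRO side

    # Stop here if the copy fails, so partial data is not sent to tape.
    ${RSYNC_CMD:-rsync} -avPS "$src" "$dest"/ || return 1

    # Only reached after a successful copy: dmput every regular file
    # under dest (assumes dest holds nothing but this transfer's data).
    find "$dest" -type f -exec ${DMPUT_CMD:-dmput} {} +
}
```

Setting `DMPUT_CMD=echo` prints the files that would be migrated instead of calling dmput, which is a cheap way to check the file list first.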

Thomas-Moore-Creative commented 4 years ago

> Well, you have to trigger it from the CSIRO side, as you have to pull to CSIRO, not push from NCI. So I suppose you could break your copies up and add ; dmput * at the end of each. Alternatively, I suppose you could set up a cron job that checks for new data and dmputs it, which you then disable once the transfers are done (so you can in due course pull the data back to disk again!).

OK! I misread your bbcp code above - you are running this command from CSIRO-side?

I assume this also means one can't use the power of #PBS -q copyq (outlined here: https://opus.nci.org.au/display/Help/bbcp) to move data from Gadi to CSIRO machines? I don't get why the #PBS -q copyq option would be shown in the documents as bbcp -z -P 2 -s 16 -w 4m -S "bbcp" -T "ssh -x -a -oFallBackToRsh=no %I -l %U %H /some/other/place/bin/bbcp" somefiles remoteuser@remotehost.edu:someplace/ if you couldn't "push from NCI"?

But maybe I just need moar coffeeee?

hot007 commented 4 years ago

Well, copyq just gives you a 10hr job limit on your access to gadi-dm ;-) But yes, that code was run FROM Pearcey to (then) raijin-dm. In general you can push from NCI, but you can't push to CSIRO - our machines aren't visible from NCI so like rsync this has to be originated on our side. So in our case we have to specify the path to NCI's bbcp but not ours.

Thomas-Moore-Creative commented 4 years ago

See the developing solution here https://github.com/Thomas-Moore-Creative/CSIRO-NCI-data-best-practice/blob/master/Solution_Archive_DCFP_NCI_ALCG_data_to_CSIRO_tape.md