Thomas-Moore-Creative / CSIRO-NCI-data-best-practice

open repo to help us formulate best practice techniques and codes for data management between CSIRO and NCI systems
GNU General Public License v3.0

GNU parallel vs xargs #6

Open hot007 opened 1 year ago

hot007 commented 1 year ago

I've never been an xargs user; it confuses me. But here's an example of doing an rsync with xargs instead of parallel, just documenting this here for reference (h/t @dsroberts). The following copies the contents of the current directory in 8 parallel streams, using xargs as a sort of metascheduler.

printf '%s\n' * | xargs -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/
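Since this issue is about GNU parallel vs xargs, the rough GNU parallel equivalent would be something like the following (a sketch only, relying on parallel's default one-argument-per-line behaviour; not benchmarked):

# same idea with GNU parallel: 8 concurrent rsync jobs, one source item each
printf '%s\n' * | parallel -j 8 rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/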
hot007 commented 1 year ago

He observed a copy rate of about 1.2 GB/s within NCI, which is about what we'd expect from our parallel tests; the transfer appears to be CPU limited.

dsroberts commented 1 year ago

Hi all. I had a bit more of a think about this, and I came up with the following:

xargs -a <( find ! -type d ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This launches a different rsync process for every file, so there's probably too much overhead when transferring lots of small files or symlinks. However, if you're transferring lots of large files (my test is 5.6 TB across 273 files), this gets around login-node CPU time limits, as none of the individual rsyncs hits the limit. It also has the benefit of neatly balancing transfers when the top-level directories vary in size.

The find command probably needs refinement; I'm only transferring plain files, so I didn't need to think too hard about it.
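For filenames containing spaces or other awkward characters, a NUL-delimited variant along these lines should be more robust (a sketch only, assuming GNU find and xargs; /path/to/destination is a placeholder as above):

# NUL-delimited file list so names with spaces or newlines survive the pipeline
xargs -0 -a <( find . ! -type d -print0 ) -P 8 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination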

Thomas-Moore-Creative commented 1 year ago

Thanks @hot007 & @dsroberts for documenting this here. It's been a little while since I really tested the parallel approach, but hopefully it's still useful as a template that can offer some chunkier performance.

dsroberts commented 7 months ago

Just resurrecting this post: I've been using this to move lots of data around, and I've found that with files of varying size you can wind up with a 'long tail' problem, whereby a large file ends up towards the end of the file list and the whole command takes much longer to run. I propose the following:

xargs -a <( find ! -type d -ls | sort -h -k7 -r | awk '{print $11}' ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This sorts the output of find by file size, meaning the largest files are always transferred first. As above, the find needs refinement, as this will fall over for filenames with spaces.
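A more robust version of the size-sorted variant could use GNU find's -printf with NUL-terminated records, so filenames with spaces don't break the awk parsing (a sketch only, assuming GNU find, sort, cut and xargs; /path/to/destination is a placeholder):

# emit "size<TAB>path" NUL-terminated, sort largest-first, strip the size field, feed xargs
find . ! -type d -printf '%s\t%p\0' | sort -z -k1,1 -rn | cut -z -f2- | xargs -0 -P 8 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination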

Thomas-Moore-Creative commented 7 months ago

Just resurrecting this post, I've been using this to move lots of data around and I've found ...

Thanks for that advice and experience, @dsroberts. I haven't needed to move lots of data around recently, but it's great that you're using and now "tuning" this.

Q: In your opinion does this still beat the "new" offerings via Globus?

dsroberts commented 7 months ago

I'm moving data between file systems on Gadi, so not really in a place where I can compare it with Globus.

hot007 commented 7 months ago

That is some utterly arcane bash!! That said, it's a good idea, thank you. I haven't used Globus enough recently to make a meaningful comment, but my observation of both Globus and rclone (which is also great, especially for cloud transfers) is that I'm not aware of either of them doing anything smart about ordering transfers by file size. I've got a feeling that Globus might have the capability to split files, which would help, but my guess is that the above is a good solution for command-line data transfers. The downside with Globus is that, in my experience, it tends to take a few days to get storage locations added to endpoints, so if your storage issue is urgent you may need to initiate a transfer like this to chug away while setting up something more professional for future use...
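For reference, rclone can also run multiple transfer streams in parallel via its --transfers flag; a minimal sketch (source path and remote name are placeholders, and this is untested here):

# 8 parallel transfer streams with rclone
rclone copy /path/to/source remote:destination --transfers 8 --progress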

dsroberts commented 7 months ago

Splitting the files and transferring the chunks in parallel would negate the need to sort by file size, but that's well beyond anything that can be done sensibly in bash.