Managing scratch data [Discussion]

mr-eyes commented 2 years ago

I submitted more than 50 simultaneous jobs that were processed on multiple nodes. Unfortunately, the jobs were writing to /scratch/$USER and failed without removing their scratch data.

This is a simple command to delete all data from /scratch/$USER on every node:

scontrol show node | grep "NodeName" | cut -d'=' -f2 | cut -d' ' -f1 |  xargs -I {} ssh $USER@{} rm -rf /scratch/$USER/
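
A slightly more defensive variant of the one-liner above, as a sketch only. The function names (node_names, clean_scratch) and the DRY_RUN variable are made up for illustration. Like the original, it assumes passwordless ssh to every node. It only prints the commands unless DRY_RUN=0 is set, and it would also delete scratch data belonging to any of your jobs that are still running.

```shell
# Extract node names from `scontrol show node` output read on stdin.
node_names() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^NodeName=/) print substr($i, 10) }' | sort -u
}

# For each node name on stdin, remove /scratch/$USER/ on that node.
# Prints the commands by default; set DRY_RUN=0 to really delete.
clean_scratch() {
    target="/scratch/${USER:?USER is not set}/"
    while read -r node; do
        [ -n "$node" ] || continue
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # -n keeps ssh from swallowing the node list on stdin;
            # BatchMode and ConnectTimeout avoid hanging on a down node.
            ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$USER@$node" "rm -rf -- $target" \
                || echo "failed on $node" >&2
        else
            echo "ssh $USER@$node rm -rf -- $target"
        fi
    done
}

# Usage:
#   scontrol show node | node_names | clean_scratch            # dry run
#   scontrol show node | node_names | DRY_RUN=0 clean_scratch  # delete
```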

ctb commented 2 years ago

What if we made a /group/ctbrowngrp/scratch/tmp directory that explicitly could be cleaned out at any time?

mr-eyes commented 2 years ago

And each user would have a scratch directory at /group/ctbrowngrp/scratch/tmp/${USER}?

ctb commented 2 years ago

sure, something like that? I don't think we need to enforce it, can just suggest it. it's easy enough to figure out who owns files.

my logic behind this suggestion is

- we clearly need big scratch space (motivation for ctbrowngrp2!)
- sometimes that scratch space is explicitly temporary, as in, while jobs are running - that would be the suggestion for a scratch/tmp directory. This space would be something you can trash after every set of big runs.
- other times you want a space for running potentially very big jobs, and you don't want those jobs to run out of space. And that stuff isn't as "temporary". I envision trying to keep ctbrowngrp2 mostly free (like, 50% or below) for this purpose - 50 TB should be enough even for @SichongP :).
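
One way the "explicitly temporary" case could look in a job script, as a rough sketch. The base path follows the /group/ctbrowngrp/scratch/tmp suggestion in this thread. The function names and the SCRATCH_BASE override are hypothetical. SLURM_JOB_ID is set by Slurm inside a job, and the sketch falls back to "manual" outside one. The directory is removed whether the wrapped command succeeds or fails, but not if the job is killed outright.

```shell
# Create a private per-job directory under the shared tmp area and print its path.
make_job_tmp() {
    base="${SCRATCH_BASE:-/group/ctbrowngrp/scratch/tmp}/${USER:?USER is not set}"
    mkdir -p "$base" && mktemp -d "$base/job-${SLURM_JOB_ID:-manual}-XXXXXX"
}

# Run a command inside a fresh per-job directory, then remove the directory
# and return the command's exit status.
run_in_job_tmp() {
    job_tmp=$(make_job_tmp) || return 1
    status=0
    ( cd "$job_tmp" && "$@" ) || status=$?
    rm -rf -- "$job_tmp"
    return "$status"
}

# Usage inside an sbatch script (my_pipeline.sh is a placeholder):
#   run_in_job_tmp ./my_pipeline.sh input.fq
```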

mr-eyes commented 2 years ago

Great idea! Would it be convenient to symlink this directory to the nodes' /scratch/$USER?

ctb commented 2 years ago

if we provide scripts/code for people to use, we could put that in there for sure! but it's not really something we can enforce, just suggest/facilitate.

SichongP commented 2 years ago

I just want to point out that the main reason to use /scratch on compute nodes is that those directories are local to the compute nodes, so I/O-heavy jobs won't burden the main storage disks.

In that sense using /group/ctbrowngrp2 sort of defeats the purpose of local scratch directories. I think we just need to make it very clear when to use /scratch and when to use /group/ctbrowngrp2/scratch, which can be confusing :)

mr-eyes commented 2 years ago

I totally missed the real point of using node-local scratch! So is there no real way to expand the local scratch space on the nodes for I/O-intensive jobs?

ctb commented 2 years ago

I thought of this, but in conversations with HPC folk, two things came up that made me decide this route -

- the scratch directories on the nodes are not going to be increased in size, period. There's simply no more disk.
- the network to the file server is ridiculously fast, and I think the main reason now to avoid using the home directory disk is in case of many reads/writes, which will be slightly slower over the network and ALSO will bog down shared disk. hence the goal of putting the new big ctbrowngrp2 scratch space on a different disk, too.

so I would suggest making this point as a kind of second tier thing, like "if you have something that is going to be ridiculously IO intensive on small amounts of data (under a few hundred GB), please consider using local scratch" - but, realistically, there are relatively few such use cases in genomics.

thoughts?
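
As a sketch of that second-tier rule: use node-local scratch only when it has room for the job, otherwise fall back to the group space. The function names and the LOCAL_SCRATCH / GROUP_SCRATCH overrides are hypothetical, and the default paths are just the ones mentioned in this thread. Whether a job is I/O-intensive enough to bother is still a judgment call that this does not make for you.

```shell
# Free space, in whole GB, on the filesystem holding directory $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print the node-local scratch path if it has at least $1 GB free,
# otherwise print the group scratch path.
pick_scratch() {
    need_gb=$1
    local_dir="${LOCAL_SCRATCH:-/scratch}"
    group_dir="${GROUP_SCRATCH:-/group/ctbrowngrp2/scratch}"
    if [ -d "$local_dir" ] && [ "$(free_gb "$local_dir")" -ge "$need_gb" ]; then
        echo "$local_dir/${USER:?USER is not set}"
    else
        echo "$group_dir/${USER:?USER is not set}"
    fi
}

# Usage: a job that needs about 100 GB of working space
#   workdir=$(pick_scratch 100) && mkdir -p "$workdir"
```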

SichongP commented 2 years ago

> "if you have something that is going to be ridiculously IO intensive on small amounts of data (under a few hundred GB), please consider using local scratch" - but, realistically, there are relatively few such use cases in genomics.

I agree! I read through the thread again and realized that this new /tmp directory should actually serve a completely different purpose than /scratch. I've mainly used /scratch when I need to zip/unzip 100s of small files simultaneously, or run sorting algorithms that read/write many small chunk files. I've never needed more than ~100 GB of space for this, and the need for it is indeed relatively low.

I would suggest not calling this new directory "scratch" though, perhaps simply "temp" or "draft"? I'm training some undergrads in our lab and I can see how having two "scratch" directories can be very confusing for someone just getting started with HPC :)

ctb commented 2 years ago

you can use /group/ctbrowngrp2 directly if you wish.

BUT - thinking about new students - I think telling them that there is temp or draft along with scratch sounds confusing, too! You can maybe just ignore /scratch altogether and tell them to use /group/ctbrowngrp2 or /group/ctbrowngrp/scratch directly?

ctb commented 2 years ago

(my watchword in teaching is that complete correctness is overrated; we want to maximize ability to Get Things Done while minimizing impact of bad behavior on others. Speed and optimization are strictly tertiary, and can be more easily addressed once the sweet sweet dopamine of success has been triggered.)

SichongP commented 2 years ago

That makes sense. I agree :)

dib-lab / farm-notes

Managing scratch data [Discussion] #30