dib-lab / farm-notes

notes on the farm cluster

Managing scratch data [Discussion] #30

Open mr-eyes opened 2 years ago

mr-eyes commented 2 years ago

I submitted more than 50 simultaneous jobs that were processed on multiple nodes. Unfortunately, the jobs were writing to /scratch/$USER and failed without removing their scratch data.

This is a simple command to delete all data from /scratch/$USER on every node:

scontrol show node | grep "NodeName" | cut -d'=' -f2 | cut -d' ' -f1 | xargs -I {} ssh "$USER@{}" "rm -rf /scratch/${USER:?}/"

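
The root problem above is that failed jobs left data behind. One way to avoid that, sketched here under assumptions, is to have each job create its own scratch directory and remove it with an EXIT trap. The names `run_with_scratch`, `SCRATCH_BASE`, and `JOB_SCRATCH` are hypothetical and not part of any cluster setup; the default base path is the `/scratch/$USER` location mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch: run a command with a private scratch directory that is removed
# when the command finishes, whether it succeeds or fails.
# SCRATCH_BASE is a hypothetical override; the default is /scratch/$USER.

run_with_scratch() {
    (
        base="${SCRATCH_BASE:-/scratch/${USER:?USER is not set}}"
        mkdir -p "$base" || exit 1

        # One fresh directory per invocation, so parallel jobs do not collide.
        JOB_SCRATCH="$(mktemp -d "$base/job.XXXXXX")" || exit 1
        export JOB_SCRATCH

        # Remove the directory when this subshell exits, whatever the status.
        trap 'rm -rf -- "$JOB_SCRATCH"' EXIT

        "$@"
        status=$?
        exit "$status"
    )
}
```

Inside a job script this could be called as `run_with_scratch ./my_step.sh`, where the hypothetical `my_step.sh` writes its temporary files under `$JOB_SCRATCH`. This does not help if the process is killed with SIGKILL or the node goes down, so a node-wide cleanup like the command above may still be needed occasionally.
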
ctb commented 2 years ago

What if we made a /group/ctbrowngrp/scratch/tmp directory that explicitly could be cleaned out at any time?

mr-eyes commented 2 years ago

And each user would have a scratch directory at /group/ctbrowngrp/scratch/tmp/${USER} ?
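
A minimal sketch of that per-user layout, assuming the `/group/ctbrowngrp/scratch/tmp` path proposed above exists and is writable. The function name `user_tmp_dir` and the `SHARED_TMP_ROOT` override are hypothetical, added only so the snippet can be tried somewhere other than farm.

```shell
#!/usr/bin/env bash
# Sketch: create (if needed) and print a per-user directory under the
# shared temp area. SHARED_TMP_ROOT is a hypothetical override; the
# default is the path suggested in this thread.

user_tmp_dir() {
    local root="${SHARED_TMP_ROOT:-/group/ctbrowngrp/scratch/tmp}"
    local dir="$root/${USER:?USER is not set}"

    # mkdir -p is a no-op if the directory already exists.
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}
```

A job script could then use something like `workdir="$(user_tmp_dir)"` and write its temporary files under `$workdir`.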

ctb commented 2 years ago

sure, something like that? I don't think we need to enforce it, can just suggest it. it's easy enough to figure out who owns files.

my logic behind this suggestion is

mr-eyes commented 2 years ago

Great idea! Would it be convenient to symlink this directory to each node's /scratch/$USER?

ctb commented 2 years ago

if we provide scripts/code for people to use, we could put that in there for sure! but it's not really something we can enforce, just suggest/facilitate.

SichongP commented 2 years ago

I just want to point out that the main reason to use /scratch on compute nodes is that those directories are local to each compute node, so IO-heavy jobs won't burden the main storage disks.

In that sense using /group/ctbrowngrp2 sort of defeats the purpose of local scratch directories. I think we just need to make it very clear when to use /scratch and when to use /group/ctbrowngrp2/scratch, which can be confusing :)

mr-eyes commented 2 years ago

I totally missed the real point of using node-local scratch! So there is no real way to expand node-local scratch space for I/O-intensive jobs?

ctb commented 2 years ago

I thought of this, but in conversations with HPC folk, two things came up that made me decide this route -

so I would suggest making this point as a kind of second tier thing, like "if you have something that is going to be ridiculously IO intensive on small amounts of data (under a few hundred GB), please consider using local scratch" - but, realistically, there are relatively few such use cases in genomics.

thoughts?

SichongP commented 2 years ago

"if you have something that is going to be ridiculously IO intensive on small amounts of data (under a few hundred GB), please consider using local scratch" - but, realistically, there are relatively few such use cases in genomics.

I agree! I read through the thread again and realized that this new /tmp directory should actually serve a completely different purpose than /scratch. I've mainly used /scratch when I need to zip/unzip hundreds of small files simultaneously, or run sorting algorithms that read/write many small chunk files. I've never needed more than ~100 GB of space for this, and the need for it is indeed relatively low.
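
As a rough sketch of that rule of thumb, a job that is already known to be IO-heavy could pick its working area from the size of its input. Everything here is an assumption for illustration: the function name `choose_work_root`, the `LOCAL_SCRATCH_ROOT`, `SHARED_TMP_ROOT`, and `LOCAL_LIMIT_GB` variables, and the 100 GB default, which is not a cluster policy.

```shell
#!/usr/bin/env bash
# Sketch: print the local scratch path for small inputs and the shared
# temp path for large ones. The paths and the size limit are assumptions.

choose_work_root() {
    local input="$1"
    local limit_gb="${LOCAL_LIMIT_GB:-100}"
    local local_root="${LOCAL_SCRATCH_ROOT:-/scratch/${USER:?USER is not set}}"
    local shared_root="${SHARED_TMP_ROOT:-/group/ctbrowngrp/scratch/tmp/${USER}}"
    local size_kb

    [ -e "$input" ] || return 1

    # du -sk reports disk usage in 1 KiB blocks (first tab-separated field).
    size_kb="$(du -sk -- "$input" | cut -f1)"
    [ -n "$size_kb" ] || return 1

    if [ "$size_kb" -le $((limit_gb * 1024 * 1024)) ]; then
        printf '%s\n' "$local_root"
    else
        printf '%s\n' "$shared_root"
    fi
}
```

For example, `work="$(choose_work_root input_dir)"` would give the node-local path for a small `input_dir` and the shared path otherwise. The check only looks at size; whether a job is IO-intensive enough to need local scratch is still a judgment call.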

I would suggest not calling this new directory "scratch" though, perhaps simply "temp" or "draft"? I'm training some undergrads in our lab and I can see how having two "scratch" directories can be very confusing for someone just getting started with HPC :)

ctb commented 2 years ago

you can use /group/ctbrowngrp2 directly if you wish.

BUT - thinking about new students - I think telling them that there is temp or draft along with scratch sounds confusing, too! You can maybe just ignore /scratch altogether and tell them to use /group/ctbrowngrp2 or /group/ctbrowngrp/scratch directly?

ctb commented 2 years ago

(my watchword in teaching is that complete correctness is overrated; we want to maximize ability to Get Things Done while minimizing impact of bad behavior on others. Speed and optimization are strictly tertiary, and can be more easily addressed once the sweet sweet dopamine of success has been triggered.)

SichongP commented 2 years ago

That makes sense. I agree :)