cBio / cbio-cluster

MSKCC cBio cluster documentation

hal: no space left on device #417

Closed corcra closed 8 years ago

corcra commented 8 years ago

Experiencing this on hal right now:

> cd ~
> touch testfile
> touch: cannot touch "testfile": No space left on device

This is preventing me from doing anything useful (mostly writing results files and using git). I tried deleting a bunch of files but it's still happening, what's going on?

jchodera commented 8 years ago

I can confirm the same behavior, despite df showing 91T free:

[chodera@mskcc-ln1 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              41G   27G   12G  70% /
/dev/sda5             834G  144G  648G  19% /state/partition1
/dev/sda2              39G   25G   12G  69% /var
tmpfs                 253G  370M  252G   1% /tmp
/dev/gpfsdev          1.6P  1.5P   91T  95% /cbio
/dev/snapdev          546T  448T   98T  83% /snapshot
...

In case this is really an "out of space" issue, I've finally been able to release some lab Folding@home directories to free up 26T of space. Had meant to do this last weekend.

I can now create files again:

[chodera@mskcc-ln1 ~]$ touch testfile
[chodera@mskcc-ln1 ~]$

I don't know why this occurred, but if it was indeed the worst-case scenario where we suddenly ran out of quota or space, all long-running jobs in progress may now be corrupted and should be checked. On our end, I think this primarily impacts @MehtapIsik, where any running job is almost certainly garbage now and should be terminated and examined in case it needs to be wiped and restarted from the beginning.

corcra commented 8 years ago

I could briefly create files a minute ago, but it seems to be out of space again.

jchodera commented 8 years ago

I can confirm this behavior:

[chodera@mskcc-ln1 ~]$ touch x
touch: cannot touch `x': No space left on device

corcra commented 8 years ago

My long-running jobs appear to still be writing to disk though, and the results don't look corrupted...

jchodera commented 8 years ago

Phew! That's the best news I've heard all day!

tatarsky commented 8 years ago

Someone managed to consume all the inodes overnight. Investigating. This will affect creation of new files until I expand the inode count.
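
For anyone hitting the same symptom: "No space left on device" while `df -h` still shows free space usually points at inodes rather than blocks, and `df -i` reports inode usage instead of block usage. The following is a minimal sketch, not the check that was actually run here. The function names `inode_status` and `check_inodes` are made up for illustration, and it assumes a `df` that accepts `-P` and `-i` together (GNU and BSD `df` do).

```shell
# Parse one data line of `df -Pi` output
# (fields: filesystem, inodes, iused, ifree, iuse%, mount point)
# and print "exhausted" when the filesystem reports inodes but has none free.
inode_status() {
    echo "$1" | awk '{ if ($2 > 0 && $4 == 0) print "exhausted"; else print "ok" }'
}

# Show the inode line for the filesystem holding path $1, then its status.
# Hypothetical helper: filesystems that do not report inode counts show "ok".
check_inodes() {
    line=$(df -Pi "$1" | tail -n 1)
    echo "$line"
    inode_status "$line"
}
```

Running `check_inodes ~` on the login node would print the inode line for the filesystem that holds your home directory, followed by `exhausted` or `ok`.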

tatarsky commented 8 years ago

Count expanded. Now finding who the heavy consumer of the inodes was.
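
As a rough illustration of how a heavy inode consumer can be tracked down with standard tools: every file, directory, and symlink costs one inode, so counting entries under each subdirectory approximates inode usage. This is only a sketch under that assumption, not necessarily how it was done on this cluster. The function name `count_entries` is hypothetical, and a full `find` over a filesystem of this size would be slow, so it is best pointed at a single project or user directory.

```shell
# Print "<count> <dir>" for each immediate subdirectory of $1, largest first.
# The count includes the subdirectory itself. Hidden subdirectories are
# skipped by the glob, and hard links are counted once per name.
count_entries() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        # -xdev keeps find from crossing into other mounted filesystems.
        printf '%s %s\n' "$(find "$d" -xdev | wc -l | tr -d ' ')" "${d%/}"
    done | sort -rn
}
```

Piping the result through `head` (for example `count_entries "$HOME" | head`) lists the subdirectories holding the most entries.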

tatarsky commented 8 years ago

I have located the user. I may need to kill their jobs if I cannot contact them. They added a very large number of inodes. Alerts did fire in the wee hours of the morning, but I am not monitoring 24x7.

tatarsky commented 8 years ago

User has had an inode quota applied and will be contacted.

tatarsky commented 8 years ago

Group has had inode quota applied per @juanperin. Closing this. Further discussions will be in the hpc-request ticket.

jchodera commented 8 years ago

Thanks once again for the quick detective work, @tatarsky!