Closed corcra closed 8 years ago
I can confirm the same behavior, despite df showing 91T free:
[chodera@mskcc-ln1 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 41G 27G 12G 70% /
/dev/sda5 834G 144G 648G 19% /state/partition1
/dev/sda2 39G 25G 12G 69% /var
tmpfs 253G 370M 252G 1% /tmp
/dev/gpfsdev 1.6P 1.5P 91T 95% /cbio
/dev/snapdev 546T 448T 98T 83% /snapshot
...
In case this is really an "out of space" issue, I've finally been able to release some lab Folding@home directories to free up 26T of space. Had meant to do this last weekend.
I can now create files again:
[chodera@mskcc-ln1 ~]$ touch testfile
[chodera@mskcc-ln1 ~]$
I don't know why this occurred, but if it was indeed the worst-case scenario where we suddenly ran out of quota or space, all long-running jobs in progress may now be corrupted and should be checked. On our end, I think this primarily impacts @MehtapIsik, where any running job is almost certainly garbage now and should be terminated and examined in case it needs to be wiped and restarted from the beginning.
I could briefly create files a minute ago, but it seems to be out of space again.
I can confirm this behavior:
[chodera@mskcc-ln1 ~]$ touch x
touch: cannot touch `x': No space left on device
My long-running jobs appear to still be writing to disk though, and the results don't look corrupted...
Phew! That's the best news I've heard all day!
Someone managed to consume all the inodes overnight. Investigating. This will affect creation of new files until I expand the inode count.
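For anyone hitting "No space left on device" while df -h still shows free space, inode usage can be checked separately. The following is a minimal sketch, not the check used on this cluster: the function name flag_full_inodes and the 90% default threshold are illustrative, and it assumes GNU-style df -Pi output (IUse% in column 5, mount point in column 6, no spaces in mount points).

```shell
# Hypothetical helper: reads `df -Pi` style output on stdin and prints
# "<mount point> <IUse%>" for each filesystem at or above the threshold.
# Assumes GNU df column layout: Filesystem Inodes IUsed IFree IUse% Mounted-on.
flag_full_inodes() {
    threshold="${1:-90}"   # percent; 90 is an arbitrary illustrative default
    awk -v t="$threshold" '
        NR > 1 && $5 ~ /%$/ {          # skip header and rows without a percentage
            pct = $5
            sub(/%/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= t + 0) print $6, $5
        }'
}

# Usage (assumes GNU df is available):
#   df -Pi | flag_full_inodes 90
```

A filesystem listed here can refuse new files even when df -h reports plenty of free blocks, which matches the symptom in this thread.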
Count expanded. Now finding who the heavy consumer of the inodes was.
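Users who want to check whether their own directories hold an unusually large number of files could count entries per subdirectory. This is only a sketch under stated assumptions, not the method the admin used here: count_entries_per_subdir is a hypothetical name, it skips hidden subdirectories, it miscounts file names containing newlines, and a full find walk can be very slow on a large parallel filesystem.

```shell
# Hypothetical helper: for each immediate (non-hidden) subdirectory of ROOT,
# print "<entry count> <path>", largest first. The count includes the
# subdirectory itself and everything below it on the same filesystem.
count_entries_per_subdir() {
    root="${1:-.}"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue            # glob matched nothing, or not a directory
        n=$(find "$d" -xdev | wc -l)       # -xdev: do not cross into other filesystems
        printf '%s %s\n' "$n" "$d"
    done | sort -rn
}

# Usage:
#   count_entries_per_subdir "$HOME" | head
```

Each file or directory consumes one inode, so the entry count is a reasonable proxy for inode consumption, although hard links are counted once per name.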
I have located the user. I may need to kill their jobs if I cannot contact them. Very large number of inodes added. Alerts did fire in the wee hours of the morning but I am not monitoring 7x24.
User has had an inode quota applied and will be contacted.
Group has had inode quota applied per @juanperin. Closing this. Further discussions will be in the hpc-request ticket.
Thanks once again for the quick detective work, @tatarsky!
Experiencing this on hal right now:
This is preventing me from doing anything useful (mostly writing results files and using git). I tried deleting a bunch of files, but it's still happening. What's going on?