cBio / cbio-cluster

MSKCC cBio cluster documentation

IMPORTANT User contact attempt due to issue with spool file size #314

Closed tatarsky closed 9 years ago

tatarsky commented 9 years ago

I am attempting to contact @angelamyu

If you know this user please try to assist.

I have tried email and am going to call shortly. Several jobs by this user have extremely large stdout files in the Torque spool. We almost filled the Torque spool disk a moment ago.

I am unclear why these jobs have such massive (150 MB and above) stdout files in the spool, so I need to examine the jobs with this person. The alternative is to kill those jobs, but I am trying not to do so.

I have managed to clear some space but the spool area is not large and has never had this happen before.
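
A minimal sketch of how one might spot the offending files, assuming the stock Torque spool location `/var/spool/torque/spool` (adjust the path for your install):

```sh
# Overall spool usage; /var/spool/torque/spool is the Torque default
du -sh /var/spool/torque/spool

# List spool files larger than 100 MB, largest first
find /var/spool/torque/spool -type f -size +100M -exec ls -lhS {} +
```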

tatarsky commented 9 years ago

These are jobs that are not even running yet, so I am going to remove them from the queue and discuss it with the user in the morning. I feel it's more important to protect the state of running jobs. If you disagree, comment below.
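
For reference, a sketch of the removal itself; the job ID below is hypothetical:

```sh
# Confirm the job is still queued (state Q) rather than running
qstat 123456

# Remove the queued job; the user can resubmit it later
qdel 123456
```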

tatarsky commented 9 years ago

I have figured it out. This is a result of these jobs sitting in the lowpriority queue: they have been preempted multiple times, and each time the output file is appended to. The result is a massive Torque spool file.

I am emailing the user to suggest these be re-queued in the regular batch queues, or that the stdout be reduced....
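
If re-queueing is the route taken, one way to do it is Torque's qmove; a sketch, assuming a destination queue actually named batch and a hypothetical job ID:

```sh
# Move a queued job out of the preemptable lowpriority queue into the
# regular batch queue (queue name and job ID are assumptions)
qmove batch 123456
```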

juanperin commented 9 years ago

Thanks Paul, I'm sorry I missed this earlier. Actually, I know this user well because she interned with me at CHOP a while ago. Sounds like it's all sorted out, and given my experience with her in the past, she's a very good user and learns quickly. I believe I can get a hold of her if you don't hear back.

Juan


tatarsky commented 9 years ago

Yes, email contact has been made and we're looking at the issue. Jobs with the large spool files have been killed to be re-run, and disk usage is now normal. The immediate crisis is over, and we will work with the user to determine the best way forward!

akahles commented 9 years ago

Just a hunch: would the qsub option -k help in this case? The huge output files would then be written to the user's home directory and would not be held in the spool.
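
For the record, a minimal sketch of that option; the script name is hypothetical:

```sh
# -k oe keeps both stdout (o) and stderr (e) in the user's home
# directory on the execution host rather than the server spool
qsub -k oe myscript.sh
```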

tatarsky commented 9 years ago

It's part of the items I am mentioning ;)

akahles commented 9 years ago

Sorry, didn't see that.

tatarsky commented 9 years ago

You could not have seen it, as it will be in an email ;)

jchodera commented 9 years ago

Thanks for the quick action!

Three thoughts:

tatarsky commented 9 years ago

For the first two, I will look, but I did not see one.

For the latter, it's already in my notes for said new master node spec! (The spool was already something I considered undersized, but I had never seen this case of filling it before.)