cBio / cbio-cluster

MSKCC cBio cluster documentation

recover prematurely terminated active queue job #395

Closed by lzamparo 8 years ago

lzamparo commented 8 years ago

Hi,

Say I have a job enqueued in the active queue that's waiting to start:

[zamparol@mskcc-ln1 ~]$ showq -w class=active | grep zamparol
7055514            zamparol       Idle     1     3:00:00  Mon Mar 28 11:44:49

However, after the job had waited several hours without starting, my terminal session timed out and the connection dropped:

[zamparol@mskcc-ln1 submit_scripts]$ gpuactive
qsub: waiting for job 7055514.hal-sched1.local to start

Write failed: Broken pipe
mski1743:$ 

If I ssh back into hal, is there any way to salvage this job, which is still waiting in the active queue? I'm not very familiar with the Torque tools, but surely there is a way?

Thanks,

akahles commented 8 years ago

I would recommend using a screen or tmux session for this purpose. Even if your ssh connection times out, the interactive session stays alive inside screen or tmux. I don't think there is a way to get back to your interactive job otherwise.
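
As a rough illustration of the tmux suggestion, the sketch below creates a named session on the login node, or re-attaches to it if it already exists. The session name `gpujob` and the helper name `job_session` are made up for this example. The interactive submit command itself (`gpuactive` in this thread) would be typed inside the session.

```shell
# Sketch: tie the interactive job to a tmux session instead of the SSH connection.
# "gpujob" is an arbitrary session name chosen for this example (assumption).
session="gpujob"

job_session() {
    if tmux has-session -t "$session" 2>/dev/null; then
        # The session survived a dropped connection: re-attach to it.
        tmux attach-session -t "$session"
    else
        # First use: create the session; start the interactive job inside it.
        tmux new-session -s "$session"
    fi
}
```

A typical flow would be: run `job_session`, start the interactive job inside it, and detach with `Ctrl-b d`. After reconnecting by ssh to the same login node, running `job_session` again should bring back the waiting (or running) job.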

tatarsky commented 8 years ago

Yes to the comment above. And what is timing out your SSH? We don't do that on our side.

Good question on re-attaching. I believe you've lost the tty that would make that possible. One moment while I look.

lzamparo commented 8 years ago

@akahles ok thanks, I'll delete and resubmit (sigh).

@tatarsky I don't know what is timing out my SSH session, could it be some default of my client on OSX? My ~/.ssh/config has nothing set in this regard. Should I explicitly set something like ServerAliveInterval? Also, any idea if qrerun would recover an active queue session that had not yet launched?

akahles commented 8 years ago

Why would you want to recover an active queue session that had not yet launched? I believe the priority gained from time already spent in the queue is minimal.

tatarsky commented 8 years ago

I tend to use ServerAliveInterval 60 when some middlebox is terminating connections that have been idle (no packets) for too long.
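
For reference, a minimal sketch of how that setting might look in `~/.ssh/config` on the client. The `Host *` pattern is an assumption made for this example and could be narrowed to the cluster login host. `ServerAliveCountMax` is shown only to make the OpenSSH default explicit.

```
# ~/.ssh/config (client side)
Host *
    # Send a keepalive probe after 60 seconds without data from the server.
    ServerAliveInterval 60
    # Give up after 3 unanswered probes (the OpenSSH default).
    ServerAliveCountMax 3
```

This keeps traffic flowing on an otherwise idle connection. It does not help if the network path itself goes away, which is why tmux or screen is still worth using.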

I don't know what qrerun does with an interactive qsub. I believe the tty still needs to be attached to the shell.

lzamparo commented 8 years ago

Ok, I'll kill the job and resubmit. Thx.
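
A minimal sketch of the kill step, assuming the standard Torque `qdel` and `qstat` commands are on the path. The helper name `drop_idle_job` is made up for this example, and the job ID is the one from the `showq` output above.

```shell
# Sketch: remove a queued job by ID, then list the user's remaining jobs.
drop_idle_job() {
    # qdel deletes the job; qstat -u shows what is still queued or running.
    qdel "$1" && qstat -u "$USER"
}
```

Calling `drop_idle_job 7055514` would delete the orphaned job. The interactive job can then be resubmitted the same way as before (here, with `gpuactive`).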

tatarsky commented 8 years ago

@akahles queuetime is not a factor in queue priority in the current config IIRC.

akahles commented 8 years ago

@tatarsky thanks for clarifying. Then I see no disadvantage in just resubmitting the job, as it had not started yet.

lzamparo commented 8 years ago

Yeah, I just killed and resubmitted it, and am learning tmux to avoid this in the future.