Closed: MehtapIsik closed this issue 8 years ago
Checking first to see if I can understand why. But basically the answer will be: ssh to the node and kill the PID.
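A minimal sketch of that workflow, run from a login node. The node name `gpu-2-14` and the PID are taken from the listing later in this thread and are only illustrative; substitute whatever your own `top`/`qstat` output shows:

```python
import subprocess

# Illustrative values from this thread; replace with your own node and PID.
node = "gpu-2-14"
pid = 21234

# SSH to the compute node and send SIGTERM to the orphaned process.
# Escalate to `kill -9` only if the process ignores SIGTERM.
subprocess.run(["ssh", node, f"kill {pid}"], check=True)
```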
I don't see those jobs on gpu-2-14, so I can only assume you killed them.
How recently did your GPU jobs terminate in the queue, @MehtapIsik? Was this recent, or are these processes that were running for days after they were supposedly terminated?
Just wondering if we need to be on the lookout for this as we scale up our YANK runs.
These processes were running days after the batch GPU jobs were terminated.
Thanks, now they are killed.
I assume you are saying you killed them, as I did not.
These are all MPI processes that appear to fail to be cleaned up properly by Torque. This thread is old, but suggests the problem may be similar: Torque may not know about the processes being launched via `mpirun`.

I'll look into whether the `mpirun` distributed with conda (`mpi4py`) can detect when the parent process is killed and terminate immediately. I am going to request tracking this via #415.
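To illustrate the idea (this is not something YANK or conda's `mpirun` does here; it is just a hedged sketch using `mpi4py`): each worker records its parent PID at startup and polls it from a background thread, aborting the whole MPI job if the parent disappears. The function name `watch_parent` and the 5-second poll interval are made up for the example.

```python
import os
import threading
import time

from mpi4py import MPI

def watch_parent(initial_ppid, interval=5.0):
    """Terminate this rank if its launching parent process disappears.

    When the parent (e.g. mpirun, or the batch script that invoked it)
    dies, the worker is re-parented (typically to PID 1), so a change
    in getppid() signals that the parent is gone.
    """
    while True:
        time.sleep(interval)
        if os.getppid() != initial_ppid:
            # Abort the whole MPI job rather than leaving stray ranks
            # spinning at 100% CPU on the compute node.
            MPI.COMM_WORLD.Abort(1)

# Record the parent PID at startup and poll it in a daemon thread.
watchdog = threading.Thread(target=watch_parent, args=(os.getppid(),), daemon=True)
watchdog.start()

# ... the normal MPI worker loop would run here ...
```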
I don't have any GPU jobs running in the Hal cluster queue right now, but somehow I have these processes left running on gpu-2-14. They must be left over from earlier jobs run on the batch queue. How can I kill these processes?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21234 misik 20 0 285g 406m 90m R 100.5 0.2 5871:23 yank
21233 misik 20 0 285g 419m 103m R 100.2 0.2 5894:06 yank
21236 misik 20 0 285g 419m 103m R 100.2 0.2 5853:01 yank
21235 misik 20 0 285g 406m 90m R 99.8 0.2 5884:33 yank
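For completeness, here is a small sketch of how one might clean these up from the node itself after ssh-ing to gpu-2-14, assuming the third-party `psutil` package is available there. The process name `yank` and the owner come from the listing above; everything else is illustrative.

```python
import getpass

import psutil  # third-party package; assumed to be installed on the node

# Find leftover 'yank' processes owned by the current user (here 'misik',
# per the top output above) and terminate them.
me = getpass.getuser()
for proc in psutil.process_iter(attrs=["pid", "name", "username"]):
    if proc.info["name"] == "yank" and proc.info["username"] == me:
        print(f"terminating PID {proc.info['pid']}")
        proc.terminate()              # polite SIGTERM first
        try:
            proc.wait(timeout=10)
        except psutil.TimeoutExpired:
            proc.kill()               # escalate to SIGKILL if it hangs
```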