Closed: MehtapIsik closed this issue 8 years ago
Checking first to see if I can understand why. But basically the answer will be: ssh to the node and kill the PID.
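A minimal sketch of that workflow, run from a login node. The node name `gpu-2-14` and the PID are taken from the listing later in this thread and are only illustrative; substitute whatever your own `top`/`qstat` output shows:

```python
import subprocess

# Illustrative values from this thread; replace with your own node and PID.
node = "gpu-2-14"
pid = 21234

# SSH to the compute node and send SIGTERM to the orphaned process.
# Escalate to `kill -9` only if the process ignores SIGTERM.
subprocess.run(["ssh", node, f"kill {pid}"], check=True)
```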
I don't see those jobs on gpu-2-14, so I can only assume you killed them.
How recently did your GPU jobs terminate in the queue, @MehtapIsik? Was this recent, or are these processes that were running for days after they were supposedly terminated?
Just wondering if we need to be on the lookout for this as we scale up our YANK runs.
These processes were running days after the batch GPU jobs were terminated.
Thanks, now they are killed.
I assume you are saying you killed them, as I did not.
These are all MPI processes that appear to fail to be cleaned up properly by Torque. This thread is old, but suggests the problem may be similar: Torque may not know about the processes being launched via `mpirun`.

I'll look into whether the `mpirun` distributed with conda (`mpi4py`) can detect when the parent process is killed and terminate immediately. I am going to request tracking this via #415.
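To illustrate the idea (this is not something YANK or conda's `mpirun` does here; it is just a hedged sketch using `mpi4py`): each worker records its parent PID at startup and polls it from a background thread, aborting the whole MPI job if the parent disappears. The function name `watch_parent` and the 5-second poll interval are made up for the example.

```python
import os
import threading
import time

from mpi4py import MPI

def watch_parent(initial_ppid, interval=5.0):
    """Terminate this rank if its launching parent process disappears.

    When the parent (e.g. mpirun, or the batch script that invoked it)
    dies, the worker is re-parented (typically to PID 1), so a change
    in getppid() signals that the parent is gone.
    """
    while True:
        time.sleep(interval)
        if os.getppid() != initial_ppid:
            # Abort the whole MPI job rather than leaving stray ranks
            # spinning at 100% CPU on the compute node.
            MPI.COMM_WORLD.Abort(1)

# Record the parent PID at startup and poll it in a daemon thread.
watchdog = threading.Thread(target=watch_parent, args=(os.getppid(),), daemon=True)
watchdog.start()

# ... the normal MPI worker loop would run here ...
```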
I don't have any GPU jobs running in the Hal cluster queue right now, but somehow I have these processes left running on gpu-2-14. They must be left over from earlier jobs run on the batch queue. How can I kill these processes?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21234 misik 20 0 285g 406m 90m R 100.5 0.2 5871:23 yank
21233 misik 20 0 285g 419m 103m R 100.2 0.2 5894:06 yank
21236 misik 20 0 285g 419m 103m R 100.2 0.2 5853:01 yank
21235 misik 20 0 285g 406m 90m R 99.8 0.2 5884:33 yank
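For completeness, here is a small sketch of how one might clean these up from the node itself after ssh-ing to gpu-2-14, assuming the third-party `psutil` package is available there. The process name `yank` and the owner come from the listing above; everything else is illustrative.

```python
import getpass

import psutil  # third-party package; assumed to be installed on the node

# Find leftover 'yank' processes owned by the current user (here 'misik',
# per the top output above) and terminate them.
me = getpass.getuser()
for proc in psutil.process_iter(attrs=["pid", "name", "username"]):
    if proc.info["name"] == "yank" and proc.info["username"] == me:
        print(f"terminating PID {proc.info['pid']}")
        proc.terminate()              # polite SIGTERM first
        try:
            proc.wait(timeout=10)
        except psutil.TimeoutExpired:
            proc.kill()               # escalate to SIGKILL if it hangs
```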