cBio / cbio-cluster

MSKCC cBio cluster documentation

Assistance determining if this is a stray docker #428

Closed tatarsky closed 8 years ago

tatarsky commented 8 years ago

I believe a Docker container on gpu-1-16 was not exited properly and may be causing errors as a result.

Before I kill it, I'd like to be 100% sure that it is a stray.

gpu-1-16: 5e23c810a988        corcra/tf-hal       "/bin/bash"         5 days ago          Up 5 days                               sharp_goldstine     

Can the owner please check it if they happen to monitor Git?

I show only this job on the node:

7566200.hal-sched1.loc  (somebody else)       batch    pj_2920ee78-479d  25401     1      1   16gb  96:00:00 R  00:01:28   gpu-1-16/0

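That job does not have the same PID as the process in the Docker container. The GPU process reported by nvidia-smi (PID 9598) appears in the container's process list instead: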

nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      9598    C   /usr/bin/python                               5890MiB |
+-----------------------------------------------------------------------------+

docker top 5e23c810a988
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                3670                4512                0                   Jul05               pts/1               00:00:00            /bin/bash
root                9598                3670                1                   Jul05               pts/1               02:19:24            /usr/bin/python /usr/local/bin/ipython
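
The PID comparison above can be automated. Below is a minimal sketch, assuming the text layouts of nvidia-smi and docker top shown in this thread (newer nvidia-smi releases print extra columns and would need a different pattern). The function names and the sample constants are hypothetical and for illustration only; this is not part of any cluster tooling.

```python
import re

# Sample output copied from this thread, used here only for illustration.
NVIDIA_SMI = """\
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      9598    C   /usr/bin/python                               5890MiB |
+-----------------------------------------------------------------------------+
"""

DOCKER_TOP = """\
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                3670                4512                0                   Jul05               pts/1               00:00:00            /bin/bash
root                9598                3670                1                   Jul05               pts/1               02:19:24            /usr/bin/python /usr/local/bin/ipython
"""


def gpu_pids(nvidia_smi_text):
    """Return the PIDs listed in the 'Processes' table of nvidia-smi output.

    Assumes the older table layout shown above: GPU index, PID, type, name.
    """
    pids = set()
    in_processes = False
    for line in nvidia_smi_text.splitlines():
        if "Processes:" in line:
            in_processes = True
            continue
        if not in_processes:
            continue
        # Data rows look like: |    0      9598    C   /usr/bin/python ... |
        match = re.match(r"\|\s+\d+\s+(\d+)\s+\S+\s+\S", line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def container_pids(docker_top_text):
    """Return the host PIDs listed in `docker top <container>` output."""
    lines = [l for l in docker_top_text.splitlines() if l.strip()]
    pid_col = lines[0].split().index("PID")  # locate the PID column by header
    return {int(l.split()[pid_col]) for l in lines[1:]}


def gpu_pids_in_container(nvidia_smi_text, docker_top_text):
    """Return GPU-holding PIDs that belong to the container, sorted."""
    return sorted(gpu_pids(nvidia_smi_text) & container_pids(docker_top_text))


if __name__ == "__main__":
    print(gpu_pids_in_container(NVIDIA_SMI, DOCKER_TOP))  # prints [9598]
```

A non-empty result means the container is holding GPU memory. Whether the container is actually a stray still has to be confirmed with the users who have jobs on the node.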

tatarsky commented 8 years ago

The user with jobs on the machine confirmed it was NOT theirs, so I'm killing it as a stray.