cBio / cbio-cluster

MSKCC cBio cluster documentation

RFC Adding Job ID to Docker runs #426

Open tatarsky opened 8 years ago

tatarsky commented 8 years ago

Folks, I've been finding LOTS of orphaned Docker containers out on the nodes lately. Killing them off is somewhat automated, but because I'm a very careful person I still manually review each process before I kill anything.

The main method I use is to try to trace each one to a job in progress on the node and look at docker top. Most of the orphans, BTW, seem to be sitting there running bash, and tensorflow is the main image in this state right now.

I think one very helpful way to confirm my method would be for people to pass the contents of the qsub $PBS_JOBID variable to docker run, via the --name argument, in their docker execution script.

Aka:

docker run --name $PBS_JOBID something something

Or perhaps more elaborate:

docker run --name "$USER-$PBS_JOBID"

Just something besides the randomly generated name to help me determine the state of the orphan.
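A minimal sketch of how a job script might build such a name. The helper `job_container_name` is hypothetical, not part of Docker or Torque. It assumes Docker's rule that container names may contain only letters, digits, `_`, `.` and `-`, and that some job IDs (array jobs, for example) can contain other characters such as brackets, which it replaces with `_`. The host name in the comment is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: build a Docker-safe container name from a user
# name and a job ID. Any character outside A-Z a-z 0-9 _ . - is
# replaced with "_" so that "docker run --name" accepts the result.
job_container_name() {
    printf '%s' "$1-$2" | tr -c 'A-Za-z0-9_.-' '_'
    echo
}

# Intended use inside a job script (image and command are placeholders):
#   name=$(job_container_name "$USER" "$PBS_JOBID")
#   docker run --name "$name" something something
#
# With USER=alice and PBS_JOBID=12345.head.example.org the container
# would be named alice-12345.head.example.org.
```

The name then shows up in the NAMES column of docker ps, so an orphan can be matched to its job without guessing.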

It would be really helpful, assuming the root cause of this cannot be fixed or it's sporadic. I believe when we discussed this once, in some cases the signal to end the docker run was not making it all the way to the container because of the way it was run. I can dig out the old Git discussion.
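One possible mitigation for the lost-signal case, sketched under assumptions: the job script receives SIGTERM when the job is ended, and the container was started with a known name. The function names `cleanup_container` and `run_with_cleanup` are hypothetical, and the `DOCKER` variable exists only so the logic can be exercised without a Docker daemon. This is not the fix from the earlier discussion, just an illustration of the idea.

```shell
#!/bin/bash
# Command used to talk to Docker; overridable for dry runs.
DOCKER="${DOCKER:-docker}"

# Stop the named container (SIGTERM, then SIGKILL after 10 s) and
# remove it. Errors are ignored because the container may be gone.
cleanup_container() {
    "$DOCKER" stop -t 10 "$1" >/dev/null 2>&1
    "$DOCKER" rm -f "$1" >/dev/null 2>&1
    return 0
}

# Run a named container and make sure a TERM/INT delivered to this
# script also stops the container.
run_with_cleanup() {
    local name="$1"
    shift
    trap "cleanup_container '$name'" TERM INT
    # Run in the background and wait: the shell only runs a trap once
    # the current foreground command finishes, but "wait" is
    # interrupted by the signal, so the trap fires promptly.
    "$DOCKER" run --rm --name "$name" "$@" &
    wait $!
    local status=$?
    trap - TERM INT
    return "$status"
}

# Intended use inside a job script (image and command are placeholders):
#   run_with_cleanup "$USER-$PBS_JOBID" something something
```

The explicit docker stop matters because the container is owned by the Docker daemon, not by the job script, so killing the script or the docker run client does not by itself guarantee the container exits.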

jchodera commented 8 years ago

docker run --name "$USER-$PBS_JOBID"

I love this idea!

I wonder if we can enforce it by having jobs that don't conform to this scheme automatically killed?

jchodera commented 8 years ago

By the way, it looks like Torque/Moab 9.0 has Docker integration baked-in: http://docs.adaptivecomputing.com/9-0-0/docker/Content/topics/docker/1-overview/integration.htm

tatarsky commented 8 years ago

I'm unlikely to do any autokilling given the time I have left in a primary role. But I will work the logic into a script I have so that it treats such non-complying items as killable.
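A rough sketch of what such triage logic could look like; it is an illustration only, not the script mentioned above. The function `classify_container` is hypothetical. It assumes container names follow the proposed `$PBS_JOBID` or `$USER-$PBS_JOBID` convention, that job IDs look like `<number>.<server>`, and that the caller already has the list of job IDs active on the node (how that list is obtained is left open).

```shell
#!/bin/bash
# classify_container NAME [ACTIVE_JOB_ID...]
# Prints one of:
#   active        name matches a job currently running on the node
#   orphan        name follows the convention but no such job is active
#   noncompliant  name does not look like it carries a job ID
classify_container() {
    local name="$1" active safe
    shift
    for active in "$@"; do
        # Docker names cannot contain characters such as brackets, so
        # compare against the job ID with those replaced by "_".
        safe=$(printf '%s' "$active" | tr -c 'A-Za-z0-9_.-' '_')
        case "$name" in
            "$safe"|*-"$safe") echo active; return 0 ;;
        esac
    done
    case "$name" in
        [0-9]*.*|*-[0-9]*.*) echo orphan ;;
        *) echo noncompliant ;;
    esac
}

# Intended use, with $active_jobs supplied by the caller:
#   docker ps --format '{{.Names}}' | while read -r n; do
#       echo "$n: $(classify_container "$n" $active_jobs)"
#   done
```

Printing a verdict instead of killing keeps the manual review step in place; only the "orphan" and "noncompliant" rows would need a closer look with docker top.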

polykrates commented 8 years ago

+1 from ComPath for enforcing the rule at submission time. Maybe someone from Juan's group can implement your design.
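For the submission-time idea, a minimal sketch of the check itself. The function `check_docker_names` is hypothetical, and how it would be hooked into job submission on this cluster is not specified here. It is a simple line-based scan, so a docker run command that puts --name on a continuation line would be flagged incorrectly.

```shell
#!/bin/bash
# Reads a job script on stdin. Prints every "docker run" line that
# lacks a --name argument and returns 1 if any were found, else 0.
check_docker_names() {
    local line bad=0
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            \#*) ;;                        # comments and #PBS directives
            *"docker run"*--name*) ;;      # already named
            *"docker run"*)
                printf '%s\n' "$line"
                bad=1
                ;;
        esac
    done
    return "$bad"
}

# Intended use (job.sh is a placeholder):
#   check_docker_names < job.sh || echo "add --name to the lines above"
```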