Open tatarsky opened 8 years ago
docker run --name "$USER-$PBS_JOBID"
I love this idea!
I wonder if we can enforce it by having jobs that don't conform to this scheme automatically killed?
By the way, it looks like Torque/Moab 9.0 has Docker integration baked-in: http://docs.adaptivecomputing.com/9-0-0/docker/Content/topics/docker/1-overview/integration.htm
I'm unlikely to do any autokilling, given the time I have left in a primary role. But I will work the logic into a script I have, so that it treats such non-complying containers as killable.
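For whoever picks this up, here is a rough sketch of what the "non-complying" check could look like. It is not the script mentioned above. The function name nonconforming_names and the name pattern are assumptions for illustration: it treats anything that does not look like user-jobid as a candidate for manual review, and it only lists names, it does not kill anything.

```shell
#!/bin/bash
# Hypothetical helper: read container names on stdin (one per line) and
# print only those that do NOT look like "<user>-<jobid>".
# Assumed pattern: a user part, a hyphen, then a job ID starting with digits.
nonconforming_names() {
    grep -Ev '^[A-Za-z_][A-Za-z0-9_.-]*-[0-9]+[A-Za-z0-9_.-]*$' || true
}

# Usage on a node (lists running containers by name, then filters):
#   docker ps --format '{{.Names}}' | nonconforming_names
```

This is a heuristic only: a hand-picked name that happens to contain a hyphen followed by digits would pass, so the output is a review list rather than a kill list.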
+1 from ComPath for enforcing the rule at submission time. Maybe someone from Juan's group can implement your design.
Folks, I've been finding LOTS of orphaned dockers out on the nodes lately. Killing them off is somewhat automated, but being a very careful person I still manually review the process before I kill anything.
The main method I use is to try to trace each one to a job in progress on the node and look at docker top. Most of the orphans, BTW, seem to be sitting there running bash, and tensorflow is the main image in this state right now.
I think one very helpful item to confirm my method would be if people added the contents of the qsub $PBS_JOBID variable to their docker execution script via the --name argument to docker run. Aka:
docker run --name $PBS_JOBID something something
Or perhaps more elaborate:
docker run --name "$USER-$PBS_JOBID"
Just something besides the random generated name to help me determine the state of the orphan.
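A minimal sketch of how a job script might build that name. The function name container_name and the fallback values are made up for illustration. It also assumes Docker only accepts letters, digits, underscore, period and hyphen in container names, so any other character in the job ID (for example the brackets that array job IDs may contain) is replaced with an underscore.

```shell
#!/bin/bash
# Hypothetical helper: build a container name from $USER and $PBS_JOBID.
# Characters outside [a-zA-Z0-9_.-] are replaced with "_" (assumed to be
# the set Docker allows in container names).
container_name() {
    local raw="${USER:-unknown}-${PBS_JOBID:-nojob}"
    printf '%s\n' "${raw//[^a-zA-Z0-9_.-]/_}"
}

# Usage inside a job script (IMAGE and COMMAND are placeholders):
#   docker run --name "$(container_name)" IMAGE COMMAND
```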
It would be really helpful, assuming the root cause of this cannot be fixed or it's sporadic. I believe when we discussed this once, in some cases the signal to end the docker run was not making it all the way to the docker container due to the way it was run. I can dig out the old Git discussion.
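If the signal really is getting lost in the job's shell, one possible workaround is to trap it and stop the container by name. This is a sketch under that assumption, not a confirmed fix for the root cause. The function name run_with_cleanup and the DOCKER_BIN override (only there so the logic can be tried without Docker) are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: run a named container in the background, and if the
# job shell receives TERM or INT, ask Docker to stop that container.
run_with_cleanup() {
    local name="$1"; shift
    local docker_bin="${DOCKER_BIN:-docker}"
    _rwc_interrupted=0                      # global flag set by the trap
    trap '_rwc_interrupted=1' TERM INT

    # Run in the background so the shell can handle signals while waiting.
    "$docker_bin" run --rm --name "$name" "$@" &
    local pid=$!
    wait "$pid"
    local status=$?

    if [ "$_rwc_interrupted" -eq 1 ]; then
        # The wait was cut short by a signal: stop the container explicitly.
        "$docker_bin" stop "$name" >/dev/null 2>&1 || true
        wait "$pid"
        status=$?
    fi
    trap - TERM INT
    return "$status"
}

# Usage inside a job script (IMAGE and COMMAND are placeholders):
#   run_with_cleanup "$USER-$PBS_JOBID" IMAGE COMMAND
```

This relies on the --name convention above: without a known name the script has nothing to pass to docker stop.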