SIMEXP / psom

pipeline system for octave and matlab
http://psom.simexp-lab.org

jobs get stuck #71

Closed pbellec closed 9 years ago

pbellec commented 9 years ago

I had a problem yesterday with a couple of jobs running on guillimin. One of them, named 'psom3', ran on node sw-2r09-n66. It started at 4:13 am, ran normally for about two hours, then got stuck at around 5:52 am in the middle of an operation that usually completes without trouble. The process still responded to "kill -0" with exit status 0, so it still existed, but it made no further progress. It was finally killed around 7:12 am when it reached its wall time of 3 hours.

pbellec commented 9 years ago

I was not able to investigate further. It looks like a problem with the nodes on guillimin. It would be good to eventually have a mechanism to detect and kill idle jobs, but that may be too complicated a feature for psom. At this stage I won't fix this.
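
For illustration only, here is a minimal sketch of what such an idle-job check might look like. It is not part of psom and it is written in Python rather than Octave/Matlab. It assumes a POSIX system and assumes that each job writes regularly to a log file; the names `is_alive`, `seconds_since_update`, `is_stalled`, `log_path` and `max_idle_seconds` are all hypothetical. The idea is that signal 0 (the same check as "kill -0") only confirms the process exists, so a second signal, here the log's modification time, is needed to spot a job that is alive but no longer progressing.

```python
import os
import time


def is_alive(pid):
    """Return True if a process with this PID exists (same check as `kill -0`)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def seconds_since_update(path, now=None):
    """Seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(pid, log_path, max_idle_seconds, now=None):
    """Hypothetical rule: alive, but the log has been quiet for too long."""
    return is_alive(pid) and seconds_since_update(log_path, now) > max_idle_seconds
```

A rule like this would also flag a legitimate long computation that simply does not log anything, so the threshold would have to be chosen per job, which is part of why the feature may be hard to get right.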