SIMEXP / psom

pipeline system for octave and matlab
http://psom.simexp-lab.org

pipeline manager crash #36

Closed pbellec closed 9 years ago

pbellec commented 9 years ago

I get a weird crash on mammouth, where the manager does not even start. The eqsub file says:

/var/spool/torque/mom_priv/jobs/218623.ms.m.SC: line 1: /home/bellecp1/tmp/test_psom/logs/PIPE_history.txt: No such file or directory
touch: cannot touch `/home/bellecp1/tmp/test_psom/logs/PIPE.exit': No such file or directory

I suspect this has something to do with NFS. In any case, the fault tolerance features should solve the problem, but it would be sweet to ensure it does not happen in the first place.

pbellec commented 9 years ago

I wonder whether the crash is related to the fact that I erased the logs folder many times and restarted from the same logs folder each time. I would need to reproduce the problem and then simply try a different logs folder.

pbellec commented 9 years ago

Was able to reproduce. This time I have an explicit blame on NFS in my eqsub:

/var/spool/torque/mom_priv/jobs/226798.ms.m.SC: line 1: /home/bellecp1/tmp/test_psom/logs/PIPE_history.txt: Stale NFS file handle

pbellec commented 9 years ago

And now it seems I cannot get the pipeline to run at all, and I am back to the original error message. Definitely something funny going on with NFS.

pbellec commented 9 years ago

Now this is getting interesting. If I run a pipeline in a different folder, the problem does not appear. But if I keep deleting the logs folder and restart the crashed pipeline at the same location, the same error happens. The job stays in R status for a long time before I get the error. BUT if I remove only the PIPE.lock file and the workers folder (to kill all workers), the pipeline runs well. It looks like the problem has to do with removing and re-creating a folder under the same name in quick succession, which may not propagate fast enough over NFS. Not sure how this can be worked around, but the problem seems quite easy to reproduce.
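
For reference, the partial cleanup described in this comment could be sketched as follows. This is only an illustration in Python, not PSOM code: the function name `remove_lock_and_workers` is hypothetical, and the only names taken from this thread are `PIPE.lock` and the `workers` folder, both assumed to sit directly inside the logs folder.

```python
import shutil
from pathlib import Path


def remove_lock_and_workers(logs_dir):
    """Remove only PIPE.lock and the workers folder; keep logs_dir itself.

    Hypothetical helper, not part of PSOM. Returns the names it removed.
    """
    logs = Path(logs_dir)
    removed = []

    lock = logs / "PIPE.lock"  # assumed location of the lock file
    if lock.is_file():
        lock.unlink()
        removed.append(lock.name)

    workers = logs / "workers"  # assumed location of the workers folder
    if workers.is_dir():
        shutil.rmtree(workers)
        removed.append(workers.name)

    return removed
```

The logs folder is never deleted here, so its path keeps pointing at the same directory on the NFS server.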

pbellec commented 9 years ago

At this stage it seems easy enough to simply remove the contents of the logs folder, rather than the logs folder itself, for a quick fresh restart. I'll check whether that addresses the problem, in which case I will not try to fix it any further.
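
A minimal sketch of that idea, written in Python for illustration only (the function name `empty_directory` is hypothetical and this is not how PSOM implements it): every entry inside the folder is deleted, but the folder itself is never removed or re-created.

```python
import shutil
from pathlib import Path


def empty_directory(folder):
    """Delete everything inside `folder` but leave `folder` itself in place.

    Hypothetical helper, not part of PSOM.
    """
    for entry in Path(folder).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real subdirectory: remove recursively
        else:
            entry.unlink()  # regular file or symlink


# Example call, using the logs path from the error messages in this thread:
# empty_directory("/home/bellecp1/tmp/test_psom/logs")
```

Because the directory is kept, clients that already hold a handle to it should not end up with a handle to a deleted folder, which is the situation the "Stale NFS file handle" message points to.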

pbellec commented 9 years ago

Seems to work if the logs folder itself is not removed. Won't fix.