Closed pbellec closed 9 years ago
I wonder whether the crash is related to the fact that I erase the logs folder many times and then restart from the same logs folder. I need to reproduce the problem and then simply try a different logs folder.
I was able to reproduce the problem. This time the eqsub file explicitly blames NFS:
/var/spool/torque/mom_priv/jobs/226798.ms.m.SC: line 1: /home/bellecp1/tmp/test_psom/logs/PIPE_history.txt: Stale NFS file handle
Now it seems I cannot get the pipeline to run at all, and I get the original error message again. Something odd is definitely going on with NFS.
Now this is getting interesting. If I run a pipeline in a different folder, the problem does not appear. But if I keep deleting the logs folder and restart the crashed pipeline at the same location, the same error happens. The job stays in R status for a long time before I get the error. BUT if I remove only the PIPE.lock file and the workers folder (to kill all workers), the pipeline runs well. The problem seems to come from removing a folder and quickly re-creating it under the same name; the change may not propagate quickly over NFS. I am not sure how to work around this, but the problem seems quite easy to reproduce.
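The lighter reset described above (remove only PIPE.lock and the workers folder, leave the logs folder in place) could be sketched roughly as follows. Python is used purely for illustration, and the function name `reset_pipeline_state` is hypothetical; only the names PIPE.lock and workers come from this thread.

```python
import shutil
from pathlib import Path


def reset_pipeline_state(logs_dir):
    """Remove only PIPE.lock and the workers folder inside logs_dir.

    Hypothetical helper: the logs folder itself and every other file
    in it (e.g. PIPE_history.txt) are left untouched.
    Returns the names of the entries that were actually removed.
    """
    logs = Path(logs_dir)
    removed = []

    lock = logs / "PIPE.lock"
    if lock.exists():
        lock.unlink()
        removed.append(lock.name)

    workers = logs / "workers"
    if workers.is_dir():
        # Deleting the workers folder is what stops the workers in the
        # scenario described above; here we only delete the files.
        shutil.rmtree(workers)
        removed.append(workers.name)

    return removed
```

Calling it twice in a row is harmless: the second call finds nothing to remove and returns an empty list.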
At this stage it seems easy enough to remove what is inside the logs folder, rather than the logs folder itself, for a quick fresh restart. I'll check whether that addresses the problem, in which case I will not try to fix it any further.
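A minimal sketch of that workaround, assuming a plain recursive cleanup is acceptable: delete everything inside the folder but never the folder itself. The helper name `empty_folder` is hypothetical and Python is used only for illustration. Because the directory is never removed, it keeps the same inode, which is presumably what avoids the stale handle on NFS clients.

```python
import shutil
from pathlib import Path


def empty_folder(folder):
    """Delete every entry inside `folder` while keeping `folder` itself.

    Hypothetical helper. Returns the number of top-level entries removed.
    """
    folder = Path(folder)
    count = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectory: remove it and everything below it.
            shutil.rmtree(entry)
        else:
            # Regular file or symlink: remove just the entry.
            entry.unlink()
        count += 1
    return count
```

For the case in this thread, the call would be `empty_folder("/home/bellecp1/tmp/test_psom/logs")` before restarting the pipeline.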
Seems to work if the logs folder itself is not removed. Won't fix.
I get a weird crash on mammouth, where the manager does not even start. The eqsub file says:
/var/spool/torque/mom_priv/jobs/218623.ms.m.SC: line 1: /home/bellecp1/tmp/test_psom/logs/PIPE_history.txt: No such file or directory
touch: cannot touch `/home/bellecp1/tmp/test_psom/logs/PIPE.exit': No such file or directory
I suspect this has something to do with NFS. In any case, the fault tolerance features should handle the problem, but it would be sweet to ensure it does not happen in the first place.