benedictpaten / jobTree

Python-based pipeline management software for clusters (but check out its successor, toil: https://github.com/BD2KGenomics/toil)
MIT License

entire jobTree crashes when a job's pickle file isn't written correctly #26

Open · joelarmstrong opened this issue 9 years ago

joelarmstrong commented 9 years ago

I just had several jobTrees fail (presumably due to a filesystem problem) with this error:

Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
    main()
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
    Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
    return mainLoop(config, batchSystem)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
    processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
    job = Job.read(jobFile)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
    job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected

I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
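
As a minimal sketch of that kind of resilience (the function name and error handling here are illustrative, not actual jobTree code), the master could treat a truncated or empty job file as an ordinary job failure instead of crashing:

    # Hedged sketch, not the actual jobTree code: tolerate a truncated or
    # empty job pickle instead of letting EOFError kill the whole master.
    import pickle

    def read_job_or_none(job_file):
        """Return the unpickled job, or None if the file is empty or truncated."""
        try:
            with open(job_file, "rb") as fh:
                return pickle.load(fh)
        except (EOFError, pickle.UnpicklingError):
            return None  # caller can reduce the retry count and reissue the job

The caller would then handle None the same way as any other failed job.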

diekhans commented 9 years ago

a) Atomic file creation will prevent a partial file from being present on error. b) Unless you are out of disk space [handled by (a)], we should not be having file system problems; the file system is very reliable.
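
For reference, a minimal sketch of atomic file creation on POSIX, assuming the usual write-to-temporary-then-rename pattern (names here are illustrative):

    # Write the new contents to a temporary file in the same directory, then
    # rename it over the final name; readers only ever see the old file or the
    # complete new one, never a partial write.
    import os
    import tempfile

    def atomic_write(path, data):
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
                fh.flush()
                os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
            os.rename(tmp_path, path)  # atomic on POSIX within one filesystem
        except Exception:
            os.unlink(tmp_path)
            raise

Running out of disk space then surfaces as a failed write or fsync on the temporary file, and the previous version of the file is left untouched.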

joelarmstrong commented 9 years ago

Didn't mean to imply that GPFS itself was truncating the file somehow -- just that there were issues with reads/writes blocking for long periods, which could easily have contributed to the process writing only a partial file and triggering a previously unnoticed race condition, or simply being killed at an inopportune time.

Atomic file creation is a great idea for this file; I think that's the correct fix here.

benedictpaten commented 9 years ago

We already try to do this; see:

https://github.com/benedictpaten/jobTree/blob/master/src/job.py#L43

joelarmstrong commented 9 years ago

Oh wow, hadn't looked at the code yet. That seems pretty foolproof. Could this have the same root cause as the issue we had with the cactus sequences file? I'm stumped.

benedictpaten commented 9 years ago

Actually, the code can be simplified. The ".updating" file was there because of how we used to do it; it can be removed, and we should be able to rely just on the move of the ".new" version to replace the previous version of the file. Doing this would be good, but I'd need to look through the rest of the code to see where we check for the existence of the ".updating" file.

If the filesystem itself went wrong I think all bets are off.
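
A minimal sketch of the simplification described above, assuming the rename alone becomes the commit point (function and file names are illustrative, not the actual job.py code):

    # Simplified scheme: no ".updating" marker, just write ".new" and rename.
    import os

    def write_job_file(job_file, serialized_job):
        new_file = job_file + ".new"
        with open(new_file, "wb") as fh:
            fh.write(serialized_job)
            fh.flush()
            os.fsync(fh.fileno())
        os.rename(new_file, job_file)  # the atomic replace is the commit point

    def recover_job_file(job_file):
        # A leftover ".new" found on recovery may be incomplete (the rename
        # never happened), so discard it; the previous job file is still intact.
        new_file = job_file + ".new"
        if os.path.exists(new_file):
            os.remove(new_file)

Since a crash at any moment leaves either the old job file or the complete new one in place, the leftover ".new" can simply be discarded and the job retried.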

tsackton commented 9 years ago

I have just encountered the same error in jobTree as part of a progressiveCactus run. The error message is below:

Batch system is reporting that the job (1, 4907) /n/home12/tsackton/ratite_scratch/wga/ratite_alignment/ratiteDir/jobTree/jobs/t1/t1/t1/t3/t3/t0/t0/t1/t1/job failed with exit value 1
There was a .new file for the job and no .updating file /n/home12/tsackton/ratite_scratch/wga/ratite_alignment/ratiteDir/jobTree/jobs/t1/t1/t1/t3/t3/t0/t0/t1/t1/job
Traceback (most recent call last):
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
    main()
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
    Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
    return mainLoop(config, batchSystem)
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
    processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
    job = Job.read(jobFile)
  File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
    job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected

It is possible that some filesystem issues are in play here, as our main shared filesystem had some recent instability, but at least in theory that was fixed more than 24 hours ago, and after I started the progressiveCactus run. I should also note that this is using the Harvard Informatics SLURM fork of jobTree (https://github.com/harvardinformatics/jobTree), but I think the relevant code here is unchanged.

Edited to add: the directory with the failed job identified above has a 0-length job file and a 0-length updating file in it, in case that is relevant.