TheRoddyWMS / Roddy

The Roddy workflow development and management system.
http://roddy-documentation.readthedocs.io
MIT License

Roddy dies with stack trace if submission/queue limit of LSF cluster is reached #161

Closed vinjana closed 7 years ago

vinjana commented 7 years ago

If more jobs are submitted than allowed for the user, Roddy dies with a stack trace:

A workflow error occurred, try to rollback / abort submitted jobs.
bkill 14843 14844 14848 14849 14850 14852 14856
An unknown / unhandled exception occurred: 'Could not parse raw ID from: 'Group <resUsers>: Pending job threshold reached. Retrying in 60 seconds...''
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.parseJobID(LSFJobManager.groovy:611)
de.dkfz.roddy.execution.jobs.BatchEuphoriaJobManager.extractAndSetJobResultFromExecutionResult(BatchEuphoriaJobManager.groovy:143)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.runJob(LSFJobManager.groovy:148)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$runJob.call(Unknown Source)
de.dkfz.roddy.execution.jobs.Job.run(Job.groovy:534)
de.dkfz.roddy.knowledge.methods.GenericMethod.createAndRunSingleJob(GenericMethod.groovy:507)
de.dkfz.roddy.knowledge.methods.GenericMethod._callGenericToolOrToolArray(GenericMethod.groovy:255)
de.dkfz.roddy.knowledge.methods.GenericMethod.callGenericTool(GenericMethod.groovy:50)
de.dkfz.b080.co.files.CoverageTextFile.plot(CoverageTextFile.java:49)
de.dkfz.b080.co.files.CoverageTextFileGroup.plot(CoverageTextFileGroup.java:47)
de.dkfz.b080.co.qcworkflow.QCPipeline.execute(QCPipeline.groovy:90)
de.dkfz.roddy.core.ExecutionContext.execute(ExecutionContext.groovy:625)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:397)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:341)
de.dkfz.roddy.core.Analysis.rerun(Analysis.java:229)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.rerun(RoddyCLIClient.groovy:513)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.parseStartupMode(RoddyCLIClient.groovy:116)
de.dkfz.roddy.Roddy.parseRoddyStartupModeAndRun(Roddy.java:721)
de.dkfz.roddy.Roddy.startup(Roddy.java:289)
de.dkfz.roddy.Roddy.main(Roddy.java:216)

The cause is an interaction between Roddy and BatchEuphoria (BE). BE apparently searches the bsub output for a specific line containing the job ID. That line is not found, because bsub returns the wait notice shown above instead. When something like this happens during manual submission on the command line, bsub blocks.
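To illustrate the parsing problem, here is a minimal Python sketch (not the actual BE code, which is Groovy). It assumes a successful bsub reply looks like `Job <14843> is submitted to queue <normal>.` and takes the wait notice from the error message above. The function name `classify_bsub_output` and the result labels are made up for this example.

```python
import re

# Assumed shape of a successful bsub reply, e.g.
# "Job <14843> is submitted to queue <normal>."
JOB_ID_LINE = re.compile(r"^Job <(\d+)> is submitted to\b")
# Wait notice as it appears in the error message of this report.
THRESHOLD_NOTICE = re.compile(r"Pending job threshold reached")


def classify_bsub_output(lines):
    """Return ("submitted", job_id), ("throttled", notice) or ("unknown", None).

    Hypothetical helper: scans every line instead of assuming that the
    job-ID line comes first.
    """
    notice = None
    for line in lines:
        line = line.strip()
        match = JOB_ID_LINE.match(line)
        if match:
            # A job ID wins even if a wait notice was printed before it.
            return ("submitted", match.group(1))
        if notice is None and THRESHOLD_NOTICE.search(line):
            notice = line
    if notice is not None:
        return ("throttled", notice)
    return ("unknown", None)
```

With this kind of classification the wait notice becomes a recognised state of its own instead of ending up in the generic "Could not parse raw ID" path.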

dankwart-de commented 7 years ago

That should actually be caught, and the jobs should be aborted. Why does that not happen?

In reply to the original report, sent by Philip Reiner Kensche (notifications@github.com) on 27 November 2017 at 1:52:49 PM.


vinjana commented 7 years ago

The jobs are aborted (bkill). The primary problem is not the handling of the job failure (although that is also a problem), but that the error occurs at all. Roddy needs to be able to handle full submission queues: either by blocking and waiting, by exiting with rollback (as now, but without the stack trace), or by exiting without rollback.
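The three options could be sketched roughly like this in Python. This is only an illustration of the control flow, not Roddy code. `QueueFullPolicy`, `submit_with_policy` and the callbacks `submit` and `kill_jobs` are hypothetical names; `submit` is assumed to return a job ID, or `None` when the queue is full.

```python
import time
from enum import Enum


class QueueFullPolicy(Enum):
    WAIT = "wait"          # block and retry until the queue has room
    ROLLBACK = "rollback"  # kill the jobs submitted so far, then stop
    EXIT = "exit"          # stop, but leave submitted jobs untouched


def submit_with_policy(submit, policy, submitted_ids, kill_jobs,
                       sleep=time.sleep, retry_seconds=60, max_retries=5):
    """Return the new job ID, or None if the workflow should stop.

    submit() is assumed to return a job ID string, or None when the
    scheduler reports that the submission limit is reached.
    """
    retries = 0
    while True:
        job_id = submit()
        if job_id is not None:
            submitted_ids.append(job_id)
            return job_id
        if policy is QueueFullPolicy.WAIT and retries < max_retries:
            retries += 1
            sleep(retry_seconds)
            continue
        if policy is QueueFullPolicy.ROLLBACK:
            # Corresponds to the bkill call in the log above.
            kill_jobs(list(submitted_ids))
            submitted_ids.clear()
        # EXIT, or WAIT with all retries used up: stop without rollback.
        return None
```

In all three cases the caller gets a plain result back and can print a short message, so no stack trace is needed.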

dankwart-de commented 7 years ago

Is it really a Roddy problem? In the end yes, but the error should first be caught by the LSF job manager, right? That is where the error occurs.

dankwart-de commented 7 years ago

I moved it to BE and am closing it here. BE will need to throw a proper exception, which Roddy then needs to catch.
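As a rough picture of that split, here is a Python sketch (the real code is Groovy/Java, and all names here are hypothetical): the job-manager side raises a specific exception for the wait notice, and the workflow side catches it and reports one line instead of a stack trace.

```python
import re


class SubmissionLimitError(Exception):
    """Hypothetical exception for 'the scheduler's submission limit is reached'."""

    def __init__(self, notice):
        super().__init__(f"Cluster submission limit reached: {notice}")
        self.notice = notice


def parse_job_id(raw_output):
    """Job-manager side: return the job ID or raise a specific error."""
    if "Pending job threshold reached" in raw_output:
        raise SubmissionLimitError(raw_output.strip())
    match = re.search(r"Job <(\d+)>", raw_output)
    if match is None:
        # Anything else is still a genuine parse failure.
        raise ValueError(f"Could not parse raw ID from: {raw_output!r}")
    return match.group(1)


def run_workflow(raw_outputs):
    """Workflow side: return (exit_code, message) without a stack trace."""
    submitted = []
    try:
        for raw in raw_outputs:
            submitted.append(parse_job_id(raw))
    except SubmissionLimitError as err:
        # Known condition: report it in one line; rollback would go here.
        return 1, f"{err} ({len(submitted)} job(s) submitted before the limit)"
    return 0, f"Submitted {len(submitted)} job(s)"
```

Only the known condition is caught. An unexpected output still surfaces as a `ValueError`, so real parse bugs are not hidden.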