TheRoddyWMS / BatchEuphoria

A library to access different kinds of cluster backends
MIT License

BatchEuphoria does not handle full queues properly. #79

Closed: dankwart-de closed this issue 4 years ago

dankwart-de commented 6 years ago

If more jobs are submitted than allowed for the user, Roddy dies with a stack trace. However, this is a BE problem and BE should throw a proper exception, which could then be caught by clients.

A workflow error occurred, try to rollback / abort submitted jobs.
bkill 14843 14844 14848 14849 14850 14852 14856
An unknown / unhandled exception occurred: 'Could not parse raw ID from: 'Group : Pending job threshold reached. Retrying in 60 seconds...''
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.parseJobID(LSFJobManager.groovy:611)
de.dkfz.roddy.execution.jobs.BatchEuphoriaJobManager.extractAndSetJobResultFromExecutionResult(BatchEuphoriaJobManager.groovy:143)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager.runJob(LSFJobManager.groovy:148)
de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager$runJob.call(Unknown Source)
de.dkfz.roddy.execution.jobs.Job.run(Job.groovy:534)
de.dkfz.roddy.knowledge.methods.GenericMethod.createAndRunSingleJob(GenericMethod.groovy:507)
de.dkfz.roddy.knowledge.methods.GenericMethod._callGenericToolOrToolArray(GenericMethod.groovy:255)
de.dkfz.roddy.knowledge.methods.GenericMethod.callGenericTool(GenericMethod.groovy:50)
de.dkfz.b080.co.files.CoverageTextFile.plot(CoverageTextFile.java:49)
de.dkfz.b080.co.files.CoverageTextFileGroup.plot(CoverageTextFileGroup.java:47)
de.dkfz.b080.co.qcworkflow.QCPipeline.execute(QCPipeline.groovy:90)
de.dkfz.roddy.core.ExecutionContext.execute(ExecutionContext.groovy:625)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:397)
de.dkfz.roddy.core.Analysis.executeRun(Analysis.java:341)
de.dkfz.roddy.core.Analysis.rerun(Analysis.java:229)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.rerun(RoddyCLIClient.groovy:513)
de.dkfz.roddy.client.cliclient.RoddyCLIClient.parseStartupMode(RoddyCLIClient.groovy:116)
de.dkfz.roddy.Roddy.parseRoddyStartupModeAndRun(Roddy.java:721)
de.dkfz.roddy.Roddy.startup(Roddy.java:289)
de.dkfz.roddy.Roddy.main(Roddy.java:216)

The reason here is an interaction of Roddy and BE. It seems BE searches for a specific line in the output that is not found. Instead, the wait notice above is returned. When something like this happens on the command line upon manual submission, bsub blocks.
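As a rough illustration of the "proper exception" idea, a minimal Java sketch follows. The names JobSubmissionException and SubmissionOutputCheck are hypothetical and not part of BatchEuphoria's API, and the pattern assumes bsub reports success as "Job <ID> is submitted to queue <...>.".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical checked exception; not an existing BatchEuphoria class. */
class JobSubmissionException extends Exception {
    private final String rawOutput;

    JobSubmissionException(String message, String rawOutput) {
        super(message);
        this.rawOutput = rawOutput;
    }

    /** The unparsed submission output, kept so clients can log or inspect it. */
    String getRawOutput() {
        return rawOutput;
    }
}

/** Hypothetical helper that turns an unparsable output into a typed exception. */
final class SubmissionOutputCheck {
    // Assumption: a successful bsub call prints "Job <12345> is submitted ...".
    private static final Pattern SUBMITTED = Pattern.compile("Job <(\\d+)> is submitted");

    private SubmissionOutputCheck() {
    }

    static String requireJobId(String output) throws JobSubmissionException {
        Matcher matcher = SUBMITTED.matcher(output);
        if (!matcher.find()) {
            // Raise a dedicated exception instead of a generic runtime error,
            // so that a client such as Roddy can catch it and react.
            throw new JobSubmissionException(
                    "Could not find a job ID in the submission output", output);
        }
        return matcher.group(1);
    }
}
```

A client could then catch JobSubmissionException around the submission call and decide whether to retry, abort, or report the raw output to the user.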

vinjana commented 6 years ago

Maybe the submission actually waited, but only the first line was parsed (which contained the wait output). Changing the parser to parse the last line would then fix this.

vinjana commented 6 years ago

The issue will be closed as soon as a test exists (with multi-line input joined by '\n').
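To make the last-line idea and the multi-line test case concrete, here is a small Java sketch. It is not the actual LSFJobManager.parseJobID code; the class name LastLineJobIdParser is made up, and the success pattern "Job <ID> is submitted" is an assumption about the bsub output.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser that only looks at the last non-blank output line. */
final class LastLineJobIdParser {
    // Assumption: a successful bsub call prints "Job <12345> is submitted ...".
    private static final Pattern SUBMITTED = Pattern.compile("Job <(\\d+)> is submitted");

    private LastLineJobIdParser() {
    }

    static Optional<String> parseJobId(String rawOutput) {
        // "\\R" matches any line break, so "\n" and "\r\n" are both handled.
        String[] lines = rawOutput.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue; // skip trailing blank lines
            }
            // Only the last non-blank line is examined; earlier lines such as
            // a "Retrying in 60 seconds..." notice are ignored.
            Matcher matcher = SUBMITTED.matcher(line);
            return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
        }
        return Optional.empty();
    }
}
```

A test along the lines requested above would join the wait notice and a success line with '\n' and expect the ID from the second line, while the wait notice alone should yield an empty result.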

askask commented 5 years ago

With the environment variable LSB_NTRIES it is possible to set the number of attempts if the LSF server is not reachable or the maximum number of pending jobs is reached.

I would suggest setting this to 1 for all LSF commands.
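One possible way to apply this from JVM code is to set the variable on the child process environment. This is only a sketch using the standard ProcessBuilder API. The helper name LsfCommands is hypothetical, and how BE actually launches LSF commands is not shown in this thread.

```java
/** Hypothetical helper for building LSF command invocations. */
final class LsfCommands {
    private LsfCommands() {
    }

    static ProcessBuilder withSingleAttempt(String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Limit LSF to one attempt so the command fails fast instead of
        // blocking and printing retry notices.
        builder.environment().put("LSB_NTRIES", "1");
        return builder;
    }
}
```

Calling LsfCommands.withSingleAttempt("bsub", ...).start() would then run the command with LSB_NTRIES=1 while leaving the parent process environment unchanged.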