ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

qstat: Pbs Server is currently too busy to service this request. Please retry this request. #559

Open lnyawen opened 3 years ago

lnyawen commented 3 years ago

Hello,

I ran into a problem after running the program for two days. At first the output looked normal, for example:

[2021-08-04T04:31:02+0800] [MainThread] [I] [toil.leader] 10 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T05:31:04+0800] [MainThread] [I] [toil.leader] 10 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T06:31:05+0800] [MainThread] [I] [toil.leader] 10 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T07:31:07+0800] [MainThread] [I] [toil.leader] 9 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T08:31:08+0800] [MainThread] [I] [toil.leader] 9 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T09:31:09+0800] [MainThread] [I] [toil.leader] 9 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T10:31:10+0800] [MainThread] [I] [toil.leader] 9 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T11:31:11+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T12:31:11+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T13:31:12+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T14:31:12+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T15:31:13+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T16:31:14+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T17:31:15+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T18:31:15+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T19:31:16+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T20:31:18+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T21:31:19+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T22:31:19+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-04T23:31:20+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run
[2021-08-05T00:31:21+0800] [MainThread] [I] [toil.leader] 8 jobs are running, 0 jobs are issued and waiting to run

However, after two days of running, I started getting errors like this:

qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601294.mu01
[2021-08-05T00:43:11+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Will retry errored operation getJobExitCode, code 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601294.mu01

qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601294.mu01
[2021-08-05T00:44:29+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Will retry errored operation getJobExitCode, code 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601294.mu01

qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01
[2021-08-05T00:44:30+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Will retry errored operation getJobExitCode, code 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01

qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01
[2021-08-05T00:44:31+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Will retry errored operation getJobExitCode, code 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01

qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01
[2021-08-05T00:44:33+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Failed operation getJobExitCode, code 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01

[2021-08-05T00:44:33+0800] [Thread-2  ] [E] [toil.batchSystems.abstractGridEngineBatchSystem] GridEngine like batch system failure
Traceback (most recent call last):
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 222, in run
    while self._runStep():
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 212, in _runStep
    activity |= self.checkOnJobs()
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 187, in checkOnJobs
    status = self.boss.with_retries(self.getJobExitCode, batchJobID)
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 435, in with_retries
    raise err
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 426, in with_retries
    return operation(*args, **kwargs)
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/batchSystems/torque.py", line 130, in getJobExitCode
    stdout = call_command(args)
  File "/gpfs/home/liunyw/dragon_cactus/soft/cactus-bin-v2.0.3/venv/lib/python3.6/site-packages/toil/lib/misc.py", line 67, in call_command
    raise CalledProcessErrorStderr(proc.returncode, cmd, output=stdout, stderr=stderr)
toil.lib.misc.CalledProcessErrorStderr: Command '['qstat', '-f', '601296']' exit status 30: qstat: Pbs Server is currently too busy to service this request. Please retry this request. 601296.mu01

The command I used is:

cactus-prepare-toil  ./jobstore Dragonfly.seqFile \
  --binariesMode local \
  --batchSystem torque \
  --realTimeLogging \
  --outDir /pwd/outdir \
  --workDir /pwd/workdir \
  --outHal /pwd/Dragonfly.hal \
  --disableAutoDeployment \
  --disableCaching \
  --maxNodes 16 \
  --minNodes 1 \
  --preprocessCores 28 \
  --blastCores 28 \
  --alignCores 28 \
  --preprocessMemory 200G \
  --blastMemory 200G \
  --alignMemory 200G \
  --preprocessDisk 2000G \
  --blastDisk 2000G \
  --alignDisk 2000G \
  --halAppendDisk 2000G \
  --preprocessPreemptible 28 \
  --blastPreemptible 28 \
  --alignPreemptible 28 \
  --halAppendPreemptible 28 \
  --stats

Could you give me some advice? Thanks

Best, Yawen

lnyawen commented 3 years ago

Hello,

I found that the problem is related to my PBS system. I added --restart to the run script, and cactus is running again.
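
For anyone hitting the same message: in the log above, the retries of `qstat -f` are only about a second apart, so a scheduler that stays busy for longer than that will exhaust them. Below is a minimal sketch of retrying a scheduler query with exponential backoff. It is not Toil's implementation; `call_with_backoff` and its parameters are hypothetical names chosen for illustration, and the delay values are assumptions that would need tuning for a given cluster.

```python
import subprocess
import time


def call_with_backoff(cmd, attempts=5, base_delay=1.0, max_delay=60.0,
                      runner=subprocess.run, sleep=time.sleep):
    """Run cmd and return its stdout, retrying on a non-zero exit status.

    The wait between attempts doubles each time, capped at max_delay.
    `runner` and `sleep` are injectable so the logic can be tested
    without a real scheduler (hypothetical design, not Toil's API).
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        result = runner(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
        if result.returncode == 0:
            return result.stdout
        if attempt == attempts:
            # Out of attempts: surface the last failure to the caller.
            raise subprocess.CalledProcessError(
                result.returncode, cmd,
                output=result.stdout, stderr=result.stderr)
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Example use (requires qstat on PATH; the job ID is taken from the log above):
# print(call_with_backoff(["qstat", "-f", "601296"]))
```

With the defaults, a failing command is tried five times with waits of 1, 2, 4 and 8 seconds in between, which gives a briefly overloaded PBS server more room to recover than back-to-back retries do.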