Open · lesleygray opened this issue 2 years ago
Hi @lesleygray - I see you've enabled the `useLegacyTorqueJobPolling` option, which is indeed intended for this scenario, so it seems Bpipe is not respecting that flag. To help debug it, could you check the Bpipe logs? If Bpipe is recognising the flag, it should print a message like:

```
Using legacy torque status polling
```

Are you seeing that? If you can let me know, it will help a lot in figuring out why Bpipe is not obeying the flag in your case.
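For reference, the flag goes in your `bpipe.config`. A minimal sketch of a config with it enabled might look like the following (the queue name here is a placeholder, and the exact layout may differ from yours):

```groovy
// bpipe.config - illustrative sketch only, values are placeholders
executor="slurm"                 // submit jobs via the Slurm executor
queue="main"                     // placeholder partition/queue name
useLegacyTorqueJobPolling=true   // ask Bpipe to poll each job individually
```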
Thanks!
Oops, I just noticed you specified `slurm` as the executor, so I realise now this is definitely a bug: Bpipe should never be executing `qstat` when the executor is set to Slurm; rather, it should be querying jobs with `scontrol`, `squeue`, etc.
Unfortunately I don't have a Slurm cluster to test with right now, but if you are willing to try it, I can make a fix and give you an updated bpipe version to test it out. Let me know if you'd be up for that.
Thanks!
Sorry for the delay, Simon. No, I cannot see that message in the logs.
Here is a snippet:
```
...
bpipe.executor.CustomCommandExecutor [40] INFO |11:49:38 Starting command: bash /data/Bioinfo/bioinfo-resources/apps/miniconda3/miniconda3-py39/envs/lesley_WGSEnv/envs/mintie/opt/bpipe-0.9.11/bin/../bin/bpipe-slurm.sh start
bpipe.executor.SlurmCommandExecutor [40] INFO |11:49:38 Started command with id 5451428
bpipe.executor.TorqueCommandExecutor [40] INFO |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out
bpipe.ForwardHost [40] INFO |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out using forwarder bpipe.Forwarder@7c31947e
bpipe.ForwardHost [40] INFO |11:49:38 Forwarding file .bpipe/commandtmp/1/1.err using forwarder bpipe.Forwarder@58ec15f2
bpipe.PipelineContext [40] INFO |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.1.fastq.gz
bpipe.PipelineContext [40] INFO |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.2.fastq.gz
bpipe.executor.ThrottledDelegatingCommandExecutor [40] INFO |11:49:38 Waiting for command to complete before releasing 2 resources
bpipe.executor.TorqueStatusMonitor [40] INFO |11:49:38 Starting torque status monitor ...
bpipe.Utils [38] INFO |11:49:39 Executing command: qstat -x 5451428
bpipe.executor.TorqueStatusMonitor [38] WARNING |11:49:39 Error occurred in processing torque output: java.lang.Exception: Error parsing torque output: unexpected error: Unknown option: x
...
```
I am certainly happy to do some testing, task away!
I've also tried running a conda-installed MINTIE on a Slurm cluster and I'm getting the same issue. I also tested with all bpipe versions from 0.9.9.9 to 0.9.11. A manual installation may fix the problem; however, that is pretty fiddly.
I'm also happy to try testing a patched version @ssadedin.
Sorry all for taking a long while to follow up.
In the end I realised this problem is probably addressed by a fix that has been in the codebase for quite a while; it's just that the version of bpipe installed by default with MINTIE is a few years old.
@mcmero it would be great to validate whether the latest codebase in master works with MINTIE - if so, I will be releasing it officially as bpipe 0.9.12 shortly, and then it would be great to include it with MINTIE by default. What do you think?
Sorry again for taking ages to follow up!
Thanks for your response Simon. Marek, I am still happy to run some testing if needed.
Thanks @ssadedin. Any chance you could send me a binary of the master build?
@ssadedin I've managed to compile the latest bpipe successfully, but it's still giving the same qstat error. Any ideas?
Coincidentally, I've come across this issue on something else unrelated to MINTIE. I've been debugging with an Ubuntu VM with Slurm installed, running as both head node and server, which seems to be enough to trigger the error when switching from the `local` to the `slurm` executor.
I've tried legacy polling too, and the only clue I have is that, since the Slurm executor extends the Torque one, the `useLegacyJobPolling` assignment below may be overriding the config file. By my understanding of the logic, don't we want it to be `true` here, so that the legacy polling is used rather than the new pooled polling?
```groovy
class SlurmCommandExecutor extends TorqueCommandExecutor implements CommandExecutor {

    public static final long serialVersionUID = 0L

    /**
     * Constructor
     */
    SlurmCommandExecutor() {
        super(new File(System.getProperty("bpipe.home") + "/bin/bpipe-slurm.sh"))

        // The pooled status polling only works for PBS Torque
        this.useLegacyJobPolling = false
    }
```
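For what it's worth, here is a minimal sketch of the change I have in mind, assuming legacy polling is the right default for Slurm (hypothetical and untested - there may be a reason the current value is needed elsewhere):

```groovy
SlurmCommandExecutor() {
    super(new File(System.getProperty("bpipe.home") + "/bin/bpipe-slurm.sh"))

    // Assumption: the pooled status polling only works for PBS Torque,
    // so fall back to legacy per-job polling rather than disabling polling config
    this.useLegacyJobPolling = true
}
```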
Hi Marek,
Thank you for sharing your wonderful pipeline! MINTIE is working very well in local mode submitted to our queue through an interactive job.
We have had problems with the cluster implementation, as Bpipe requires the `qstat -x` flag, which is included in PBS Pro. Our `qstat` install is the one packaged with slurm-torque 18.08.4-1.el7.
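In case it helps with reproducing this, a quick probe (hypothetical, not part of MINTIE or Bpipe) to check whether a given `qstat` accepts the `-x` flag could look like:

```groovy
// Hypothetical probe: does the local qstat accept -x?
// slurm-torque's qstat wrapper rejects it ("Unknown option: x"),
// while PBS Pro / recent Torque versions accept it.
def proc = ["qstat", "-x"].execute()
proc.waitFor()
if (proc.exitValue() == 0)
    println "qstat supports -x"
else
    println "qstat rejects -x: ${proc.err.text.trim()}"
```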
**Execution Command**

```
nohup srun mintie -w -p params.txt cases/*.fastq.gz controls/*.fastq.gz &
```
This successfully submits 'fastq_dedup' to the queue as one job per sample.

**Error**

The pipeline hangs after successful completion of 'fastq_dedup'. The SLURM exit status is COMPLETE and the output fastq files are generated.
Outputs in .bpipe/bpipe.log:
**Environment**

The MINTIE installation is version 0.3.9, installed via miniconda3/mamba. The package versions are in the yaml here: mintie.yml.txt
The BPIPE scheduling configuration is as follows:
Thank you in advance for taking a look at this. Lesley