Oshlack / MINTIE

Method for Identifying Novel Transcripts and Isoforms using Equivalence classes, in cancer and rare disease.
MIT License

SLURM submission limited to PBSpro installs #20

Open lesleygray opened 2 years ago

lesleygray commented 2 years ago

Hi Marek,

Thank you for sharing your wonderful pipeline! MINTIE is working very well in local mode submitted to our queue through an interactive job.

We have had problems with the cluster implementation, as Bpipe requires the qstat -x flag, which is only included in PBSpro. Our qstat is the one packaged with slurm-torque 18.08.4-1.el7.

Execution command:

nohup srun mintie -w -p params.txt cases/*.fastq.gz controls/*.fastq.gz &

This successfully submits 'fastq_dedup' to the queue as one job per sample.

Error: The pipeline hangs after successful completion of 'fastq_dedup'. The SLURM exit status is COMPLETE and the output fastq files are generated.

Outputs in .bpipe/bpipe.log:

bpipe.Utils [38]    INFO    |11:57:27 Executing command: qstat -x 5451428 
bpipe.executor.TorqueStatusMonitor  [38]    WARNING |11:57:27 Error occurred in processing torque output: java.lang.Exception: Error parsing torque output: unexpected error: Unknown option: x 
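For what it's worth, this is easy to reproduce by hand on the submit host (job id taken from the log above; the package query is just to confirm which qstat is being picked up):

which qstat
rpm -q slurm-torque
qstat -x 5451428    # fails with "Unknown option: x" on our install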

Environment: The MINTIE installation is version 0.3.9, installed via miniconda3/mamba. The package versions are in the yaml here: mintie.yml.txt

The BPIPE scheduling configuration is as follows:

executor="slurm"

//controls the total number of procs MINTIE can spawn
//if running locally, ensure that concurrency is not set
//to more than the number of procs available. If
//running on a cluster, this can be increased
concurrency=10

//following commands are for running on a cluster
walltime="5-20:00:00"
queue="bigmem"
mem_param="mem-per-cpu"
memory="30"
proc_mode=1
usePollerFileWatcher=true
useLegacyTorqueJobPolling=true
procs=10
account="grayl"

//add server-specific module to load
modules="miniconda3"

commands {

Thank you in advance for taking a look at this. Lesley

ssadedin commented 2 years ago

Hi @lesleygray - I see you've enabled the useLegacyTorqueJobPolling option, which is indeed intended for this scenario. It seems Bpipe is not respecting that flag. To help debug it, could you check the Bpipe logs? If it is recognising the flag, it should be printing a message like:

Using legacy torque status polling

Are you seeing that? If you can let me know it'll help a lot to figure out why it's not obeying it in your case.
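If grepping is easier than eyeballing the log, something along these lines should surface it, assuming the log is in the usual .bpipe/bpipe.log location:

grep -i "legacy torque" .bpipe/bpipe.log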

Thanks!

ssadedin commented 2 years ago

Oops, I just noticed you specified slurm as the executor, so I now realise this is definitely a bug: Bpipe should never be executing qstat when the executor is set to Slurm; rather, it should be querying jobs with scontrol or squeue etc.
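For context, when the slurm executor is in use the status check should look more like one of the following (illustrative only - not necessarily the exact calls Bpipe makes internally; the job id is the one from your log):

squeue -j 5451428 -h -o %T
scontrol show job 5451428 | grep -o 'JobState=[A-Z]*'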

Unfortunately I don't have a Slurm cluster to test with right now, but if you are willing to try it I can make a fix to this and give you an updated bpipe version to test it out. Let me know if you'd be up for that.

Thanks!

lesleygray commented 2 years ago

Sorry for the delay, Simon. No, I cannot see that message in the logs.

Here is a snippet:

...
bpipe.executor.CustomCommandExecutor    [40]    INFO    |11:49:38 Starting command: bash /data/Bioinfo/bioinfo-resources/apps/miniconda3/miniconda3-py39/envs/lesley_WGSEnv/envs/mintie/opt/bpipe-0.9.11/bin/../bin/bpipe-slurm.sh start
bpipe.executor.SlurmCommandExecutor     [40]    INFO    |11:49:38 Started command with id 5451428
bpipe.executor.TorqueCommandExecutor    [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out
bpipe.ForwardHost       [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.out using forwarder bpipe.Forwarder@7c31947e
bpipe.ForwardHost       [40]    INFO    |11:49:38 Forwarding file .bpipe/commandtmp/1/1.err using forwarder bpipe.Forwarder@58ec15f2
bpipe.PipelineContext   [40]    INFO    |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.1.fastq.gz
bpipe.PipelineContext   [40]    INFO    |11:49:38 Create storage layer bpipe.storage.LocalFileSystemStorageLayer for output SP-17-4474-1A/SP-17-4474-1A.2.fastq.gz
bpipe.executor.ThrottledDelegatingCommandExecutor       [40]    INFO    |11:49:38 Waiting for command to complete before releasing 2 resources
bpipe.executor.TorqueStatusMonitor      [40]    INFO    |11:49:38 Starting torque status monitor ...
bpipe.Utils     [38]    INFO    |11:49:39 Executing command: qstat -x 5451428
bpipe.executor.TorqueStatusMonitor      [38]    WARNING |11:49:39 Error occurred in processing torque output: java.lang.Exception: Error parsing torque output: unexpected error: Unknown option: x
...

I am certainly happy to do some testing, task away!

mcmero commented 2 years ago

I've also tried running a conda-installed MINTIE on a Slurm cluster and I'm getting the same issue. I also tested with all bpipe versions from 0.9.9.9 to 0.9.11. A manual installation may fix the problem; however, that is pretty fiddly.
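For reference, you can check which bpipe a given conda environment resolved with something like this (the env name is whatever your install used - mintie in the example below):

conda list -n mintie bpipe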

I'm also happy to try testing with a patched version @ssadedin.

ssadedin commented 2 years ago

Sorry all for taking a long while to follow up.

In the end I realised this problem is probably addressed by a fix that has been in the codebase for quite a while; it's just that the version of bpipe installed by default with MINTIE is a few years old.

@mcmero it would be great to validate whether the latest codebase in master works with MINTIE. If so, I will be releasing it officially as bpipe 0.9.12 shortly, and it would then be great to include it with MINTIE by default - what do you think?
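If you want to try it before a formal release, the rough idea is just to build bpipe from master and put that copy ahead of the conda-installed one on your PATH - something like the sketch below (check the bpipe README for the exact build steps; the paths are placeholders):

git clone https://github.com/ssadedin/bpipe.git
cd bpipe
# build per the instructions in the repo README, then:
export PATH=/path/to/your/bpipe/checkout/bin:$PATH
which bpipe    # confirm the newly built copy is picked up first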

Sorry again for taking ages to follow up!

lesleygray commented 2 years ago

Thanks for your response Simon. Marek, I am still happy to run some testing if needed.

mcmero commented 2 years ago

Thanks @ssadedin. Any chance you could send me a binary of the master build?

mcmero commented 2 years ago

@ssadedin I've managed to compile the latest bpipe successfully, but it's still giving the same qstat error. Any ideas?
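One thing worth ruling out is that the run is still picking up the old conda copy rather than the newly compiled one; the 'Starting command' line in .bpipe/bpipe.log shows the path of the bpipe actually being invoked, e.g.:

grep "Starting command" .bpipe/bpipe.log | head -1
which bpipe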

lonsbio commented 2 years ago

Coincidentally, I've come across this issue on something else unrelated to MINTIE. I've been debugging with an Ubuntu VM with Slurm installed, running as both head node and server, which seems to be enough to trigger the error when switching from the local to the slurm executor.

I've tried legacy polling too. The only clue for me is that, since the Slurm executor extends the Torque one, is the useLegacyJobPolling fix overriding the config file? By my understanding of the logic, don't we want it to be true here so that the legacy polling is used, not the new pooled polling?

class SlurmCommandExecutor extends TorqueCommandExecutor implements CommandExecutor {

    public static final long serialVersionUID = 0L

    /**
     * Constructor
     */
    SlurmCommandExecutor() {
        super(new File(System.getProperty("bpipe.home") + "/bin/bpipe-slurm.sh"))

        // The pooled status polling only works for PBS Torque
        this.useLegacyJobPolling = false
    }