biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

species_separator run as Slurm sbatch job on HPC exits prematurely #105

Open rbatorsky opened 4 years ago

rbatorsky commented 4 years ago

Hello and thanks for the great software! I run Sargasso on an HPC cluster that uses the Slurm job scheduler. When I submit a batch job with sbatch, the job exits prematurely; when I run in interactive mode, it completes.

The code I am running looks like this:

SARG_TEST=/cluster/tufts/bio/data/Sargasso/pipeline_test/
HUMAN_INDEX=/cluster/tufts/bio/data/genomes/HomoSapiens/UCSC/hg38/Sequence/STAR/
MOUSE_INDEX=/cluster/tufts/bio/data/genomes/Mus_musculus/UCSC/mm10/Sequence/STAR/

out=test_results_sbatch
species_separator rnaseq \
    --reads-base-dir=$SARG_TEST/data/fastq/ \
    --best --run-separation \
    $SARG_TEST/data/rnaseq.tsv $out \
    human $HUMAN_INDEX \
    mouse $MOUSE_INDEX
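
For reference, I submit this with an sbatch wrapper along these lines (the #SBATCH directives and resource values shown here are illustrative, not my exact settings):

#!/bin/bash
#SBATCH --job-name=sargasso-test   # illustrative job name
#SBATCH --time=08:00:00            # wall-time limit is a guess
#SBATCH --cpus-per-task=4          # resource values are illustrative

# ... species_separator invocation as above ...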

When the job exits prematurely, the stdout shows that it stops after the collate_raw_reads step. The log looks like this:

mkdir -p mapper_indexes
ln -s /cluster/tufts/bio/data/genomes/HomoSapiens/UCSC/hg38/Sequence/STAR/ mapper_indexes/human
mkdir -p mapper_indexes
ln -s /cluster/tufts/bio/data/genomes/Mus_musculus/UCSC/mm10/Sequence/STAR/ mapper_indexes/mouse
# Create a directory with sub-directories for each sample, each of which
# contains links to the input raw reads files for that sample
mkdir -p raw_reads
collate_raw_reads "our_sample" /cluster/tufts/bio/data/Sargasso/pipeline_test//data/fastq/ raw_reads paired "mouse_rat_test_1.fastq.gz" "mouse_rat_test_2.fastq.gz"

Right now I am working around this by putting a sleep command at the end of the script to keep the job alive (sketched below). This isn't ideal, because I don't know in advance how long the run will take and I want to be mindful of cluster resources. Do you have any idea why the job exits prematurely in batch mode, or how I can keep it going until it is done?
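
The workaround just appends a sleep to the end of the batch script above (the duration here is an arbitrary guess, which is exactly the problem):

# same species_separator invocation as above, then an arbitrary sleep
# to keep the Slurm job alive while the separation finishes
sleep 8h   # duration is a guess: too short truncates the run, too long wastes resources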

Thanks again. Rebecca

lweasel commented 4 years ago

Hi Rebecca,

Many thanks for trying out the software! I haven't actually run Sargasso using a job scheduler before, so it would be really helpful for us to get it working in this new situation.

Can I just confirm that when you run in interactive mode and the tool completes, all the expected output files are present, as described here - e.g. that the filtered_reads directory contains a BAM file for each sample and species?
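
For example, something like the following (the file names here are hypothetical, just to illustrate the expectation of one BAM per sample and species):

ls test_results_sbatch/filtered_reads
# hypothetical listing for the test data above:
#   our_sample_human.bam
#   our_sample_mouse.bam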

It's strange that when running in batch mode the tool exits prematurely, but my first guess would be that it may be an interaction between the scheduler and the particular way in which Sargasso executes its commands. Basically the main python code (the "species_separator" invocation) writes a Makefile into the output directory, and then opens a subprocess to execute "make" using that Makefile. Subsequently, the "make" execution will call various other bash scripts and Python code as it works through the Makefile.

My initial guess is that because the main Python code exits once it has initiated the execution of "make", the job scheduler notices this, concludes that all of the work is finished, and terminates the job while the Makefile is still being executed. That could explain why adding a "sleep" allows you to work around this.
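
As a rough shell analogy (this is not Sargasso's actual code, just an illustration of the hypothesis):

#!/bin/bash
# the foreground process launches the real work in the background...
make -f output_dir/Makefile &
# ...and then exits immediately; the scheduler may now consider the job
# complete and tear it down while make is still running
exit 0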

I think we can test this in the following way. If you remove the "--run-separation" flag from the "species_separator" invocation, the Makefile will be written but not executed. You can then run that Makefile yourself with "make" as a subsequent step in the script you submit to the job scheduler (see the sketch below). In terms of Sargasso's operation, this behaves exactly as if the Makefile had been executed automatically via "--run-separation", but the job scheduler would now know about the make invocation, and so would hopefully not terminate the job prematurely. Would it be possible to try that as a test?
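
Concretely, the modified batch script might look like this (reusing the variables from your script above; this is an untested sketch on my side):

out=test_results_sbatch
species_separator rnaseq \
    --reads-base-dir=$SARG_TEST/data/fastq/ \
    --best \
    $SARG_TEST/data/rnaseq.tsv $out \
    human $HUMAN_INDEX \
    mouse $MOUSE_INDEX
# without --run-separation, the Makefile is written but not executed;
# running make directly makes it a process the scheduler knows about
cd $out
make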

Best regards, Owen

rbatorsky commented 4 years ago

Hi and thanks for the helpful response. That does fix the problem, although I don't really understand why. I added these lines to my script, and it now completes without the sleep command, with all the expected steps and BAM files:

cd $OUTDIR
make

I'll ask our sys admins if they have any insight into the behavior.

Thanks! Rebecca

lweasel commented 4 years ago

That's great that it works now.

My hunch is that Slurm is not aware of the subprocess that Sargasso starts in order to run make, and so terminates the whole batch job as soon as the main Python process finishes (because it thinks that everything is done), but as you suggest, your sys admins may have better insight into that! If they do have any clues it would be brilliant to know, and then I'll update the documentation to include this tip for running the tool under a batch scheduler.

Many thanks, Owen