kevinrue opened 1 week ago
Cc @DongzeHE @jamshed. In general, if you are seeing long runtimes on your HPC, it's likely related to filesystem interactions with networked disks. Specifically, you should make sure that you are executing the indexing command and writing the resulting index to a local disk (e.g. scratch or tmp), then copying the index over to a result directory. The indexing procedure creates many small intermediate files (which we are looking to address), but this really messes with networked file systems, so you should make sure index construction doesn't happen on or write to networked disks.
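A minimal sketch of that "build on local disk, copy back" pattern, assuming a node-local scratch directory. `build_index` is a stand-in for the `simpleaf index` invocation (the real command would go in its place), and all paths are illustrative:

```shell
# Build-on-local-disk, copy-back pattern (sketch; paths illustrative).
set -euo pipefail

resultdir=$(mktemp -d)   # stand-in for the networked result directory
workdir=$(mktemp -d)     # stand-in for node-local scratch, e.g. Slurm's $TMPDIR

# Stand-in for `simpleaf index --output "$1" ...`; the real command writes
# many small intermediate files, which is why it should run on local disk.
build_index() {
    mkdir -p "$1"
    printf 'placeholder index\n' > "$1/index.bin"
}

cd "$workdir"
build_index local_index               # intermediates and index land on local disk
cp -r local_index "$resultdir/index"  # single sequential copy to networked storage
ls "$resultdir/index"
```

The key point is that every small write happens on local disk; the networked filesystem only sees one bulk copy at the end.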
Thanks! I'll check with our IT team how I might be able to optimise this. Feel free to close the issue. I might report back here with an update on improvement and - if applicable - advice to others.
Actually, quick follow up:
When you mention intermediate files, do you refer to those under the directory `workdir.noindex`?
If so, does this directory automatically show up in the working directory? Is there a way to make it appear elsewhere? I don't see any related argument in https://simpleaf.readthedocs.io/en/latest/index-command.html
You only mention writing the resulting index file to a local disk, but it sounds like those temporary files are good candidates too.
Thanks!
Yes, those temporary files are the main offenders (more than the index itself). They appear in the execution directory. We do have a flag to set the work directory, but it is not exposed in simpleaf yet (it's on the dev branch and will be in the next release, but we are waiting on one or two other features for the next release).
Right, so rather than just the index file, I could make my job change directory to `$TMPDIR` and run the command from there. I'll try, but I'm definitely looking forward to future versions taking care of that automagically :)
Pardon the convoluted Snakemake, but here goes:
```
rule alevin_build_reference_index:
    input:
        genome="resources/genome/genome.fa.gz",
        gtf="resources/genome/genome.gtf.gz",
    output:
        index=directory("resources/genome/index/alevin"),
    log:
        out="logs/alevin/build_reference_index.out",
        err="logs/alevin/build_reference_index.err",
    threads: 16
    resources:
        mem="8G",
        runtime="1h",
    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa &&"
        " simpleaf index"
        " --output $jobdir/{output.index}"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads {threads}"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err}"
```
In short: changing directory to `$TMPDIR` (a local folder given to each Slurm job on our HPC) and running the `simpleaf index` command from there, so that `workdir.noindex/` and all those small intermediate files get created in a local tempdir.
Seems to have shortened the job from 3 hours down to 50 min.
Ah, and another attempt just completed where I edited the rule above to also produce the output directory in the local tempdir before copying it back to the network drive: still 50 min.
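A hedged sketch of what that change to the rule's `shell` section could look like (the local directory name `local_index` is illustrative, not a simpleaf convention):

```
    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa &&"
        " simpleaf index"
        " --output local_index"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads {threads}"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err} &&"
        " cp -r local_index $jobdir/{output.index}"
```

The only differences from the earlier rule are `--output` pointing at the local tempdir and the final `cp -r` back to the networked `{output.index}` directory.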
Not sure if there's anything else going on that makes Drosophila somehow slower to process than human. I hear there are some overlapping transcripts that might be a source of trouble.
Cross-posting from https://www.reddit.com/r/bioinformatics/comments/1g6zfu6/simpleaf_index_long_runtime/
Is there some guidance about the expected runtime of `simpleaf index` anywhere? The post above reports a 20 min runtime for human using 16 CPUs. In my current situation, Drosophila has a genome of approx. 180 Mb, and my HPC job with 16 CPUs timed out after an hour.
PS: my command is

```
simpleaf index --output resources/genome/index/alevin --fasta tmp_alevin_index.fa --gtf resources/genome/genome.gtf.gz --rlen 150 --threads 16 --use-piscem
```
In particular, I've set `--rlen 150` based on the length of my scRNA-seq reads. Is that alright? Thanks!