kevinrue opened 1 week ago
Cc @DongzeHE @jamshed. In general, if you are seeing long runtimes on your HPC, it's likely related to filesystem interactions with networked disks. Specifically, you should make sure that you are executing the indexing command and writing the resulting index to a local disk (e.g. scratch or tmp), then copying the index over to a result directory. The indexing procedure creates many small intermediate files (which we are looking to address), but this really messes with networked file systems, so you should make sure index construction doesn't happen on or write to networked disks.
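A minimal sketch of that "build on local disk, copy back" pattern, assuming a node-local scratch directory. `build_index` is a stand-in for the `simpleaf index` invocation (the real command would go in its place), and all paths are illustrative:

```shell
# Build-on-local-disk, copy-back pattern (sketch; paths illustrative).
set -euo pipefail

resultdir=$(mktemp -d)   # stand-in for the networked result directory
workdir=$(mktemp -d)     # stand-in for node-local scratch, e.g. Slurm's $TMPDIR

# Stand-in for `simpleaf index --output "$1" ...`; the real command writes
# many small intermediate files, which is why it should run on local disk.
build_index() {
    mkdir -p "$1"
    printf 'placeholder index\n' > "$1/index.bin"
}

cd "$workdir"
build_index local_index               # intermediates and index land on local disk
cp -r local_index "$resultdir/index"  # single sequential copy to networked storage
ls "$resultdir/index"
```

The key point is that every small write happens on local disk; the networked filesystem only sees one bulk copy at the end.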
Thanks! I'll check with our IT team how I might be able to optimise this. Feel free to close the issue. I might report back here with an update on improvement and - if applicable - advice to others.
Actually, quick follow up:
When you mention intermediate files, do you refer to those under the directory `workdir.noindex`?
If so, does this directory automatically show up in the working directory? Is there a way to make it appear elsewhere? I don't see any related argument in https://simpleaf.readthedocs.io/en/latest/index-command.html
You only mention writing the resulting index file to a local disk, but it sounds like those temporary files are good candidates too.
Thanks!
Yes, those temporary files are the main offenders (more than the index itself). They appear in the execution directory. We do have a flag to set the work directory, but it is not exposed in simpleaf yet (it's on the dev branch and will be in the next release, but we are waiting on one or two other features for the next release).
Right, so rather than just the index file, I could make my job change directory to `$TMPDIR` and run the command from there. I'll try, but I'm definitely looking forward to future versions taking care of that automagically :)
Pardon the convoluted Snakemake, but here goes:
```
rule alevin_build_reference_index:
    input:
        genome="resources/genome/genome.fa.gz",
        gtf="resources/genome/genome.gtf.gz",
    output:
        index=directory("resources/genome/index/alevin"),
    log:
        out="logs/alevin/build_reference_index.out",
        err="logs/alevin/build_reference_index.err",
    threads: 16
    resources:
        mem="8G",
        runtime="1h",
    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa &&"
        " simpleaf index"
        " --output $jobdir/{output.index}"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads {threads}"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err}"
```
In short: changing directory to `$TMPDIR` (a local folder given to each Slurm job on our HPC) and running the `simpleaf index` command from there, so that `workdir.noindex/` and all those small intermediate files get created in a local tempdir.
Seems to have shortened the job from 3 hours down to 50 min.
Ah, and another attempt just completed where I edited the rule above to also produce the output directory in the local tempdir before copying it back to the network drive: still 50 min.
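A hedged sketch of what that change to the rule's `shell` section could look like (the local directory name `local_index` is illustrative, not a simpleaf convention):

```
    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa &&"
        " simpleaf index"
        " --output local_index"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads {threads}"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err} &&"
        " cp -r local_index $jobdir/{output.index}"
```

The only differences from the earlier rule are `--output` pointing at the local tempdir and the final `cp -r` back to the networked `{output.index}` directory.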
Not sure if there's anything else going on that makes Drosophila somehow slower to process than human. I hear there are some overlapping transcripts that might be a source of trouble.
Cross-posting from https://www.reddit.com/r/bioinformatics/comments/1g6zfu6/simpleaf_index_long_runtime/
Is there some guidance about the expected runtime of `simpleaf index` anywhere? The post above reports a 20 min runtime for human using 16 CPUs. In my current situation, Drosophila has a genome of approx. 180 Mb, and my HPC job with 16 CPUs timed out after an hour.
PS: my command is

```
simpleaf index --output resources/genome/index/alevin --fasta tmp_alevin_index.fa --gtf resources/genome/genome.gtf.gz --rlen 150 --threads 16 --use-piscem
```
In particular, I've set `--rlen 150` based on the length of my scRNA-seq reads. Is that alright? Thanks!