alexdobin / STAR

RNA-seq aligner
MIT License

Loading genome is really slow #1551

Closed SepRah closed 2 years ago

SepRah commented 2 years ago

@alexdobin

Hey Alex,

I'm using the STAR aligner (version 2.7.9a) to map human RNA samples to a reference genome and to detect fusions in a self-built pipeline. Indexing my reference genome (GRCh37) worked fine, and the mapping itself works great and is fast (~3-4 min). The only problem is that the aligner takes a long time to load the genome with the SA (suffix array): about 20-30 min. I noticed these long loading times because I'm pretty sure loading was faster when I used STAR some time ago, and I haven't changed anything since. I'm using the following command in my Snakemake pipeline:

"""(mkdir -p 2-bam/STAR_files/{wildcards.unit}
/media/scratch/tools/TK/STAR-2.7.9a/bin/Linux_x86_64/STAR --runThreadN 5 --genomeDir {params.genomefiles} --limitBAMsortRAM 31000000000 --readFilesIn {input} --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --outFilterMismatchNoverLmax 0.3 --alignIntronMax 500000 --alignMatesGapMax 500000 --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --chimMainSegmentMultNmax 10 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --chimSegmentMin 10 --chimOutType WithinBAM --outFileNamePrefix 2-bam/STAR_files/{wildcards.unit}/{wildcards.unit}_) 2> {log} """

The server I'm using for that is pretty powerful and handled that before (as I mentioned). I also tried to run it with more threads but it doesn't change much. Do you have any suggestions? I would be delighted to receive a fast response!

Best, Sep

alexdobin commented 2 years ago

Hi Sep,

I have not changed the genome loading routine in a long time. The genome loading speed depends mostly on disk I/O. If you are using network-based storage, loading could be slow, but local disks should be very fast.
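One rough way to check whether disk I/O is the bottleneck is to time a plain sequential read of the largest index file. The sketch below is only an illustration: `time_read` is a hypothetical helper name, and the path in the usage note is a placeholder for your own genome directory.

```shell
# Hypothetical helper: time a sequential read of one file.
# $1 = file to read; prints the elapsed time in whole seconds.
time_read() {
    start=$(date +%s)
    cat "$1" > /dev/null
    end=$(date +%s)
    echo $(( end - start ))
}
```

For example, `time_read /path/to/genome_index/SA` reports how long the storage needs to deliver the suffix array file. A repeated read may be served from the operating system's page cache and look much faster than the first one, so the first measurement is the more telling one.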

SepRah commented 2 years ago

Hey Alex,

Thanks for your response! The main problem in my case is that loading the SA takes very long (I guess because it's the biggest file in the genome folder). Unfortunately, I can't use a local disk, since it doesn't have the right partitions for the temporary files. Do you have any other suggestions?

Best, Sep

alexdobin commented 2 years ago

Hi Sep,

Are you running it on a single server (node) or on a cluster? In the former case, you can use the --genomeLoad LoadAndKeep option in all your runs. The first run will load the genome from disk into shared RAM, and all subsequent runs will reuse the copy kept in RAM. After all your runs are finished, use --genomeLoad Remove to remove the genome from RAM.
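A minimal sketch of that workflow follows, assuming a single node. The function names, the `GENOME_DIR` placeholder path, the `DRY_RUN` switch, and the sample/file naming are all hypothetical; only the STAR options themselves come from this thread. By default the script just prints the commands it would run.

```shell
# Sketch only: GENOME_DIR and the sample/file layout are placeholders.
GENOME_DIR="${GENOME_DIR:-/path/to/genome_index}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = actually run STAR

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

align_sample() {
    # $1 = sample name, $2 = gzipped FASTQ file (hypothetical layout).
    # LoadAndKeep loads the genome into shared RAM on the first call
    # and reuses it on later calls. The explicit --limitBAMsortRAM is
    # kept from the original command; to my knowledge STAR needs it
    # for sorted BAM output when the genome is in shared memory.
    run STAR --runThreadN 5 --genomeDir "$GENOME_DIR" \
        --genomeLoad LoadAndKeep \
        --limitBAMsortRAM 31000000000 \
        --readFilesIn "$2" --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$1_"
}

unload_genome() {
    # Run once after the last alignment to free the shared memory.
    run STAR --genomeDir "$GENOME_DIR" --genomeLoad Remove
}
```

Calling `align_sample sampleA sampleA.fastq.gz` for each sample and `unload_genome` once at the end mirrors the sequence described above. In a Snakemake pipeline the unload step would have to be its own final rule that depends on all alignment outputs.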

SepRah commented 2 years ago

Hey Alex,

Thanks! That worked!