alexdobin / STAR

RNA-seq aligner
MIT License

Correct argument to limit memory during alignment? #1159

Open jdrnevich opened 3 years ago

jdrnevich commented 3 years ago

I am using a computer cluster where I have to request a specific amount of memory from the SLURM scheduler. To generate an index, I request 60 GB and then run the following code:

module load STAR/2.7.6a-IGB-gcc-8.2.0

STAR --runThreadN $SLURM_NTASKS \
     --runMode genomeGenerate \
     --genomeDir STAR-2.7.6a_GRCm39_full_Index \
     --genomeFastaFiles GCF_000001635.27_GRCm39_genomic.fna \
     --limitGenomeGenerateRAM 60000000000

(60,000,000,000 B ≈ 55.88 GiB, so I use that value to make sure I stay under my requested memory)
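A quick way to double-check that a byte value passed to `--limitGenomeGenerateRAM` stays under a SLURM allocation is to convert it to GiB; a minimal shell sketch (assumes `awk` is available):

```shell
# Sanity check: convert the --limitGenomeGenerateRAM byte value
# into GiB to compare against the SLURM memory request.
bytes=60000000000
gib=$(awk -v b="$bytes" 'BEGIN { printf "%.2f", b / (1024 ^ 3) }')
echo "${gib} GiB"   # ~55.88 GiB, under a 60 GB request
```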

Now I want to align fastqs that only have 5 M reads each, so I do not need very much memory. If I only request 10 GB from the scheduler, what argument(s) do I need to give the alignment script so that it will not use more than 10 GB? I always thought it was also --limitGenomeGenerateRAM, but after discussing with someone else and reading the manual, it sounds like that only applies to generating the index. Looking at section 15.11 (Limits) of the manual, maybe I only need to add --limitBAMsortRAM with a specific value so it will not default to the "genome index size", like so:

STAR --runThreadN $SLURM_NTASKS \
     --genomeDir STAR-2.7.6a_GRCm39_full_Index \
     --readFilesIn sample1.fastq \
     --sjdbGTFfile GCF_000001635.27_GRCm39_genomic.gtf \
     --sjdbOverhang 99 \
     --outFileNamePrefix sample1_ \
     --outSAMtype BAM SortedByCoordinate \
     --limitBAMsortRAM 10000000000

Would this be the correct way to limit the alignment memory? Thanks!

alexdobin commented 3 years ago

Hi Jenny,

unfortunately, it's not possible to control the amount of RAM used at the alignment step - it's determined by the genome index size. A ~3 Gbase genome should fit into 32 GB, but I would request a bit more, say 35 GB.
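Sizing the scheduler request to the index, per the advice above, might look roughly like this in a batch script (a sketch only; the module name and task count come from the thread, the rest of the SBATCH header is a placeholder):

```shell
#!/bin/bash
#SBATCH --mem=35G        # ~32 GB index for a ~3 Gbase genome, plus headroom
#SBATCH --ntasks=4       # placeholder task count

module load STAR/2.7.6a-IGB-gcc-8.2.0

STAR --runThreadN $SLURM_NTASKS \
     --genomeDir STAR-2.7.6a_GRCm39_full_Index \
     --readFilesIn sample1.fastq \
     --outFileNamePrefix sample1_ \
     --outSAMtype BAM SortedByCoordinate
```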

--limitBAMsortRAM 10000000000 is only needed if you use the --genomeLoad shared-memory options, which are not recommended for cluster jobs. Without a shared-memory genome, the genome is unloaded before sorting, and the RAM used for sorting will be roughly equal to the genome index size.
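For contrast, a sketch of the shared-memory case where --limitBAMsortRAM does matter (--genomeLoad LoadAndKeep is a standard STAR option that keeps the index resident, so the sorting memory must be capped explicitly; as noted above, this mode is not recommended for cluster jobs):

```shell
# Shared-memory mapping: the index stays loaded, so STAR cannot
# reuse its memory for BAM sorting and needs an explicit cap.
STAR --runThreadN $SLURM_NTASKS \
     --genomeDir STAR-2.7.6a_GRCm39_full_Index \
     --genomeLoad LoadAndKeep \
     --readFilesIn sample1.fastq \
     --outSAMtype BAM SortedByCoordinate \
     --limitBAMsortRAM 10000000000
```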

Also, I would recommend adding the GTF file at the genome generation step, not at the mapping step - this would save time and memory at the mapping step.
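Following that suggestion, the genome-generation command from the top of the thread could fold the annotation in, roughly like this (same files as above; --sjdbOverhang 99 matches the mapping command, and the index directory name is a hypothetical one to distinguish it from the annotation-free index):

```shell
# Sketch: bake the GTF into the index at genomeGenerate time,
# so the mapping step no longer needs --sjdbGTFfile.
STAR --runThreadN $SLURM_NTASKS \
     --runMode genomeGenerate \
     --genomeDir STAR-2.7.6a_GRCm39_GTF_Index \
     --genomeFastaFiles GCF_000001635.27_GRCm39_genomic.fna \
     --sjdbGTFfile GCF_000001635.27_GRCm39_genomic.gtf \
     --sjdbOverhang 99 \
     --limitGenomeGenerateRAM 60000000000
```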

Cheers Alex

jdrnevich commented 3 years ago

Thanks, Alex. So should we still use --limitGenomeGenerateRAM in the alignment to indicate how much memory STAR has access to, or does it get automatically set to whatever the genome index had when it was created?

The reason we do not add the GTF file at the genome generation step is that the gene model annotation changes a lot more frequently than the genome. We are a core that analyzes lots of different species for many different researchers, and we have fewer reference indexes to maintain by not including the GTF. Though I admit we haven't done any benchmarking to see how much extra time it adds during the alignment vs. not having to remake a new index every quarter when a new Ensembl/Gencode gene set comes out!

alexdobin commented 3 years ago

Hi Jenny,

once the genome is created, STAR automatically allocates as much memory as it needs, so you do not need --limitGenomeGenerateRAM at the mapping stage.

I agree that in your case it makes perfect sense to add annotations on the fly, to avoid keeping multiple different references. It only adds a few minutes to each run, so for long enough runs it should not be a big overhead. However, it might be significant for short runs that you were talking about (5M reads?).

Cheers Alex