alexdobin / STAR

RNA-seq aligner
Aligning Oligos #1758

Open gf-atebbe opened 1 year ago

gf-atebbe commented 1 year ago

We're trying to optimize the alignment of short oligos (~21 bp) with STAR 2.7.10a by tuning --sjdbOverhang and --genomeSAindexNbases. We are building the index from the NCBI human reference genome (GRCh38.p14), using the following files:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

We ran genome generation three times, each with --sjdbOverhang 29 and with --genomeSAindexNbases set to 29, 17, and 16 in turn. All runs used an r6a.48xlarge instance. If we set --genomeSAindexNbases to anything larger than 16, it fails. The two failures are shown below: is this an out-of-memory condition or a potential bug?

Does it make sense to increase both of these parameters to try to reduce the runtimes of alignments?

STAR \
--runThreadN 23 \
--runMode genomeGenerate \
--genomeDir /mnt/data/overhangs \
--genomeFastaFiles /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.fna \
--sjdbGTFfile /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.gtf \
--sjdbOverhang 29 \
--genomeSAindexNbases 29

Feb 09 19:33:15 ..... started STAR run
Feb 09 19:33:15 ... starting to generate Genome files
Feb 09 19:33:56 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 29 is too large for the genome size=3298430636, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 14
Feb 09 19:34:34 ... starting to sort Suffix Array. This may take a long time...
Feb 09 19:35:31 ... sorting Suffix Array chunks and saving them to disk...
Feb 09 19:54:02 ... loading chunks from disk, packing SA...
Feb 09 19:55:25 ... finished generating suffix array
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

STAR \
--runThreadN 23 \
--runMode genomeGenerate \
--genomeDir /mnt/data/overhangs \
--genomeFastaFiles /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.fna \
--sjdbGTFfile /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.gtf \
--sjdbOverhang 29 \
--genomeSAindexNbases 17

Feb 09 16:10:32 ..... started STAR run
Feb 09 16:10:32 ... starting to generate Genome files
Feb 09 16:11:14 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 17 is too large for the genome size=3298430636, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 14
Feb 09 16:11:52 ... starting to sort Suffix Array. This may take a long time...
Feb 09 16:12:48 ... sorting Suffix Array chunks and saving them to disk...
Feb 09 16:30:12 ... loading chunks from disk, packing SA...
Feb 09 16:31:33 ... finished generating suffix array
Feb 09 16:31:33 ... generating Suffix Array index

BUG: next index is smaller than previous, EXITING

Feb 09 16:37:41 ...... FATAL ERROR, exiting

STAR \
--runThreadN 23 \
--runMode genomeGenerate \
--genomeDir /mnt/data/overhangs \
--genomeFastaFiles /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.fna \
--sjdbGTFfile /mnt/data/GCF_000001405.40_GRCh38.p14_genomic.gtf \
--sjdbOverhang 29 \
--genomeSAindexNbases 16

Feb 09 16:38:00 ..... started STAR run
Feb 09 16:38:00 ... starting to generate Genome files
Feb 09 16:38:41 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 16 is too large for the genome size=3298430636, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 14
Feb 09 16:39:19 ... starting to sort Suffix Array. This may take a long time...
Feb 09 16:40:16 ... sorting Suffix Array chunks and saving them to disk...
Feb 09 16:58:43 ... loading chunks from disk, packing SA...
Feb 09 17:00:05 ... finished generating suffix array
Feb 09 17:00:05 ... generating Suffix Array index
Feb 09 17:22:38 ... completed Suffix Array index
Feb 09 17:22:39 ..... inserting junctions into the genome indices
Feb 09 17:25:54 ... writing Genome to disk ...
Feb 09 17:25:55 ... writing Suffix Array to disk ...
Feb 09 17:26:01 ... writing SAindex to disk
Feb 09 17:26:08 ..... finished successfully
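
As a rough way to see why the largest setting might exhaust memory, the sketch below counts the distinct prefixes of length N over the four-letter alphabet A/C/G/T, which is 4^N. It assumes, without confirmation from this thread, that the suffix array index needs on the order of one entry per possible prefix. The function name `prefix_count` is made up for illustration, and actual memory use depends on STAR internals that are not described here.

```shell
#!/bin/sh
# Hypothetical helper: number of distinct N-base prefixes, 4^N = 2^(2N).
# Assumption: the SA index grows roughly in proportion to this count.
prefix_count() {
    echo $(( 1 << (2 * $1) ))
}

# Values of --genomeSAindexNbases tried in the three runs above,
# plus the recommended value of 14 from the warning.
for n in 14 16 17 29; do
    printf '%s\t%s\n' "$n" "$(prefix_count "$n")"
done
```

Under that assumption, going from 14 to 29 multiplies the number of prefixes by 4^15, roughly a billion, which would be consistent with the std::bad_alloc in the first run, though nothing in the thread confirms the cause.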

alexdobin commented 1 year ago

Hi Adam,

The maximum recommended value for --genomeSAindexNbases is 14. Why would you want to go above it?
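
The value 14 in the warnings above agrees with the scaling rule given in the STAR manual, min(14, log2(GenomeLength)/2 - 1). A minimal sketch of that calculation follows. The function name `recommended_sa_index_nbases` is hypothetical, and truncating the fractional part is an assumption, since the manual formula does not say how to round.

```shell
#!/bin/sh
# Hypothetical helper: min(14, log2(genome_length)/2 - 1), truncated to an integer.
# Assumption: the fractional part is truncated rather than rounded.
recommended_sa_index_nbases() {
    awk -v g="$1" 'BEGIN {
        n = int(log(g) / log(2) / 2 - 1)   # awk log() is the natural logarithm
        if (n > 14) n = 14                 # cap at the maximum recommended value
        print n
    }'
}

# Genome size reported in the logs above.
recommended_sa_index_nbases 3298430636
```

For the genome size in the logs, log2(3298430636)/2 - 1 is about 14.8, so the cap of 14 applies.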