Help: Speedup non-spliced mapping

sklages commented 7 years ago

Hi,

I want to use STAR to map some large Illumina RNA datasets (98 bp, 400-600 million reads, single-end) to a quite repetitive "transcriptome". The reference contains ~3 million sequences, ranging from 100 bp to a little over 40 kb.

I do not need any spliced alignment, I just want to have the best alignment (if any) for each sequence. I am aware I could use bwa or bowtie but I want to use STAR for this approach.

But it takes very long even on 40 cores.

What are the key parameters to speed up the process?

I used the following parameters:

runMode  alignReads
genomeLoad  NoSharedMemory
readFilesCommand  gunzip -c
readMapNumber  -1
outReadsUnmapped  None
outSAMtype  SAM
outSAMmode  NoQS
outSAMattributes NH HI AS nM NM MD XS RG
outSAMunmapped  None
outSAMprimaryFlag  OneBestScore
outSAMmapqUnique  255
outSAMmultNmax 2
outFilterMultimapNmax  1000000
outFilterMismatchNmax  10
alignEndsType  Local
chimOutType  SeparateSAMold
chimMainSegmentMultNmax  2

outFilterMatchNmin 75

thanks, Sven

alexdobin commented 7 years ago

Hi Sven,

You can try to reduce --seedPerWindowNmax from the default 50 to 30 or even less. How many times do you expect the reads to map? If you are not looking for spliced alignments, --alignIntronMax 1 will prohibit splicing and may help with the speed.
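As a rough sketch of how these two settings could be passed on the command line, the snippet below assembles a STAR call with a lowered --seedPerWindowNmax and --alignIntronMax 1. The index directory, the FASTQ file name, and the helper function are placeholders invented for illustration, and 30 is simply the example value mentioned above, not a tuned recommendation.

```python
import shlex

def build_star_command(genome_dir, fastq_gz, threads=40, seed_per_window_nmax=30):
    """Assemble a STAR argument list for non-spliced mapping (sketch only)."""
    return [
        "STAR",
        "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # hypothetical index directory
        "--readFilesIn", fastq_gz,            # hypothetical input file
        "--readFilesCommand", "gunzip", "-c",
        "--seedPerWindowNmax", str(seed_per_window_nmax),
        "--alignIntronMax", "1",              # prohibit spliced alignments
    ]

cmd = build_star_command("rmsk_index", "sample.fastq.gz")
print(shlex.join(cmd))  # print the command instead of running it
```

To actually launch the run, the list could be handed to `subprocess.run(cmd, check=True)`; the remaining output options from the original parameter set would be appended in the same way.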

Please send me the Log.out and Log.final.out files for a completed run.

Cheers Alex

sklages commented 7 years ago

Hi Alex,

I am aligning the datasets against UCSC's RepeatMasker track (>100bp)

~80% of the reads won't map at all.
Of the remaining 20%, I expect a lot of multi-mappers, but I just need the best one.
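With outSAMprimaryFlag OneBestScore (as in the parameter list above), one best-scoring alignment per read is marked primary and the others carry the secondary flag (0x100). A minimal post-processing sketch that keeps only the primary records could look like this; the function name and the SAM records are made up for illustration.

```python
def primary_alignments(sam_lines):
    """Yield SAM records that are not flagged as secondary (0x100)."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        record = line.rstrip("\n")
        flag = int(record.split("\t")[1])  # second SAM column is FLAG
        if flag & 0x100:
            continue  # secondary alignment of a multi-mapper
        yield record

# Made-up records: read1 maps twice, read2 maps once.
example = [
    "@HD\tVN:1.4",
    "read1\t0\trepA\t10\t3\t98M\t*\t0\t0\t*\t*\tNH:i:2\tHI:i:1\tAS:i:96",
    "read1\t256\trepB\t55\t3\t98M\t*\t0\t0\t*\t*\tNH:i:2\tHI:i:2\tAS:i:90",
    "read2\t16\trepC\t7\t255\t98M\t*\t0\t0\t*\t*\tNH:i:1\tHI:i:1\tAS:i:96",
]
kept = list(primary_alignments(example))
print(len(kept))
```

This only filters what STAR has already written; it does not change how STAR searches, so it has no effect on run time.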

This setup may be subject to change, that's why I want to use STAR from the very beginning :-)

I will run STAR with the proposed parameters and will come back once the alignment has finished.

thanks, Sven

sklages commented 6 years ago

Hi Alex,

these are the logs; it took 3 days on 40 cores with a single-read dataset (380 million reads).
https://ws.molgen.mpg.de/ws/590885/SAMPLE_RMSK.Log.final.out
https://ws.molgen.mpg.de/ws/807478/SAMPLE_RMSK.Log.out
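For scale, the numbers quoted here can be turned into a throughput figure. This is back-of-the-envelope arithmetic that assumes "3 days" means roughly 72 hours of wall-clock time; the exact mapping speed is reported in Log.final.out.

```python
reads = 380e6        # reads in the dataset, as stated in the post
wall_hours = 3 * 24  # assumption: "3 days" is about 72 hours wall-clock
cores = 40

reads_per_hour = reads / wall_hours
reads_per_core_hour = reads_per_hour / cores

print(f"{reads_per_hour / 1e6:.1f} million reads per hour")
print(f"{reads_per_core_hour:,.0f} reads per core-hour")
```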

I will use a "genomic approach" for now; genome + GTF. I should get the same results, but probably faster.

Can I direct STAR to only report hits to known transcripts (provided by the gtf file)?

best, Sven

alexdobin commented 6 years ago

Hi Sven,

there is no option to output just those alignments. However, if your "exons" (intervals) in the Repeat-Masker GTF track do not overlap, you can use --quantMode TranscriptomeSAM which will output alignments in the "transcript" coordinates - as if you were mapping only to the RepeatMasker intervals. You can suppress the standard SAM output with --outSAMtype None.
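Whether the intervals overlap can be checked before trying this. The sketch below scans the "exon" lines of a GTF and reports intervals that overlap an earlier one on the same chromosome. It ignores strand and treats coordinates as 1-based inclusive (the GTF convention); the function name and the example records are made up for illustration.

```python
from collections import defaultdict

def overlapping_exons(gtf_lines):
    """Return (chrom, earlier, later) for each exon overlapping a preceding one."""
    by_chrom = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only "exon" features are considered
        by_chrom[fields[0]].append((int(fields[3]), int(fields[4])))

    overlaps = []
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        furthest = None  # interval seen so far that reaches furthest right
        for iv in intervals:
            if furthest is not None and iv[0] <= furthest[1]:
                overlaps.append((chrom, furthest, iv))
            if furthest is None or iv[1] > furthest[1]:
                furthest = iv
    return overlaps

# Made-up GTF records for illustration.
example_gtf = [
    'chr1\trmsk\texon\t100\t250\t.\t+\t.\tgene_id "r1"; transcript_id "r1";',
    'chr1\trmsk\texon\t300\t420\t.\t+\t.\tgene_id "r2"; transcript_id "r2";',
    'chr1\trmsk\texon\t400\t500\t.\t-\t.\tgene_id "r3"; transcript_id "r3";',
    'chr2\trmsk\texon\t100\t200\t.\t+\t.\tgene_id "r4"; transcript_id "r4";',
]
found = overlapping_exons(example_gtf)
print(found)
```

An empty result suggests the non-overlap condition holds for the annotation; any reported pair would need to be resolved (or accepted) before relying on the transcript-coordinate output.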

Cheers Alex

sklages commented 6 years ago

Hi Alex, I am a bit puzzled that the alignment to the genome takes as long as the "transcriptome" mapping. So even with such a "pimped" parameter set, it seems that alignment speed depends directly on the "number of transcripts", whether they are given as single sequences in a "transcriptome" reference or as "exon" entries in a genomic GTF annotation. That number is the same for both approaches: a bit more than 3 million.

I thought I could gain more speed using a genomic reference ..

best, Sven

sklages commented 6 years ago

OK, I think I found the explanation for my data (to be aligned to the "transcriptome-like" reference):
https://groups.google.com/forum/#!topic/rna-star/I1phytOigdE

Only 8-18% of my reads actually match this reference.
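The mapped fraction can be read from Log.final.out. The following is a minimal sketch that parses the "key | value" lines of that file and adds the unique and multi-mapping percentages. The field names are written as I recall them from STAR's Log.final.out and should be checked against an actual file; the numbers in the sample text are invented, and reads counted as mapped to too many loci are not included.

```python
def parse_log_final(text):
    """Parse 'key | value' lines of a Log.final.out-style text into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

def mapped_percent(stats):
    """Sum of uniquely mapped and multi-mapped read percentages."""
    unique = float(stats["Uniquely mapped reads %"].rstrip("%"))
    multi = float(stats["% of reads mapped to multiple loci"].rstrip("%"))
    return unique + multi

# Invented sample text, not taken from the linked logs.
sample = """\
                          Number of input reads |\t1000000
                                    UNIQUE READS:
                        Uniquely mapped reads % |\t4.25%
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |\t8.25%
"""
stats = parse_log_final(sample)
print(f"{mapped_percent(stats):.1f}% of reads mapped")
```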

As for the genome alignment, I used the wrong cfg file (with spliced alignment "switched off").
