sklages closed this issue 6 years ago
Hi Sven,
You can try reducing --seedPerWindowNmax from the default of 50 to 30 or even less. How many times do you expect the reads to map? If you are not looking for spliced alignments, --alignIntronMax 1 will prohibit splicing and may help with the speed.
Please send me the Log.out and Log.final.out files for a completed run.
Cheers Alex
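For readers following along, here is a minimal sketch of how the two suggestions above could be put together into a STAR call. It only assembles the argument list and prints it; nothing is run. The index directory and FASTQ name are hypothetical placeholders, and the thread count of 40 is taken from the 40 cores mentioned in this thread. Treat it as an illustration of the flags, not as a tuned configuration.

```python
import shlex


def build_star_command(genome_dir, reads_fastq, threads=40,
                       seed_per_window_nmax=30, allow_splicing=False):
    """Assemble a STAR argument list; nothing is executed here."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_fastq,
        # Lowered from the default of 50, as suggested above.
        "--seedPerWindowNmax", str(seed_per_window_nmax),
    ]
    if not allow_splicing:
        # A maximum intron size of 1 prohibits spliced alignments.
        cmd += ["--alignIntronMax", "1"]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_star_command("star_index_rmsk", "sample.fastq")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real ones.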
Hi Alex,
I am aligning the datasets against UCSC's RepeatMasker track (>100bp).
This setup may be subject to change, which is why I want to use STAR from the very beginning :-)
I will run STAR with the proposed parameters and will come back once the alignment has finished.
thanks, Sven
Hi Alex,
These are the logs; it took 3 days on 40 cores with a single-read dataset (380 million reads).
https://ws.molgen.mpg.de/ws/590885/SAMPLE_RMSK.Log.final.out
https://ws.molgen.mpg.de/ws/807478/SAMPLE_RMSK.Log.out
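To put that runtime in perspective, here is a quick back-of-the-envelope calculation. It assumes "3 days" means roughly 72 hours of wall-clock time and that all 40 cores were in use throughout, so the result is only a rough figure.

```python
# Figures quoted in this thread; "3 days" is treated as exactly 72 hours.
reads = 380_000_000
wall_hours = 3 * 24
cores = 40

reads_per_hour = reads / wall_hours
reads_per_core_hour = reads_per_hour / cores

print(f"{reads_per_hour / 1e6:.1f} million reads per hour")    # 5.3 million reads per hour
print(f"{reads_per_core_hour:,.0f} reads per core-hour")       # 131,944 reads per core-hour
```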
I will use a "genomic approach" for now; genome + GTF. I should get the same results, but probably faster.
Can I direct STAR to only report hits to known transcripts (provided by the gtf file)?
best, Sven
Hi Sven,
There is no option to output just those alignments. However, if your "exons" (intervals) in the RepeatMasker GTF track do not overlap, you can use --quantMode TranscriptomeSAM, which will output alignments in the "transcript" coordinates, as if you were mapping only to the RepeatMasker intervals. You can suppress the standard SAM output with --outSAMtype None.
Cheers Alex
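The condition above (non-overlapping "exon" intervals) can be checked before running STAR. Below is a minimal sketch, assuming a standard tab-separated GTF with 1-based inclusive coordinates. It ignores strand, which is an assumption on my part about what "overlap" should mean here. The function name and the toy records are made up for illustration.

```python
from collections import defaultdict


def find_overlapping_exons(gtf_lines):
    """Return (seqname, earlier, later) for each exon overlapping an earlier one.

    Strand is ignored (an assumption); coordinates are 1-based and inclusive.
    """
    by_seq = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        by_seq[fields[0]].append((int(fields[3]), int(fields[4])))

    overlaps = []
    for seq, intervals in by_seq.items():
        intervals.sort()
        furthest = intervals[0]  # interval with the largest end seen so far
        for cur in intervals[1:]:
            if cur[0] <= furthest[1]:  # inclusive ends: touching counts as overlap
                overlaps.append((seq, furthest, cur))
            if cur[1] > furthest[1]:
                furthest = cur
    return overlaps


# Toy records, invented for illustration only.
toy_gtf = [
    'chr1\trmsk\texon\t100\t200\t.\t+\t.\tgene_id "a"; transcript_id "a";',
    'chr1\trmsk\texon\t150\t300\t.\t-\t.\tgene_id "b"; transcript_id "b";',
    'chr1\trmsk\texon\t400\t500\t.\t+\t.\tgene_id "c"; transcript_id "c";',
    'chr2\trmsk\texon\t100\t200\t.\t+\t.\tgene_id "d"; transcript_id "d";',
]
print(find_overlapping_exons(toy_gtf))
```

If the returned list is empty for the real annotation, the --quantMode TranscriptomeSAM and --outSAMtype None combination described above should behave as if the reads were mapped only to the annotated intervals.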
Hi Alex, I am a bit puzzled that the alignment to the genome takes as long as the "transcriptome" mapping. Even with such a "pimped" parameter set, it seems that alignment speed depends directly on the "number of transcripts", whether they are single sequences in a "transcriptome" reference or "exon" entries in a genomic GTF annotation. This number is the same for both approaches, a bit more than 3 million.
I thought I could gain more speed using a genomic reference ..
best, Sven
OK, I think I found the explanation for my data (to be aligned to "transcriptome-like" reference). https://groups.google.com/forum/#!topic/rna-star/I1phytOigdE
Only 8-18% of my reads actually match this reference.
As for the genome alignment, I used the wrong config file (one with spliced alignment "switched off").
Hi,
I want to use STAR to map some large Illumina RNA datasets (98bp, 400-600 million reads, single-end) to a quite repetitive "transcriptome". The reference contains ~3 million sequences, ranging from 100bp to a little over 40kb.
I do not need any spliced alignment, I just want to have the best alignment (if any) for each sequence. I am aware I could use bwa or bowtie, but I want to use STAR for this approach. However, it takes very long even on 40 cores.
What are the key parameters to speed up the process?
I used the following parameters:
thanks, Sven