Closed lucygarner closed 2 years ago
Hi Lucy,
your parameters look good. --outFilterType BySJout applies splice junction filtering to the SAM alignments, preventing junctions with short overhang, low counts, etc:
### Output Filtering: Splice Junctions
outSJfilterReads All
string: which reads to consider for collapsed splice junctions output
All ... all reads, unique- and multi-mappers
Unique ... uniquely mapping reads only
outSJfilterOverhangMin 30 12 12 12
4 integers: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
does not apply to annotated junctions
outSJfilterCountUniqueMin 3 1 1 1
4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
does not apply to annotated junctions
outSJfilterCountTotalMin 3 1 1 1
4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
does not apply to annotated junctions
outSJfilterDistToOtherSJmin 10 0 5 10
4 integers>=0: minimum allowed distance to other junctions' donor/acceptor
does not apply to annotated junctions
outSJfilterIntronMaxVsReadN 50000 100000 200000
N integers>=0: maximum gap allowed for junctions supported by 1,2,3,,,N reads
i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000. by >=4 reads any gap <=alignIntronMax
does not apply to annotated junctions
The --outFilter* parameters are applied in the "AND" fashion, so the most stringent condition wins.
For mismatches, the most stringent default condition is --outFilterMismatchNmax 10
which allows up to 10 mismatches.
It's probably on the high side for most uses.
Among the --outFilter* parameters the most important are
outFilterScoreMinOverLread 0.66
real: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads)
outFilterMatchNminOverLread 0.66
real: sam as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).
They control the ultimate stringency of the alignment. They need to be increased for more stringent alignments.
Cheers Alex
Hi Alex,
Thank you for the response.
A few follow-up questions/clarifications.
1) Am I correct in thinking that if you are only going to use the BAM file for downstream analyses, then you don't need to worry about the --outFilterType
parameter? Or will this also influence the BAM output?
2) Would you then recommend changing outFilterMismatchNmax
to something lower e.g. 2, 5?
3) What values would you recommend for outFilterScoreMinOverLread
and outFilterMatchNminOverLread
for stringent alignment? Are there any disadvantages of increasing these parameters? e.g. is it possible that lower or higher quality alignments are biased towards a particular "class" of gene?
Best wishes, Lucy
Hi Lucy,
1.
--outFilterType BySJout
affect the BAM file: it prohibit alignments that contain junctions that do not pass SJout filtering.
2.
I think the best approach is to scale the max number of mismatches with the read length. The average sequencing error rate is low, <0.01, but we want to allow fluctuations (i.e. more mismatches than expected on average), and also allow for SNPs.
--outFilterMismatchNmax
3.
I would go as high as
--outFilterScoreMinOverLread 0.9
and --outFilterMatchNminOverLread 0.9
Thanks, Alex
Dear Alex,
I came here by chance today looking for an explanation of the above options, and found this very helpful - thank you!
A related question, that I have not found an answer for on previous threads: for the 2-pass option my understanding is that one would use the stringent options one chooses options for the first pass, but not the second (as spurious junctions would be filtered by the BySJout option)?
Best, Jakub
Hi Jakub,
similarly to the 1-step mapping, if you relax filtering in the 2nd step, you will see more reads aligned (i.e. increased sensitivity), but at the expense of higher false-positive rate. Stringent parameters in the 1st step allow for a more precise (but less sensitive) set of "annotated" junctions for the 2nd step.
Best, Alex
Aah, of course! I completely missed that the --outFilterType BySJout
option does not make a difference in the first run which is there to generate a junctions file, because we regenerate the BAM file from fastq in the second run, and don't use the first pass BAM. Therefore the --outFilterType BySJout
option is only really useful for the second run.
Got it: stringency options matter for both, and will have slightly different effects both times - stringency of junctions file for first run, and stringency of final output second time. Thanks a lot!
Best wishes, Jakub
Thank you @alexdobin.
With regard to point 2 above, I have set --outFilterMismatchNoverReadLmax
to 0.04. If I understand correctly, this would allow 6 mismatches in my 150 bp reads, so this would be my most stringent filtering. Is 0.04 reasonable (I saw it was the ENCODE standard option for long RNA-seq pipeline), or would you go for 0.02 or 0.03?
Hi Lucy, we have selected 0.04 after a long discussion within ENCODE. This is, of course, much higher than the typical Illumina error rate (<1%), but we thought it's important to allow for enough mismatches to align reads from loci with a large number of SNPs.
Ok great, thank you.
Thank you @alexdobin.
With regard to point 2 above, I have set
--outFilterMismatchNoverReadLmax
to 0.04. If I understand correctly, this would allow 6 mismatches in my 150 bp reads, so this would be my most stringent filtering. Is 0.04 reasonable (I saw it was the ENCODE standard option for long RNA-seq pipeline), or would you go for 0.02 or 0.03?
I was just looking at this parameter again. My reads are 150 bp paired-end, so would this actually allow 0.04 * 300 mismatches (i.e. 12)?
Yes, that's correct - it would allow up to 12 mismatches in the 2*150 PE read.
Hi,
I asked a similar question on the Google group, however I'm not sure whether the message got sent properly so I thought I would ask here as well.
I have stranded 150 bp paired-end reads (bulk RNA-seq).
The parameters I am currently using are:
STAR --outSAMprimaryFlag AllBestScore --outMultimapperOrder Random --outFilterType BySJout --outReadsUnmapped Fastx --outFilterMismatchNoverReadLmax 0.04 --genomeDir 210409_human_Ensembl/star_index --readFilesCommand zcat --readFilesIn 5083_V1_1P.fastq.gz 5083_V1_2P.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapping.dir/5083_V1.
I have a couple of questions:
outFilterType BySJout
argument only have an impact if you are doing two-pass mapping?outFilterMismatchNoverReadLmax
is set to 1.0 by default? I have set this to 0.04 based on the ENCODE settings as this seemed a more reasonable level of accepted mismatches. Would you recommend also makingoutFilterMismatchNoverLmax
more stringent?outFilter
parameters that you would recommend tweaking?Best wishes, Lucy