alexdobin / STAR

RNA-seq aligner
MIT License
1.83k stars 503 forks source link

Question regarding STAR parameters #1485

Closed lucygarner closed 2 years ago

lucygarner commented 2 years ago

Hi,

I asked a similar question on the Google group, however I'm not sure whether the message got sent properly so I thought I would ask here as well.

I have stranded 150 bp paired-end reads (bulk RNA-seq).

The parameters I am currently using are: STAR --outSAMprimaryFlag AllBestScore --outMultimapperOrder Random --outFilterType BySJout --outReadsUnmapped Fastx --outFilterMismatchNoverReadLmax 0.04 --genomeDir 210409_human_Ensembl/star_index --readFilesCommand zcat --readFilesIn 5083_V1_1P.fastq.gz 5083_V1_2P.fastq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapping.dir/5083_V1.

I have a couple of questions:

  1. Do these parameters seem reasonable?
  2. Does theoutFilterType BySJout argument only have an impact if you are doing two-pass mapping?
  3. Could you explain why the outFilterMismatchNoverReadLmax is set to 1.0 by default? I have set this to 0.04 based on the ENCODE settings as this seemed a more reasonable level of accepted mismatches. Would you recommend also making outFilterMismatchNoverLmax more stringent?
  4. Are there any other outFilter parameters that you would recommend tweaking?

Best wishes, Lucy

alexdobin commented 2 years ago

Hi Lucy,

your parameters look good. --outFilterType BySJout applies splice junction filtering to the SAM alignments, preventing junctions with short overhang, low counts, etc:

### Output Filtering: Splice Junctions
outSJfilterReads                All
    string: which reads to consider for collapsed splice junctions output
                                All     ... all reads, unique- and multi-mappers
                                Unique  ... uniquely mapping reads only

outSJfilterOverhangMin          30  12  12  12
    4 integers:    minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
                                does not apply to annotated junctions

outSJfilterCountUniqueMin       3   1   1   1
    4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
                                Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
                                does not apply to annotated junctions

outSJfilterCountTotalMin     3   1   1   1
    4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
                                Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
                                does not apply to annotated junctions

outSJfilterDistToOtherSJmin     10  0   5   10
    4 integers>=0: minimum allowed distance to other junctions' donor/acceptor
                                does not apply to annotated junctions

outSJfilterIntronMaxVsReadN        50000 100000 200000
    N integers>=0: maximum gap allowed for junctions supported by 1,2,3,,,N reads
                                i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000. by >=4 reads any gap <=alignIntronMax
                                does not apply to annotated junctions

The --outFilter* parameters are applied in the "AND" fashion, so the most stringent condition wins. For mismatches, the most stringent default condition is --outFilterMismatchNmax 10 which allows up to 10 mismatches. It's probably on the high side for most uses.

Among the --outFilter* parameters the most important are

outFilterScoreMinOverLread      0.66
    real: same as outFilterScoreMin, but  normalized to read length (sum of mates' lengths for paired-end reads)

outFilterMatchNminOverLread     0.66
    real: sam as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).

They control the ultimate stringency of the alignment. They need to be increased for more stringent alignments.

Cheers Alex

lucygarner commented 2 years ago

Hi Alex,

Thank you for the response.

A few follow-up questions/clarifications. 1) Am I correct in thinking that if you are only going to use the BAM file for downstream analyses, then you don't need to worry about the --outFilterType parameter? Or will this also influence the BAM output?

2) Would you then recommend changing outFilterMismatchNmax to something lower e.g. 2, 5?

3) What values would you recommend for outFilterScoreMinOverLread and outFilterMatchNminOverLread for stringent alignment? Are there any disadvantages of increasing these parameters? e.g. is it possible that lower or higher quality alignments are biased towards a particular "class" of gene?

Best wishes, Lucy

alexdobin commented 2 years ago

Hi Lucy,

1. --outFilterType BySJout affect the BAM file: it prohibit alignments that contain junctions that do not pass SJout filtering.

2. I think the best approach is to scale the max number of mismatches with the read length. The average sequencing error rate is low, <0.01, but we want to allow fluctuations (i.e. more mismatches than expected on average), and also allow for SNPs. --outFilterMismatchNmax

3. I would go as high as --outFilterScoreMinOverLread 0.9 and --outFilterMatchNminOverLread 0.9

Thanks, Alex

jscaber commented 2 years ago

Dear Alex,

I came here by chance today looking for an explanation of the above options, and found this very helpful - thank you!

A related question, that I have not found an answer for on previous threads: for the 2-pass option my understanding is that one would use the stringent options one chooses options for the first pass, but not the second (as spurious junctions would be filtered by the BySJout option)?

Best, Jakub

alexdobin commented 2 years ago

Hi Jakub,

similarly to the 1-step mapping, if you relax filtering in the 2nd step, you will see more reads aligned (i.e. increased sensitivity), but at the expense of higher false-positive rate. Stringent parameters in the 1st step allow for a more precise (but less sensitive) set of "annotated" junctions for the 2nd step.

Best, Alex

jscaber commented 2 years ago

Aah, of course! I completely missed that the --outFilterType BySJout option does not make a difference in the first run which is there to generate a junctions file, because we regenerate the BAM file from fastq in the second run, and don't use the first pass BAM. Therefore the --outFilterType BySJout option is only really useful for the second run.

Got it: stringency options matter for both, and will have slightly different effects both times - stringency of junctions file for first run, and stringency of final output second time. Thanks a lot!

Best wishes, Jakub

lucygarner commented 2 years ago

Thank you @alexdobin.

With regard to point 2 above, I have set --outFilterMismatchNoverReadLmax to 0.04. If I understand correctly, this would allow 6 mismatches in my 150 bp reads, so this would be my most stringent filtering. Is 0.04 reasonable (I saw it was the ENCODE standard option for long RNA-seq pipeline), or would you go for 0.02 or 0.03?

alexdobin commented 2 years ago

Hi Lucy, we have selected 0.04 after a long discussion within ENCODE. This is, of course, much higher than the typical Illumina error rate (<1%), but we thought it's important to allow for enough mismatches to align reads from loci with a large number of SNPs.

lucygarner commented 2 years ago

Ok great, thank you.

lucygarner commented 1 year ago

Thank you @alexdobin.

With regard to point 2 above, I have set --outFilterMismatchNoverReadLmax to 0.04. If I understand correctly, this would allow 6 mismatches in my 150 bp reads, so this would be my most stringent filtering. Is 0.04 reasonable (I saw it was the ENCODE standard option for long RNA-seq pipeline), or would you go for 0.02 or 0.03?

I was just looking at this parameter again. My reads are 150 bp paired-end, so would this actually allow 0.04 * 300 mismatches (i.e. 12)?

alexdobin commented 1 year ago

Yes, that's correct - it would allow up to 12 mismatches in the 2*150 PE read.