alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

Large number of discordant mapping, is this normal? #699

Closed yz201906 closed 5 years ago

yz201906 commented 5 years ago

I have a set of trimmed PE reads (originally 2 by 75) aligned to mm10 reference genome. Command used: STAR --runThreadN 12 --outMultimapperOrder Random --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --outFilterMismatchNoverLmax 0.05 --limitBAMsortRAM 1006613967 --peOverlapNbasesMin 50 --genomeDir $genome_path --readFilesIn $R1 $R2 --outFileNamePrefix ./$sample/ --outSAMtype BAM SortedByCoordinate As I understand,STAR by default should only output concordant alignments, however, when I convert the output bam to bedpe using bamtobed, I see a lot of this:

chr11   109012061       109012092       chr13   97190584        97190615        K00269:217:H7F5MBBXY:1:1101:1479:4022   3       -       -
chr13   97190584        97190615        chr11   109012061       109012092       K00269:217:H7F5MBBXY:1:1101:1479:4022   3       +       +
chr13   67139456        67139480        chr13   67139960        67139984        K00269:217:H7F5MBBXY:1:1101:1499:13025  0       +       +

These pairs are definitely discordant. I understand the fields are flagged with 3,0,1 etc not 255, but not every tool utilizes that field. Is this behavior intended? We rely on bedtools genomecov to visualize our data as this seems to be the only tool that we can easily calculate base coverage of only 3'-end of every alignment. And unfortunately, it also doesn't care about the flags. Is there a way to completely not output these alignments to the bam?

yz201906 commented 5 years ago

I forgot to post the Log.final.out:

                               Started job on |       Jul 19 19:50:39
                             Started mapping on |       Jul 19 19:51:50
                                    Finished on |       Jul 19 19:52:11
       Mapping speed, Million of reads per hour |       136.59

                          Number of input reads |       796759
                      Average input read length |       77
                                    UNIQUE READS:
                   Uniquely mapped reads number |       409013
                        Uniquely mapped reads % |       51.33%
                          Average mapped length |       69.80
                       Number of splices: Total |       81889
            Number of splices: Annotated (sjdb) |       81352
                       Number of splices: GT/AG |       61535
                       Number of splices: GC/AG |       18931
                       Number of splices: AT/AC |       160
               Number of splices: Non-canonical |       1263
                      Mismatch rate per base, % |       0.60%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.77
                        Insertion rate per base |       0.02%
                       Insertion average length |       1.12
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       347387
             % of reads mapped to multiple loci |       43.60%
        Number of reads mapped to too many loci |       19388
             % of reads mapped to too many loci |       2.43%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       7948
       % of reads unmapped: too many mismatches |       1.00%
            Number of reads unmapped: too short |       10742
                 % of reads unmapped: too short |       1.35%
                Number of reads unmapped: other |       2281
                     % of reads unmapped: other |       0.29%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Also here is the flagstats before filtering out improperly paired reads using samtools view:

2800788 + 0 in total (QC-passed reads + QC-failed reads)
1424653 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2800788 + 0 mapped (100.00% : N/A)
1376135 + 0 paired in sequencing
638157 + 0 read1
737978 + 0 read2
1239470 + 0 properly paired (90.07% : N/A)
1239470 + 0 with itself and mate mapped
136665 + 0 singletons (9.93% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

and after:

2420388 + 0 in total (QC-passed reads + QC-failed reads)
1180918 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2420388 + 0 mapped (100.00% : N/A)
1239470 + 0 paired in sequencing
619735 + 0 read1
619735 + 0 read2
1239470 + 0 properly paired (100.00% : N/A)
1239470 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Although it is showing 100% properly paired reads, it seems like STAR marks those pairs aligned to the same strand as concordant pairs.

alexdobin commented 5 years ago

Hi @yz201906

if you use --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 (or any values that are <0.5), STAR will output single-end alignments, which may result in the output of two discordant alignments for each mate. The parameters above control the min mapped length/score normalized to the total read length (sum of two mates), so to prevent one-mate alignments you need to set them >0.5 . Note, that it may not be enough to prevent one-mate alignments if one of the mates is trimmed too much. To make sure that no one-mate alignments are output, use --outFilterMatchNmin <UntrimmedMateLength+1>.

Cheers Alex

yz201906 commented 5 years ago

Ah, I see, thanks for the clarification.