alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 501 forks source link

Reads number statistic in Log.final.out compared to mate1/2 #1004

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi,

I am using STAR 2.7.3a for extracting unmapped reads with the --outReadsUnmapped Fastx. I found that the number of unmapped reads in both mate1/2 is different from the Log.final.out output. The post #282 already mentioned the difference in unmapped reads number between the Log.final.out and the BAM flag and the answer is clear to me.

Here is the Log.final.out:

                            Started job on |  Jun 12 20:39:42
                        Started mapping on |  Jun 12 20:39:43
                               Finished on |  Jun 12 21:06:39
  Mapping speed, Million of reads per hour |  117.21

                     Number of input reads |  52616406
                 Average input read length |  148
                               UNIQUE READS:
              Uniquely mapped reads number |  22845052
                   Uniquely mapped reads % |  43.42%
                     Average mapped length |  147.56
                  Number of splices: Total |  8902714
       Number of splices: Annotated (sjdb) |  7241157
                  Number of splices: GT/AG |  8840249
                  Number of splices: GC/AG |  48363
                  Number of splices: AT/AC |  5141
          Number of splices: Non-canonical |  8961
                 Mismatch rate per base, % |  0.40%
                    Deletion rate per base |  0.01%
                   Deletion average length |  1.43
                   Insertion rate per base |  0.00%
                  Insertion average length |  1.29
                        MULTI-MAPPING READS:
   Number of reads mapped to multiple loci |  28132601
        % of reads mapped to multiple loci |  53.47%
   Number of reads mapped to too many loci |  0
        % of reads mapped to too many loci |  0.00%
                             UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 1632629 % of reads unmapped: too short | 3.10% Number of reads unmapped: other | 6124 % of reads unmapped: other | 0.01% CHIMERIC READS: Number of chimeric reads | 144988 % of chimeric reads | 0.28%

Then I made some calculation as explained in post #282 to get the number of unmapped reads:

Unmapped = 1'632'629 + 6'124 = 1'638'753 -> 2X = 3'277'506

samtools view -c -f 4 S1084_S5_L001_Aligned.out.bam 3'315'594 samtools view -c -f 8 S1084_S5_L001_Aligned.out.bam 3'381'778 samtools view -c -f 12 S1084_S5_L001_Aligned.out.bam 3'277'506

Now, when calculating the number of unmapped reads present in both mate1 and 2 files, the number of unmapped reads is inconsistent with samtools and the Log.final.out.

expr $(cat S1084_S5_L001_Unmapped.out.mate1 | wc -l ) / 4 = 1'676'841 -> 2X = 3'353'682 expr $(cat S1084_S5_L001_Unmapped.out.mate2 | wc -l ) / 4 = 1'676'841

Why is there a difference between the Log.final.out and the Unmapped.out.mate1/2 in the number of unmapped reads?

alexdobin commented 4 years ago

Hi @Schrnz25

in addition to "truly" unmapped reads, the Unmapped file contains reads that are mapped as PE reads, but one of the mates is not mapped. Note that this should not happen for default parameters if mates have the same lengths. It does happen in your case, since the samtools view -c -f 4 and -f 8 are different.

To differentiate these cases in the Unmapped file, next to the readsID, there is a tag = 00 for both mates unmapped, and 01 and 10 for one of the mates mapped and the other unmapped.

Cheers Alex

ghost commented 4 years ago

Hi Alex,

So if I understand correctly, both mate1/2 contain "truly" unmapped mates as well as unmapped reads whose opposite mapped (and those mapped opposites should not be present in the other mate file)? If so, why do mate1/2 possess the exact same number of sequences? Is it pure chance?

Also, you mentioned that this should not happen for default parameters. What are the parameters involved in this? Here are mine:

##### Final user re-defined parameters-----------------:
runMode                           alignReads
runThreadN                        22
genomeDir                         /home/genomeDir/
genomeLoad                        LoadAndKeep
readFilesIn                       S1084_S5_L001_1_paired.fq.gz,S1084_S5_L002_1_paired.fq.gz,S1084_S5_L003_1_paired.fq.gz,S1084_S5_L004_1_paired.fq.gz   S1084_S5_L001_2_paired.fq.gz,S1084_S5_L002_2_paired.fq.gz,S1084_S5_L003_2_paired.fq.gz,S1084_S5_L004_2_paired.fq.gz   
readFilesCommand                  gunzip   -c   
outFileNamePrefix                 /home/STAR_output/S1084_S5_L001/S1084_S5_L001_
outReadsUnmapped                  Fastx
outSAMtype                        BAM   Unsorted   
outSAMunmapped                    Within   
outFilterMultimapNmax             1000
chimSegmentMin                    20
chimOutType                       Junctions   
quantMode                         GeneCounts

Kind regards

alexdobin commented 4 years ago

Hi @Schrnz25

the Unmapped1/2.fasta file have to preserve the order of the reads, so even if only one mate does not map, both mates are output. Such reads will be tagged with 01 or 10 tags.

The single-mate mapping could happen even with default parameters, if the mates have different lengths (say, after trimming). This should be why you see them.

Cheers Alex