alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Read number statistics in Log.final.out #1688

Open frederick-de-baene opened 2 years ago

frederick-de-baene commented 2 years ago

Hi

We have used Star (version 2.7.9a) to map paired-end reads, and obtained the following Log.final.out file:

                                 Started job on |   Jul 06 08:01:20
                             Started mapping on |   Jul 06 08:38:39
                                    Finished on |   Jul 06 09:44:44
       Mapping speed, Million of reads per hour |   45.65

                          Number of input reads |   50279327
                      Average input read length |   150
                                    UNIQUE READS:
                   Uniquely mapped reads number |   41934056
                        Uniquely mapped reads % |   83.40%
                          Average mapped length |   149.77
                       Number of splices: Total |   24574513
            Number of splices: Annotated (sjdb) |   24349092
                       Number of splices: GT/AG |   24066725
                       Number of splices: GC/AG |   230282
                       Number of splices: AT/AC |   24752
               Number of splices: Non-canonical |   252754
                      Mismatch rate per base, % |   0.35%
                         Deletion rate per base |   0.10%
                        Deletion average length |   2.21
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.32
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   5381462
             % of reads mapped to multiple loci |   10.70%
        Number of reads mapped to too many loci |   641111
             % of reads mapped to too many loci |   1.28%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   1621558
                 % of reads unmapped: too short |   3.23%
                Number of reads unmapped: other |   545954
                     % of reads unmapped: other |   1.09%
                                  CHIMERIC READS:
                       Number of chimeric reads |   557990
                            % of chimeric reads |   1.11%

Looking into more detail, it seems the number of reads do not add up correctly. According to post #282, the number of input reads is calculated as follows:

total input read = uniquely mapped reads + mapped to multiple loci + mapped to too many loci + too many mismatches + too short + other

However, when applying this formula to the above log file:

uniquely mapped reads + mapped to multiple loci + mapped to too many loci + too many mismatches + too short + other = 41934056 + 5381462 + 641111 + 0 + 1621558 + 545954 = 50124141

Which is 155186 less than the number of input reads according to the log file (50279327).

The command used to obtain this log file:

STAR
    --genomeDir <path-to-index-dir> \
    --sjdbGTFfile <path-to-gtf-file> \
    --readFilesIn <read-files>  \
    --runThreadN <N> \
    --twopassMode Basic \
    --outSAMtype BAM Unsorted \
    --readFilesCommand zcat \
    --runDirPerm All_RWX \
    --outFileNamePrefix <prefix> \
    --chimSegmentMin 10 \
    --chimOutType WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreMin 1 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentReadGapMax 3

What could explain this discrepancy?

Thank you for having a look.

BR

alexdobin commented 2 years ago

Hi Frederick,

when the chimeric output option is used, some of the chimeric reads do not map non-chimerically. They are not counted as mapped or unmapped in the Log.final.out file.

frederick-de-baene commented 2 years ago

So, if I understand correctly, this data set has 557990 chimeric read pairs which have chimeric segments. Of those 557990 chimeric read pairs, there is a first subset which has both segments mapped (one segment mapped non-chimerically (let's call it seg_1)) and the other mapped chimerically (let's call it seg_2)), a second subset for which only one segment was mapped non-chimerically (seg_1), and a third subset for which only one segment mapped chimerically (supplementary alignment) (seg_2).

In post #282, you mentioned the following: << Chimeric alignments are counted separately - some of them are reported as mapped (if non-chimeric alignment passes the mapped filter), and some as unmapped (otherwise). >>

But, putting this together, it seems that chimeric read pairs, which are counted separately and for which the total no. of read pairs can be found in the log file (i.e., 557990), are either reported as mapped (both seg_1 and seg_2 are mapped), unmapped (seg_1 is mapped but seg_2 not), or neither (neither seg_1 and seg_2 may be or be not mapped).

So, in the example above, there are 155186 chimeric read pairs which are not reported as mapped nor unmapped because they do not have their seg_1 mapped.

As a follow-up question, why are these 155186 read pairs not just reported as unmapped?

alexdobin commented 2 years ago

This is correct. The reads that do not map normally, but map chimerically, are not reported as unmapped.

frederick-de-baene commented 2 years ago

Okay, thank you for the clarification. And what is the reason for not reporting them neither as mapped nor unmapped?

alexdobin commented 2 years ago

The reads that map chimerically are not considered unmapped because --chimOutType WithinBAM forces their output to the BAM file.