bvaldebenitom / SoloTE

GNU General Public License v3.0

Questions about TE and Gene quantification #47

Closed liuweihahaha closed 4 weeks ago

liuweihahaha commented 1 month ago

My workflow is to pass the BAM file output by STARsolo to SoloTE, but the SoloTE output confuses me. The following is the SoloTE output:

A total of 9074490 UMIs are in the final matrix.
Of these, 2580135 (28.433%) correspond to genes. and 6494355 (71.567%) correspond to TEs.
TE detected UMIs are distributed as follows:
Locus-specific TEs: 5922649 UMIs (91.197%).
Subfamily TEs: 571706 (8.803%).

Only 9074490 UMIs were counted in the final result. When I looked at the BAM file, I found that there were actually 353056107 reads in total. The mapping statistics for the BAM file are as follows:

                             Started job on |   Jun 01 10:58:44
                         Started mapping on |   Jun 01 10:59:03
                                Finished on |   Jun 01 11:29:13
   Mapping speed, Million of reads per hour |   702.21

                      Number of input reads |   353056107
                  Average input read length |   150
                                UNIQUE READS:
               Uniquely mapped reads number |   233058988
                    Uniquely mapped reads % |   66.01%
                      Average mapped length |   149.36
                   Number of splices: Total |   151247104
        Number of splices: Annotated (sjdb) |   148955816
                   Number of splices: GT/AG |   149404255
                   Number of splices: GC/AG |   1223002
                   Number of splices: AT/AC |   130002
           Number of splices: Non-canonical |   489845
                  Mismatch rate per base, % |   0.24%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.79
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.48
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   95549477
         % of reads mapped to multiple loci |   27.06%
    Number of reads mapped to too many loci |   1072
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches |   0
   % of reads unmapped: too many mismatches |   0.00%
        Number of reads unmapped: too short |   23456573
             % of reads unmapped: too short |   6.64%
            Number of reads unmapped: other |   989997
                 % of reads unmapped: other |   0.28%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

liuweihahaha commented 1 month ago

I sincerely hope you can answer this question. Thank you.

bvaldebenitom commented 4 weeks ago

Hi @liuweihahaha,

I hope you are doing well.

The difference between the number of UMIs and the total number of reads comes from the amplification step carried out before sequencing. Each original molecule is tagged with a UMI and then amplified, so many reads can carry the same UMI, and those reads are collapsed into a single count. Here are 2 resources from 10X Genomics where this is described:
https://kb.10xgenomics.com/hc/en-us/articles/115004037743-How-does-Cell-Ranger-correct-for-amplification-bias
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview
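To make the read-versus-UMI distinction concrete, here is a minimal Python sketch of the general idea of UMI collapsing. It is an illustration only, not SoloTE's or Cell Ranger's actual implementation: the function name `count_umis` and the toy barcodes, UMIs, and feature names are all made up, and real pipelines additionally correct sequencing errors in barcodes and UMIs and filter out reads without a valid cell barcode.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads into UMI counts per (cell barcode, feature).

    `reads` is an iterable of (cell_barcode, umi, feature) tuples.
    Reads sharing all three values are treated as amplification
    duplicates of one original molecule and counted once.
    """
    seen = defaultdict(set)
    for cell_barcode, umi, feature in reads:
        seen[(cell_barcode, feature)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical toy data: 7 reads originating from only 4 molecules.
reads = [
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI1", "GeneA"),  # amplification duplicate
    ("AAAC", "UMI1", "GeneA"),  # amplification duplicate
    ("AAAC", "UMI2", "GeneA"),
    ("AAAC", "UMI3", "L1HS"),
    ("TTTG", "UMI1", "GeneA"),
    ("TTTG", "UMI1", "GeneA"),  # amplification duplicate
]

counts = count_umis(reads)
total_umis = sum(counts.values())
print(f"{len(reads)} reads -> {total_umis} UMIs")

# Figures reported in this issue: uniquely mapped plus multi-mapped
# reads from the STAR log, and the UMI total from the SoloTE output.
mapped_reads = 233058988 + 95549477
final_umis = 9074490
reads_per_umi = mapped_reads / final_umis
print(f"At most about {reads_per_umi:.1f} mapped reads per UMI")
```

The last figure is only an upper bound on the duplication level, because not every mapped read ends up assigned to a valid cell barcode and an annotated gene or TE.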

If this is helpful, please go ahead and close the issue. If not, let me know if you have further questions.