alexdobin / STAR

RNA-seq aligner
MIT License

T2T CHM13 genome mapping leaves most reads unmapped, and parameter tuning didn't help #2053

Open gnilihzeux opened 5 months ago

gnilihzeux commented 5 months ago

Dear author, a very high fraction of reads is reported as unmapped ('too short' and 'other') when mapping to the T2T CHM13 genome, whereas the same data maps well to the hg19 genome. BTW, 83% of the reads map to T2T with bowtie2. Our data is RNA-seq of ribosome fractions. Our group has modified some repeat-related parameters, including a higher --winAnchorMultimapNmax, a higher --outFilterMultimapNmax, and --alignIntronMin 1, but none of these changes helped.

What parameters should be set?

Thanks a lot.

The logs are as follows.

T2T

                                 Started job on |       Jan 25 02:22:54
                             Started mapping on |       Jan 25 02:23:47
                                    Finished on |       Jan 25 02:43:58
       Mapping speed, Million of reads per hour |       108.31

                          Number of input reads |       36434753
                      Average input read length |       283
                                    UNIQUE READS:
                   Uniquely mapped reads number |       2537986
                        Uniquely mapped reads % |       6.97%
                          Average mapped length |       278.61
                       Number of splices: Total |       1204395
            Number of splices: Annotated (sjdb) |       1136229
                       Number of splices: GT/AG |       1164313
                       Number of splices: GC/AG |       9637
                       Number of splices: AT/AC |       1101
               Number of splices: Non-canonical |       29344
                      Mismatch rate per base, % |       0.48%
                         Deletion rate per base |       0.09%
                        Deletion average length |       1.85
                        Insertion rate per base |       0.04%
                       Insertion average length |       1.37
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       456923
             % of reads mapped to multiple loci |       1.25%
        Number of reads mapped to too many loci |       190199
             % of reads mapped to too many loci |       0.52%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.06%
                 % of reads unmapped: too short |       48.17%
                     % of reads unmapped: other |       43.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |       26935
                            % of chimeric reads |       0.07%

hg19

       Mapping speed, Million of reads per hour |       230.11

                          Number of input reads |       36434753
                      Average input read length |       283
                                    UNIQUE READS:
                   Uniquely mapped reads number |       8063144
                        Uniquely mapped reads % |       22.13%
                          Average mapped length |       285.69
                       Number of splices: Total |       1959408
            Number of splices: Annotated (sjdb) |       1133733
                       Number of splices: GT/AG |       1246660
                       Number of splices: GC/AG |       36075
                       Number of splices: AT/AC |       2267
               Number of splices: Non-canonical |       674406
                      Mismatch rate per base, % |       0.39%
                         Deletion rate per base |       0.12%
                        Deletion average length |       1.18
                        Insertion rate per base |       0.02%
                       Insertion average length |       1.12
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       27457561
             % of reads mapped to multiple loci |       75.36%
        Number of reads mapped to too many loci |       13369
             % of reads mapped to too many loci |       0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.07%
                 % of reads unmapped: too short |       2.32%
                     % of reads unmapped: other |       0.09%
                                  CHIMERIC READS:
                       Number of chimeric reads |       614048
                            % of chimeric reads |       1.69%
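
To compare the two summaries above side by side, the `key | value` lines can be parsed into a dictionary and the unmapped categories summed. This is only a minimal sketch: the function names `parse_final_log` and `unmapped_percent` are hypothetical helpers, not part of STAR, and the sketch assumes the log text has the same layout as the excerpts pasted above.

```python
def parse_final_log(text):
    """Parse 'key | value' lines of a STAR final-log excerpt into a dict.

    Section headers such as 'UNIQUE READS:' contain no '|' and are skipped.
    """
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def unmapped_percent(stats):
    """Sum every '% of reads unmapped: ...' entry, returned as a float."""
    return sum(
        float(value.rstrip("%"))
        for key, value in stats.items()
        if key.startswith("% of reads unmapped")
    )


if __name__ == "__main__":
    # Three lines copied from the T2T log in this thread.
    t2t_excerpt = """
       % of reads unmapped: too many mismatches |       0.06%
                 % of reads unmapped: too short |       48.17%
                     % of reads unmapped: other |       43.03%
    """
    print(round(unmapped_percent(parse_final_log(t2t_excerpt)), 2))
```

With the T2T excerpt this adds up to roughly 91% of reads unmapped, against about 2.5% for the hg19 run.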

alexdobin commented 5 months ago

Hi @gnilihzeux

I would recommend exploring the reads that were mapped by bowtie2 and not mapped by STAR.
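
One possible way to follow this suggestion is to collect the names of reads that carry at least one mapped record in each aligner's SAM output and take the set difference. The sketch below is an assumption about how one might do it, not a method from the author: the function names are hypothetical, it reads plain SAM text lines (not BAM), and it treats a read as mapped if any of its records lacks the unmapped flag bit (0x4). Reads that are simply absent from a file count as unmapped, which matters because an aligner may be configured not to write unmapped reads at all.

```python
def mapped_read_names(sam_lines):
    """Return the set of read names with at least one mapped SAM record."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if not flag & 0x4:  # bit 0x4 set means "segment unmapped"
            names.add(fields[0])
    return names


def mapped_only_by_first(first_sam_lines, second_sam_lines):
    """Read names mapped in the first SAM stream but not in the second."""
    return mapped_read_names(first_sam_lines) - mapped_read_names(second_sam_lines)


if __name__ == "__main__":
    # Hypothetical file names; replace with the real bowtie2 and STAR outputs.
    with open("bowtie2.sam") as bt2, open("star.sam") as star:
        for name in sorted(mapped_only_by_first(bt2, star)):
            print(name)
```

The resulting names can then be used to pull the corresponding reads from the FASTQ files for inspection.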

gnilihzeux commented 4 months ago

@alexdobin Yes, I seem to have found what happened to the unmapped reads: most of them are palindromic sequences between Read1 and Read2.

However, I have not found a solution to this problem yet.

Some of the sequences are listed below:

>@illumina:8501:1210 mate1
GAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTCATTGAATTTCTTCACTTTG
>@illumina:8501:1210 mate2
CAAAGTGAAGAAATTCAATGAAGCGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTC
>@illumina:36606:2009 mate1
AGCCGTCCCGGAGCCGGTCGCGGCGCACCGCCGCGGTGGAAATGCGCCCGGCGGCGGCCGGTCGCCGGTCGGGGGACGGTCCCCCGCCGACCCCACCCCCGGCCCCGCCCGCCCACCCCCGCACCCGCCGGAGCCCGCCCCCTCCGGGGA
>@illumina:36606:2009 mate2
GGCCGTGTCGGCGGCCCGGCGGATCTTTCCCGCCCCCCGTTCCTCCCGACCCCTCCACCCGCCCTCCCTTCCCCCGCCGCCCCTCCTCCTCCTCCCCGGAGGGGGCGGGCTCCGGCGGGTGCGGGGGTGGGCGGGCGGGGCCGGGGGTGG
>@illumina:36347:2009 mate1
ATCGGCGAGTGCTGCTGCCGGGGGGGCTGTAACACTCGGGGGGGGTTTCGGTCCCGCCGCCGCCGCCGCCGCCGCCACCGCCGCCGCGAGGGGGGGGGAATCA
>@illumina:36347:2009 mate2
TGATTCCCCCCCCCTCGCGGCGGCGGTGGCGGCGGCGGCGGCGGCGGCGGGACCGAAACCCCCCCCGAGTGTTACAGCCCCCCCGGCAGCAGCACTCGCCGAT
>@illumina:49804:3788 mate1
GTAGTTCACCATCTTTCGGGTCCTAACACGTGCGCTCGTGCTCCACCTCCCCGGCGCGGCGGGCGAGACGGGCCGGTGGTGCGCCCTCGGCGGACTGGAGAGGCATCGGGATCCCACCTCGGGAAGCG
>@illumina:49804:3788 mate2
CAAGGAGTCTAACACGTGCGCGAGTCGGGGGCTCGCACGAAAGCCGCCGTGGCGCAATGAAGGTGAAGGCCGGCGCGCTCGCCGGCCGAGGTGGGATCCCGAGGCCTCTCCAGTCCGCCGAGGGCGCACCACCGGCCCGTCTCGCCCGCC
>@illumina:38521:29680 mate1
GTTTCGGTCCCGCCGCCGCCGCCGCCGCCGCCACCGCCGCCGCCGCCGCCGCCCCGACCCGCGCGCCCTCCCGAGGGAGGACGCGGGGCCGGGGGGCGGAGACGGGGGAGGAGGAGGACGGACGGACGGACGGACGGGGCCCCCCGAGCC
>@illumina:38521:29680 mate2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>@illumina:46639:29680 mate1
TACTATTCAAAGTTCTTTTCAACTTTCCCTTACGGTACTTGTTGACTCCC
>@illumina:46639:29680 mate2
GGGAGTCAACAAGTACCGTAAGGGAAAGTTGAAAAGAACTTTGAATAGTA
>@illumina:48673:29712 mate1
CCCATTTAAAGTTTGAGAATAGGTTGAGATCGTTTTCGGCCCCAAGACCTCTAATCNTTCGCTTTACCGGATAAAACTGCGTGGCGGGGGTGCGTCGGGTCTGCGAGAGCGCCAGCTATCCTGAGGGAAACTTCGGAGGGAACCAGCTAC
>@illumina:48673:29712 mate2
GAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCNAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTNGCGCTCTCGCAGACCCGACGCAC
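
The "palindrome" pattern described above can be checked programmatically by testing whether mate2 equals the reverse complement of mate1, as appears to be the case for the `46639:29680` pair listed above. This is a minimal sketch under that interpretation: the function names are hypothetical, it requires an exact full-length match, and it does not handle partially overlapping mates or sequencing errors.

```python
# Translation table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mates_are_reverse_complements(mate1, mate2):
    """True if mate2 is exactly the reverse complement of mate1."""
    return mate2 == reverse_complement(mate1)


if __name__ == "__main__":
    # Pair 46639:29680 copied from the list above.
    mate1 = "TACTATTCAAAGTTCTTTTCAACTTTCCCTTACGGTACTTGTTGACTCCC"
    mate2 = "GGGAGTCAACAAGTACCGTAAGGGAAAGTTGAAAAGAACTTTGAATAGTA"
    print(mates_are_reverse_complements(mate1, mate2))  # prints True
```

A pair that passes this check covers the same fragment end to end from both strands, so the two mates carry the same sequence information.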