deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0

Output BAM File Contains Inconsistent Read IDs #37

Closed DarioS closed 7 years ago

DarioS commented 7 years ago

I ran rsem-calculate-expression with the default Bowtie aligner, but I had trouble determining the number of reads in the resulting BAM file. After some investigation, I found that the cause is that the read IDs have a different number of fields for the first and second reads of a read pair. For example:

$ samtools view 30588WD_PRE.transcript.bam | grep 700666F:126:C8768ANXX:1:1103:1958:2081
700666F:126:C8768ANXX:1:1103:1958:2081  77      *       0       0       *       *       0       0       ATTTTTTTTCTTTATAAATTACGCAATCTATGGTATTCTCTTATAGCAACAGAAAACAGACTAAGACAACCATATTCCCAGTGCTTAGAACAGCCTCTGT    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG1F@111<1FCFFGGB>1<CEGGEGGFGGBFCGGG1CCGCF@FFGGGGG0B0F>BGG0E   XM:i:0
700666F:126:C8768ANXX:1:1103:1958:2081 2:N:0:GTCCGC     141     *       0       0       *       *       0       0       GTTTATTCAACAGTTTATTCAAAACACGTTTATTGATCATCTCCTGTGAGACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCT   BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGGGGG    XM:i:0

The extra 2:N:0:GTCCGC field is kept for the second read but not for the first. Could the read IDs be made consistent for both reads of a pair?

I also mapped this data to the genome with STAR, and the read IDs in the resulting BAM file are formatted consistently:

$ samtools view 30588WD_PREAligned.sortedByCoord.out.bam | grep 700666F:126:C8768ANXX:1:1103:1958:2081
700666F:126:C8768ANXX:1:1103:1958:2081  163     chr22   17560462        255     100M    =       17560512        150     GTTTATTCAACAGTTTATTCAAAACACGTTTATTGATCATCTCCTGTGAGACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCT   BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGGGGG    NH:i:1  HI:i:1  AS:i:198        nM:i:0
700666F:126:C8768ANXX:1:1103:1958:2081  83      chr22   17560512        255     100M    =       17560462        -150    ACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCTGTTGCTATAAGAGAATACCATAGATTGCGTAATTTATAAAGAAAAAAAAT   E0GGB>F0B0GGGGGFF@FCGCC1GGGCFBGGFGGEGGEC<1>BGGFFCF1<111@F1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC    NH:i:1  HI:i:1  AS:i:198        nM:i:0

bli25wisc commented 7 years ago

Hi @DarioS, we are aware of this Bowtie behavior. If you update to the latest version of RSEM, RSEM will extract the canonical read name (any text after a space is discarded).
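
For BAM files that were already produced with the inconsistent IDs, the same normalization can be applied when counting reads. The following is only a minimal sketch, not RSEM's actual implementation: it assumes the input is tab-separated SAM text (for example, the output of `samtools view`), and the function names `canonical_read_name` and `count_unique_reads` are hypothetical names chosen for illustration.

```python
from typing import Iterable


def canonical_read_name(qname: str) -> str:
    """Return the read name up to the first whitespace character.

    Anything after the first space (e.g. "2:N:0:GTCCGC") is discarded.
    """
    parts = qname.split(None, 1)
    return parts[0] if parts else qname


def count_unique_reads(sam_lines: Iterable[str]) -> int:
    """Count distinct canonical read names in tab-separated SAM records.

    Both mates of a pair share one canonical name, so a pair counts once.
    """
    names = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        # The first tab-separated column is the read ID (QNAME).
        qname = line.split("\t", 1)[0]
        names.add(canonical_read_name(qname))
    return len(names)
```

With the two records from the report above, both IDs reduce to 700666F:126:C8768ANXX:1:1103:1958:2081, so the pair is counted once. Passing `sys.stdin` to `count_unique_reads` would allow piping `samtools view` output into a script built on this sketch.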