Isoforms Results Contains Large Expected Counts When Unsupported by Reads

DarioS commented 7 years ago

I manually inspected the top isoforms and I think I could have found a major bug in RSEM.

Consider a particular transcript quantitated in two samples. In the first sample, RSEM estimates 0 counts.

ENST00000589265.5   ENSG00000030582.16  1632    1468.72 0.00    0.00    0.00    0.00

But, in the second sample, it estimates 1069 counts.

ENST00000589265.5   ENSG00000030582.16  1632    1465.16 1068.58 30.85   21.32   5.18

This result contradicts the contents of the transcript-wise BAM files. All mappings for both samples have mapping quality of 0.

beforefilter

After filtering, it can be observed that all of the reads are removed.

filter

afterfilter

Also, I mapped this dataset to the genome with STAR and viewed the mapping result by IGV's Sashimi Plot and there are 0 reads supporting the junction indicated by pink arrows.

grngenome

This problem causes many of the top-ranked transcripts to be false positives, because they really should be 0 reads in both conditions. I am using RSEM 1.3 with -p 2 --strandedness reverse --paired-end options. I can share the input files, if needed.

bli25wisc commented 7 years ago

@DarioS , for transcript bams, do you refer to RSEM-generated BAM files or aligner-generated BAM files? It is likely that in sample2 the read mapping quality is slightly better than in sample1 and therefore get some read counted. RSEM does not use a hard threshold, such as mapping quality of 1. Instead, for each alignment, it calculates a posterior probability that the read should come from the alignment. Therefore, if you have a huge amount of low quality alignments, you will get some read counts in the end. Suppose you have one low quality alignment and you think the probability that it is true is 0.001. If you only have one alignment, you can safely discard it because it is unlikely to be true. However, if you have 1M of these alignments, the expected number of "true" alignments will become 1000.

Hope it helps.

DarioS commented 7 years ago

There are no alignments which specifically support this transcript in either sample. They are output by RSEM. Are you sure that RSEM operates correctly in the case where one transcript contains a subset of the exons of another, but none that are unique to itself? I observe unusual counts for these instances, which frequently occur in the GENCODE annotation.

bli25wisc commented 7 years ago

@DarioS , can you share with us some data so that we can look into details?

Thanks, Bo

acesnik commented 6 years ago

We're looking specifically at splice forms of MALAT1, and we're running into similar trouble as reported above.

It looks like this issue may have gone dormant, so I wanted to provide some example data. We tried to quantify them with SRX1603582 from GEO SRA using options --time --calc-ci --star --star-path --output-genome-bam --sort-bam-by-coordinate --sort-bam-memory-per-thread --paired-end.

The MALAT1 Sashimi plot for the genome.sorted.bam file looks a bit different than the Sashimi plot for the Aligned.sortedByCoord.out.bam file, but the problem still exists. For example, one MALAT1 transcript supposedly has 90 TPM, but no supporting reads spanning one of the exon-exon junctions in either file. Similar for another MALAT1 transcript: 76 TPM and no supporting junction-spanning reads.

deweylab / RSEM

Isoforms Results Contains Large Expected Counts When Unsupported by Reads #43