DaehwanKimLab / tophat

Spliced read mapper for RNA-Seq
http://ccb.jhu.edu/software/tophat
Boost Software License 1.0
90 stars 46 forks source link

Incorrect MAPQ values reported when running with -G #25

Open s-andrews opened 8 years ago

s-andrews commented 8 years ago

We’ve been getting some odd results in a single cell study and tracked these down to a set of genes which had odd clusters of mapped regions over only part of one exon of the gene.

When we looked at the BAM files for those reads they were all reported as unique best hits by tophat (single hit reported, MAPQ=50), but taking the original read sequence and mapping back to the genome found multiple 100% identity matches.

We did some tests and found that the reason was that if tophat was run with a gtf file supplied, that if it found a single hit within the transcriptome it never went on to map to the genome, so didn’t correctly determine that a hit wasn’t unique. Subsequent filtering by MAPQ value was therefore ineffective and this generated enough signal in our data to bias the downstream analysis.

An example to illustrate this would be to use the following paired end reads against Mouse GRCm38.

@HS32_17257:2:1101:1636:100042#69
AGCAACCACATGGTGGTTCACAACCATCTGTAACAAGATCTGACTCCCTCTTCTGGAGTGTCTGAAGACAGCTACAGTGTACTTACAT
+
CCCFFFFFHHHHHIJJIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIIIJJJIJJJJHHHHHFFFEDFFEEEEEE

@HS32_17257:2:1101:1636:100042#69
ATGTAAGTACACTGTAGCTGTCTTCAGACACTCCAGAAGAGGGAGTCAGATCTTGTTACAGATGGTTGTGAACCACCATGTGGTTGCT
+
@BCDDFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJIJIHIJJGIHIJJJJJJJJJJGJJJGIJIHJJJJIHIHHHHFFFFEEEDDDD

If you run: /bi/apps/tophat/2.0.12/tophat -p 6 -g 2 -G /bi/scratch/Genomes/Mouse/GRCm38/Mus_musculus.GRCm38.70.gtf -o single_artefact_tophat /bi/scratch/Genomes/Mouse/GR Cm38/Mus_musculus.GRCm38 single_artefact_1.fastq single_artefact_2.fastq

You get a single hit with a MAPQ of 50.

HS32_17257:2:1101:1636:100042#69        147     1       58757219        50      88M     =       58757219        -88     AGCAACCACATGGTGGTTCACAACCATCTGTAACAAGAT
CTGACTCCCTCTTCTGGAGTGTCTGAAGACAGCTACAGTGTACTTACAT       DDDDEEEFFFFHHHHIHIJJJJHIJIGJJJGJJJJJJJJJJIHIGJJIHIJIJJJJJJJJJJJJJJJIJJJJJJJHHHHHFFFDDCB@        AS:i:-5        XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:16C71      YT:Z:UU XS:A:+  NH:i:1
HS32_17257:2:1101:1636:100042#69        99      1       58757219        50      88M     =       58757219        -88     AGCAACCACATGGTGGTTCACAACCATCTGTAACAAGAT
CTGACTCCCTCTTCTGGAGTGTCTGAAGACAGCTACAGTGTACTTACAT       CCCFFFFFHHHHHIJJIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIIIJJJIJJJJHHHHHFFFEDFFEEEEEE        AS:i:-6        XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:16C71      YT:Z:UU XS:A:+  NH:i:1

However if you run the same thing without supplying a GTF file: /bi/apps/tophat/2.0.12/tophat -g 2 -o single_artefact_tophat_no_gtf /bi/scratch/Genomes/Mouse/GRCm38_ERCC/Mus_musculus.GRCm38_ERCC single_artefact_1.fastq single_artefact_2.fastq

You get multiple hits reported, where the primary hit is shown with a MAPQ of only 3

HS32_17257:2:1101:1636:100042#69        403     14      63224785        3       88M     =       63224785        -88     AGCAACCACATGGTGGTTCACAACCATCTGTAACAAGAT
CTGACTCCCTCTTCTGGAGTGTCTGAAGACAGCTACAGTGTACTTACAT       DDDDEEEFFFFHHHHIHIJJJJHIJIGJJJGJJJJJJJJJJIHIGJJIHIJIJJJJJJJJJJJJJJJIJJJJJJJHHHHHFFFDDCB@        AS:i:0XN:i:0   XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:88 YT:Z:UU NH:i:2  CC:Z:17 CP:i:82269288   HI:i:0
HS32_17257:2:1101:1636:100042#69        355     14      63224785        3       88M     =       63224785        -88     AGCAACCACATGGTGGTTCACAACCATCTGTAACAAGAT
CTGACTCCCTCTTCTGGAGTGTCTGAAGACAGCTACAGTGTACTTACAT       CCCFFFFFHHHHHIJJIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIIIJJJIJJJJHHHHHFFFEDFFEEEEEE        AS:i:0XN:i:0   XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:88 YT:Z:UU NH:i:2  CC:Z:17 CP:i:82269288   HI:i:0

I can’t really find a nice way to work around this within tophat, other than mapping everything twice (with and without GTF) and collating results afterwards. I get that doing the genomic mapping for reads where you found a hit in the transcriptome would be expensive, but at the moment we’re getting data which is reported incorrectly, and is having demonstrable negative effects on our datasets.

ndaniel commented 8 years ago

It looks like this is an issue regarding the definition of uniqueness. Uniquely mapping on transcriptome or uniquely mapping on transcriptome and genome. Is one sure that that read which is not mapping on the known transcriptome and is mapping on intronic/inter-genic region is not an artefact? Has this been validated in the wet-lab? What do we trust more? A read mapping on genome but not on transcriptome more than the known transcriptome defined by the GTF? Which way is the best to define uniqueness of mapping (transcriptome vs transcriptome-genome) for DEGs or for finding fusion genes or finding mutations?

In an RNA-seq experiment the assumption is that most likely most of the reads are coming from transcriptome (RNA is even in the name of the RNA-seq), like for example Kallisto is assuming, and therefore it makes sense for one to assume that reads mapping on transcriptome have highest priority.

Probably the most easy way around this is to modify the GTF by adding a line (and a mock transcript) to it which defines as a mock transcript the region around chr14:63224785. This would required to run only once TopHat using the new GTF. If there is strong evidence that the GTF is missing information or has errors then probably the GTF should be modified and not the definition of the uniqueness?

s-andrews commented 8 years ago

The example I posted is only one of a bunch of genes where we've seen the same effect (tens of thousands of reads). In these cases we see a short region (~100-200bp) with a ton of reads which sits within one exon of an otherwise unexpressed transcript. If you look at the reads under that region they are generally shorter than those elsewhere in the genome and if you map them with bowtie2 to the genome you get multiple100% identity hits for each one.

There's no way these mapped positions are correct, and their combined effect is strong enough that it comes out as the first principal component in an analysis across a large single cell RNA-Seq dataset which is why we started looking at this.

We've also looked at the regions where we find these mappings in some of the datasets Ensembl makes available as tracks. Under pretty much all of them you see a huge block of SNP predicitions, a set of fusion gene predictions and (oddly) a hit from a study predicting CTCF binding sites. I'd be willing to wager a pint or two that all of these results are rubbish and are the result of the same mis-mapping artefact we're seeing.

ndaniel commented 8 years ago

The example I posted is only one of a bunch of genes where we've seen the same effect (tens of thousands of reads). In these cases we see a short region (~100-200bp) with a ton of reads which sits within one exon of an otherwise unexpressed transcript. If you look at the reads under that region they are generally shorter than those elsewhere in the genome and if you map them with bowtie2 to the genome you get multiple100% identity hits for each one. There's no way these mapped positions are correct, and their combined effect is strong enough that it comes out as the first principal component in an analysis across a large single cell RNA-Seq dataset which is why we started looking at this. We've also looked at the regions where we find these mappings in some of the datasets Ensembl makes available as tracks. Under pretty much all of them you see a huge block of SNP predicitions, a set of fusion gene predictions and (oddly) a hit from a study predicting CTCF binding sites. I'd be willing to wager a pint or two that all of these results are rubbish and are the result of the same mis-mapping artefact we're seeing.

Let's assume that this finding is correct. Then you have found new transcripts/exons which still need to be validated in the wet-lab ;-) Afterwards one may submit the new found transcripts/exons to Ensembl/GeneCode/Refseq/etc. in order to have them reflected in future GTF files such that the whole research community can use it.

Anyway, the solution is very simple in this case. Just add one mock transcript for each of mouse's chromosomes to the GTF. The mock transcript would contain just on exon which starts from the first nucleotide of the chromosome and end with the last nucleotide of the chromosome. If you use a stranded RNA-seq protocol then it is needed to add two mock transcripts per chromosome (one for each strand). Of course this GTF will be used only for mapping and not for counting the reads (for counting the reads use the original GTF). This should give you the effect you desire! That's it!

touala commented 8 years ago

Hi, I think you can prevent it to append using this option --read-realign-edit-dist 0. It will realign reads at every steps (transcriptome -> genome -> splice junction).

s-andrews commented 8 years ago

Thanks for the suggestions. I've been doing some work on this looking at a few different aligners. To illustrate how much of a potential problem this is (at least with the dataset we have), take a look at this scatterplot of the log2 counts from tophat and hisat2. All of the really problematic genes we hit are in the line at the bottom of the plot, and the ones which are off-axis in the body of the plot also seem to be cases where potential multi-mapping wasn't spotted.

I get that there is a mapping advantage in assuming that everything in the library is either a known transcript, or completely novel transcription with no similarity to an existing transcript, but given the unwanted side effects we've seen in real data, and the inability to filter these under default settings then there are at least some cases where this can give quite misleading results.

tophat_vs_hisat_scatterplot

ndaniel commented 8 years ago

Here is another problem of aligners regarding pseudo-genes alignment.

Anyway, the current biology knowledge says that very large portions of genome (i.e. junk DNA) are expressed as RNA (it is basically noise). If a DNA is transcribed as RNA does not mean that it has a biological function also. Therefore most of the transcriptome today contains RNAs known and proven to have a biological function.

From what I see here is that indeed there are regions of genome which are highly expressed as RNA under different conditions. This is nothing new at all. The starting hypothesis here is that this is junk DNA which is expressed as RNA and it has no biological function. In order to have it included in the transcriptome then one would need to reject this hypothesis thru wet-lab work.

More about this is here in this very good article: D. Graur et al. On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE, 2013.