DaehwanKimLab / tophat

Spliced read mapper for RNA-Seq
http://ccb.jhu.edu/software/tophat
Boost Software License 1.0
91 stars 46 forks source link

Missing reads in accepted_hits.bam in some (as yet unknown) conditions #16

Open cbrueffer opened 9 years ago

cbrueffer commented 9 years ago

Problem

With paired-end reads where one mate is unmapped and the other is mapped (at least according to the "mate is mapped" flag of the unmapped read), there are cases where the mapped read goes missing (i.e., it's not in either accepted_hits.bam or unmapped.bam). The missing read is present in the original FASTQ files. The result is that downstream tools like Picard stumble over these reads.

Reproducability

The conditions that cause this to happen are currently unknown. @drchriscole can reproduce this on his data (default TopHat flags except for -i).

Workaround

This problem can be worked around by finding the unmapped reads with "mate is mapped" flag set, checking for the mate's existence and resetting the mate-related flags of the unmapped reads where the mate is absent. I added this functionality to https://github.com/cbrueffer/tophat-recondition with the help of @drchriscole

drchriscole commented 9 years ago

This is for data generated with v2.0.9 of TopHat. I know it's not the most recent, but it's for an experiment which we now want to publish.

cbrueffer commented 9 years ago

Looking at the changelog, the only change I see that could be related to this is in 2.0.10, "Fixed a bug that could sometimes incorrectly rename the reads in the output alignments".

Would you be able to test whether this problem still happens with a recent version?

drchriscole commented 9 years ago

I can confirm this occurs in TopHat 2.1.0 TopHat command: CL:tophat -G DictyEnsemble.gtf -p 8 -I 146 -o DGC-RI2-A DictyEnsemble.fas 14_LIB375_LDI316_CCGTCC_L008_R1_001.fastq.gz 14_LIB375_LDI316_CCGTCC_L008_R2_001.fastq.gz -o DGC-RI2-A samtools (v0.1.20) merge: samtools merge merged_RI2A.bam DGC-RI2-A/*.bam Picard SortSam (v1.95): SortSam I=merged_RI2A.bam O=sorted_RI2A.bam SO=coordinate CREATE_INDEX=true [Thu Aug 20 17:27:27 BST 2015] net.sf.picard.sam.SortSam INPUT=merged_RI2A.bam OUTPUT=sorted_RI2A.bam SORT_ORDER=coordinate CREATE_INDEX=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false [Thu Aug 20 17:27:27 BST 2015] Executing as xxxxxxx on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_05-b13; Picard version: 1.95(1496) INFO 2015-08-20 17:29:44 SortSam Read 10,000,000 records. Elapsed time: 00:02:16s. Time for last 10,000,000: 136s. Last read position: 2:2,788,485 INFO 2015-08-20 17:31:12 SortSam Read 20,000,000 records. Elapsed time: 00:03:44s. Time for last 10,000,000: 88s. Last read position: 3:142,580 INFO 2015-08-20 17:32:08 SortSam Read 30,000,000 records. Elapsed time: 00:04:40s. Time for last 10,000,000: 55s. Last read position: 4:1,216,604 INFO 2015-08-20 17:33:01 SortSam Read 40,000,000 records. Elapsed time: 00:05:33s. Time for last 10,000,000: 53s. Last read position: 5:3,114,128 [Thu Aug 20 17:33:49 BST 2015] net.sf.picard.sam.SortSam done. Elapsed time: 6.36 minutes. Runtime.totalMemory()=3865575424 To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 48757728, Read name HWI-ST1398:69:C26EMACXX:8:1101:1239:2242, Mapped mate should have mate reference name at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448) at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:541) at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522) at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650) at net.sf.picard.sam.SortSam.doWork(SortSam.java:68) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119) at net.sf.picard.sam.SortSam.main(SortSam.java:57)