cbrueffer / tophat-recondition

Post-processor for TopHat unmapped.bam files making them usable by downstream software.
BSD 2-Clause "Simplified" License
7 stars 5 forks source link

"Mapped mate should have mate reference name" problem #1

Closed drchriscole closed 9 years ago

drchriscole commented 9 years ago

Hi,

Really liking your tool, it should make submitting my BAMs to ENA much simpler. However, I'm getting the error above and have exactly the same features 'id0' does in the SeqAnswers thread. e.g. offending read is only in unmapped.bam and the MAPQ is changed from 255 to 0 by your script.

I'm not using any special settings in TopHat (v2.0.9), only -i to adjust intron size. Do you have any suggestions for getting around it, please? Thanks,

Chris

cbrueffer commented 9 years ago

Hi Chris, could you post a bit more information about the problematic read, e.g. samtools view unmapped.bam | grep name-of-he-read Is the mate of the problematic read mapped or unmapped? If the former, could you provide the same command as above for the accepted_hits.bam file?

drchriscole commented 9 years ago

This is error I get with Picardtools SortSam command: Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 40355112, Read name HWI-ST1398:69:C26EMACXX:8:1315:6828:52823, Mapped mate should have mate reference name

$ samtools view unmapped.bam | grep HWI-ST1398:69:C26EMACXX:8:1315:6828:52823 HWI-ST1398:69:C26EMACXX:8:1315:6828:52823 133 * 0 255 * * 0 0 AAAAAATCCAACCATACTAATACATTTCCTATAAAATAAACTTGTTGATTTATCATTTGTTAATAAATTATTTATTACATCTTTTTGCTCTTGTTCATCTA BC@FFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJJIJJJIJGHJJJJJJJJJIJJJJJJJJJJJJJJJJHHHHHEHFFFFEEEEEEEEEEDD $ samtools view unmapped_fixup.bam | grep HWI-ST1398:69:C26EMACXX:8:1315:6828:52823HWI-ST1398:69:C26EMACXX:8:1315:6828:52823 133 * 0 0 * * 0 0 AAAAAATCCAACCATACTAATACATTTCCTATAAAATAAACTTGTTGATTTATCATTTGTTAATAAATTATTTATTACATCTTTTTGCTCTTGTTCATCTA BC@FFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJJIJJJIJGHJJJJJJJJJIJJJJJJJJJJJJJJJJHHHHHEHFFFFEEEEEEEEEEDD $ samtools view accepted_hits.bam | grep HWI-ST1398:69:C26EMACXX:8:1315:6828:52823

cbrueffer commented 9 years ago

The problem appears to be that this is paired-end data (read paired / second in pair flags set), however the paired read is nowhere to be found. Picard looks for the paired read using the same read name, can't find it, and concludes it must have a different name, hence the error message. Are both reads present in the original FASTQ files? If yes, TopHat seems to throw the paired read away for some reason.

Basically there are two things one could try: 1. run Picard with VALIDATION_STRINGENCY=LENIENT, or 2. remove the read paired / second in pair flags so Picard doesn't think the read is paired-end and hence doesn't look for it's partner.

drchriscole commented 9 years ago

Yes, I see them in the FASTQs.

So, this isn't something you've seen before and been able to deal with then? It's wierd that TopHat is discarding reads.

cbrueffer commented 9 years ago

I think I've seen reports of this in the Tuxedo-suite Google group, but I've never encountered it myself.

I could add some code to deal with this case the way suggested above (adjusting the flags), but I won't have time to do it until next week (and it's going to be fairly expensive, since there's no quick way to see which reads are affected). Could you test running Picard with VALIDATION_STRINGENCY=LENIENT, to see if that works first?

drchriscole commented 9 years ago

With the LENIENT filter I see >1 million entries, but not for all runs: ~50% I note this entry for TopHat v2.0.11 (I'm using 2.0.9): "Improved support for adding unpaired reads to PE reads in the same TopHat2 run [..] This includes reporting separate counts for the additional unpaired reads and making sure that the SAM flags in the output files reflect the paired or unpaired origin of the reads." So, I'll try a newer version and see. Thanks for your help and suggestions.

cbrueffer commented 9 years ago

Turns out the approach outlined above was actually quick and easy to implement. Could you please try script from the branch missingreads?

Basically it records the names of the unmapped reads with mapped mate, crosses out the names it can find in the mapped file, and for the remaining ones (the unmapped reads that are marked as having a mapped mate, but the mate can't be found) adjusts the flags.

drchriscole commented 9 years ago

Thanks very much for doing this. OK, so I get different, maybe unrelated?, error from Picard: "ERROR: Record 61183083, Read name HWI-ST1398:69:C26EMACXX:8:1101:1135:2241, First of pair flag should not be set for unpaired read." Grepping in my merged file I see: HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 68 * 0 0 * * 0 0 CAAACAGCAACAGATGCATCAATGACAGCAGAGGATGTATCAAAGATTGGAGAAGATTATGAGCTATTTTAATNNATACTTGAGGAACCAGGTGTCGATGA ;??DBDBD<D?:<?AE<+A+A<AE<+<A2A;@;E)?CDDD?D0:9BD)9/BD8@)8B@BD####################################### HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 141 0 0 0 0 CCAACATCTCTTGNNNNNNTTCAGTTGTCTCTTTNNCAATGGTTGCTCCATCGACACCTGGTTCCTCAAGTATTGATAANNATGGATCATAATCTTCTCCA ;5<395???><@######################################################################################### unmapped.bam: HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 69 * 0 255 * * 0 0 CAAACAGCAACAGATGCATCAATGACAGCAGAGGATGTATCAAAGATTGGAGAAGATTATGAGCTATTTTAATNNATACTTGAGGAACCAGGTGTCGATGA ;??DBDBD<D?:<?AE<+A+A<AE<+<A2A;@;E)?CDDD?D0:9BD)9/BD8@)8B@BD####################################### HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 133 0 255 0 0 CCAACATCTCTTGNNNNNNTTCAGTTGTCTCTTTNNCAATGGTTGCTCCATCGACACCTGGTTCCTCAAGTATTGATAANNATGGATCATAATCTTCTCCA ;5<395???><@######################################################################################### unmapped_fixup.bam: HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 68 * 0 0 * * 0 0 CAAACAGCAACAGATGCATCAATGACAGCAGAGGATGTATCAAAGATTGGAGAAGATTATGAGCTATTTTAATNNATACTTGAGGAACCAGGTGTCGATGA ;??DBDBD<D?:<?AE<+A+A<AE<+<A2A;@;E)?CDDD?D0:9BD)9/BD8@)8B@BD####################################### HWI-ST1398:69:C26EMACXX:8:1101:1135:2241 141 0 0 * 0 0 CCAACATCTCTTGNNNNNNTTCAGTTGTCTCTTTNNCAATGGTTGCTCCATCGACACCTGGTTCCTCAAGTATTGATAANNATGGATCATAATCTTCTCCA ;5<395???><@#########################################################################################

drchriscole commented 9 years ago

Running with the LENIENT flag reveals lots of reads with this problem.

cbrueffer commented 9 years ago

Thanks for testing! Could you test again with the latest version from the branch? This one resets a few more fields; if you still get an error it should be a different one.

drchriscole commented 9 years ago

Works! No Picard errors. Excellent, many thanks.

cbrueffer commented 9 years ago

Fantastic, thanks for testing! I'll add this to the main script and file a bug report with the TopHat people.

drchriscole commented 9 years ago

OK. I think this is less of a problem for newer versions of TopHat.

cbrueffer commented 9 years ago

I've filed the bug here: https://github.com/infphilo/tophat/issues/16 Could you add there which versions of TopHat you tested?

I'll release a new version of the script today, for your methods.

cbrueffer commented 8 years ago

Hi Chris, sorry for coming back to this, but regarding running Picard with VALIDATION_STRINGENCY=LENIENT you wrote "With the LENIENT filter I see >1 million entries, but not for all runs: ~50%". Could you elaborate on what you saw here? What did these numbers refer to? Did Picard crash or finish?