MosaikAligner - when using -om, mate-pairing is not output correctly

GoogleCodeExporter commented 8 years ago

A) You can reproduce the alignment by aligning paired end reads to a series of 
chromosomes which have homologous regions and then requesting the "-om" 
parameter to output all alignments.

B) You can also make a small, fake FASTA with an exon-like structure, align to 
it with -om, and inspect the output:

>G1
regiona-regionb-regionc
>G2
regiona-regionc
>G3
regionb
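
A minimal sketch for generating such a toy reference, assuming random sequence is good enough for the shared "regions". The region length, seed, and function names are made up for illustration and are not part of Mosaik:

```python
import random


def random_seq(length, rng):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_toy_reference(region_len=300, seed=0):
    """Build three 'genes' that share regions, mimicking alternative isoforms.

    G1 = a + b + c, G2 = a + c (b skipped), G3 = b only.
    region_len and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    a, b, c = (random_seq(region_len, rng) for _ in range(3))
    return {"G1": a + b + c, "G2": a + c, "G3": b}


def to_fasta(records, width=60):
    """Format a {name: sequence} dict as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_fasta(build_toy_reference()), end="")
```

Paired reads simulated from G1 should then have valid placements on more than one reference, which is the situation that triggers the problem.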

C) The best output would be 

alignment
mate-alignment
alignment
mate-alignment
... etc
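
To make the requested ordering concrete, here is a rough sketch that regroups records after the fact. It assumes that a first-in-pair and a second-in-pair alignment of the same read on the same reference belong together, which is a simplification and not how Mosaik decides pairing. The tuple layout and function name are hypothetical:

```python
from collections import defaultdict


def interleave_pairs(records):
    """Order records as alignment, mate-alignment, alignment, mate-alignment...

    records: iterable of (qname, flag, rname, pos) tuples in any order.
    Assumption for this sketch: mates on the same reference form a pair.
    Records without a partner on the same reference are dropped.
    """
    firsts = defaultdict(list)
    seconds = defaultdict(list)
    for rec in records:
        qname, flag, rname, _pos = rec
        if flag & 0x40:            # first segment in the template
            firsts[(qname, rname)].append(rec)
        elif flag & 0x80:          # last segment in the template
            seconds[(qname, rname)].append(rec)

    ordered = []
    for key in sorted(firsts):
        first_hits = sorted(firsts[key], key=lambda r: r[3])
        second_hits = sorted(seconds.get(key, []), key=lambda r: r[3])
        # Naive pairing by position order within one reference.
        for first, second in zip(first_hits, second_hits):
            ordered.extend([first, second])
    return ordered
```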

D) Mosaik currently outputs multiple alignments in a somewhat arbitrary order, 
with mates appearing "at the end", and incorrect mate-pairing is a common 
occurrence.  I'm not certain whether this incorrect pairing is an artifact of 
the patch I had to apply to prevent Mosaik from crashing, i.e. whether it did 
find all proper pairs but then failed to output one.

E) EXAMPLE 1 INCORRECT OUTPUT (samtools view | grep seq.213536), WITH SEQ's 
REMOVED, for BREVITY:

seq.213536      81      GENE.10056      435     0       100M    =               351     0
seq.213536      81      GENE.10064      85      0       100M    GENE.10056      351     0
seq.213536      161     GENE.10056      351     0       100M    GENE.101117     435     0
seq.213536      161     GENE.10064      1       0       100M    GENE.101117     435     0
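
The records above can be checked mechanically. Flag 81 is 0x1 + 0x10 + 0x40 (paired, reverse strand, first in pair) and 161 is 0x1 + 0x20 + 0x80 (paired, mate reverse, second in pair). The sketch below flags every record whose mate has an alignment on the same reference but whose mate fields do not point at it. Treating "same reference" as the proper pairing is an assumption of this sketch, and the helper names are hypothetical:

```python
from collections import namedtuple

Rec = namedtuple("Rec", "qname flag rname pos mapq cigar rnext pnext tlen")

# Example 1 from this report, sequences removed.
EXAMPLE_1 = """\
seq.213536 81 GENE.10056 435 0 100M = 351 0
seq.213536 81 GENE.10064 85 0 100M GENE.10056 351 0
seq.213536 161 GENE.10056 351 0 100M GENE.101117 435 0
seq.213536 161 GENE.10064 1 0 100M GENE.101117 435 0
"""


def parse(line):
    """Parse the first nine whitespace-separated SAM fields of a record."""
    f = line.split()
    return Rec(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
               f[5], f[6], int(f[7]), int(f[8]))


def mate_number(flag):
    """Return 1 or 2 for first/second in pair, 0 if neither bit is set."""
    if flag & 0x40:
        return 1
    if flag & 0x80:
        return 2
    return 0


def mispaired(records):
    """Records whose mate aligns to the same reference but is not referenced."""
    placed = {}
    for r in records:
        key = (r.qname, mate_number(r.flag), r.rname)
        placed.setdefault(key, []).append(r.pos)

    bad = []
    for r in records:
        n = mate_number(r.flag)
        if n == 0:
            continue
        local_mates = placed.get((r.qname, 3 - n, r.rname), [])
        mate_ref = r.rname if r.rnext == "=" else r.rnext
        if local_mates and (mate_ref != r.rname or r.pnext not in local_mates):
            bad.append(r)
    return bad


if __name__ == "__main__":
    records = [parse(line) for line in EXAMPLE_1.splitlines()]
    for r in mispaired(records):
        print(r.qname, r.flag, r.rname, r.pos, "->", r.rnext, r.pnext)
```

On example 1 this reports the last three records; only the first one points at its mate on the same reference.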

F) EXAMPLE 2, improper pair, when correct pairs (in both genes) exist:

The correct output should be two mate-pairs.  Instead you get only one half of 
the alignments; the other half get discarded because of the ref-index error 
(see http://code.google.com/p/mosaik-aligner/issues/detail?id=120):

seq.180534      97      GENE.10056      672     0       100M    GENE.10064      448     0
seq.180534      97      GENE.10064      322     0       100M    =               448     0

Partially correct output from bwa.  Both bwa and bowtie miss the GENE.10056 
alignments; Mosaik finds half of them.

seq.180534a     99      GENE.10064      322     29      100M    =       448     226
seq.180534b     147     GENE.10064      448     37      100M    =       322     -226
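
For reference, the 226 / -226 in the bwa records is the signed template length: the span from the leftmost aligned base to the rightmost one, positive for the leftmost read and negative for the rightmost. A small sketch of that arithmetic, assuming both reads cover 100 reference bases (100M) and using a hypothetical helper name:

```python
def template_length(pos, ref_len, mate_pos, mate_ref_len):
    """Signed template length for one read of a pair on the same reference.

    pos / mate_pos are 1-based leftmost positions; ref_len / mate_ref_len
    are the number of reference bases each alignment covers.
    Simplification in this sketch: when both reads start at the same
    position, the result is reported as positive.
    """
    start = min(pos, mate_pos)
    end = max(pos + ref_len, mate_pos + mate_ref_len)  # one past the last base
    span = end - start
    return span if pos <= mate_pos else -span


if __name__ == "__main__":
    print(template_length(322, 100, 448, 100))   # first read of the bwa pair
    print(template_length(448, 100, 322, 100))   # its mate
```

The Mosaik records above carry 0 in this field instead.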

Original issue reported on code.google.com by earone...@gmail.com on 29 Aug 2012 at 3:37

GoogleCodeExporter commented 8 years ago

Note: what is incorrect in example 1 is that the mate reference field is not 
set to '=', although it should be, since all of these alignments have proper 
mates.

Original comment by earone...@gmail.com on 29 Aug 2012 at 3:38

GoogleCodeExporter commented 8 years ago

I think this is the reason why samtools segfaults.  Samtools has no problem 
with multiple mappings (bowtie output works), as long as the mate-pairing 
information is correct, i.e. each read that has the mate bit set actually has a 
mate, with an ID, an ISIZE and a POSITION.

I would like to use Mosaik as an alternative to bowtie for RNA-seq.  It is a 
FAR better aligner for important data sets: bowtie misses up to 15% of 
alternative isoform alignments, and Mosaik gets all of them.  But the -om 
option has a number of issues.

Original comment by earone...@gmail.com on 29 Aug 2012 at 7:33
