duncanca / mosaik-aligner

Automatically exported from code.google.com/p/mosaik-aligner

MosaikSort loses all non-unique reads during sorting when using -a all|multi|single in MosaikAligner #45

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Align reads to reference with options for MosaikAligner set to -m all
and -a all
2. Sort the alignment with MosaikSort
3. Output sam files from both the unsorted and sorted alignment files with
MosaikText and compare the numbers of reads in them.
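The comparison in step 3 can be sketched in Python as follows. This is a minimal sketch, not part of Mosaik: the function names and the file names `unsorted.sam` and `sorted.sam` are hypothetical, and it assumes single-end data, where each read has its own name in the first SAM column.

```python
from collections import Counter


def sam_counts(lines):
    """Return (alignment_records, distinct_reads, multiply_aligned_reads).

    `lines` is any iterable of SAM text lines. Header lines (starting
    with '@') and blank lines are skipped; the read name is taken from
    the first tab-separated column (QNAME).
    """
    per_read = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        qname = line.split("\t", 1)[0]
        per_read[qname] += 1
    records = sum(per_read.values())
    multi = sum(1 for n in per_read.values() if n > 1)
    return records, len(per_read), multi


def sam_counts_from_file(path):
    """Convenience wrapper: count records in a SAM file on disk."""
    with open(path) as handle:
        return sam_counts(handle)


# Hypothetical usage, file names are placeholders:
# print(sam_counts_from_file("unsorted.sam"))
# print(sam_counts_from_file("sorted.sam"))
```

If sorting preserved everything, both files would give the same three numbers; a drop in multiply aligned reads after sorting would point at the behaviour described below.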

What is the expected output? What do you see instead?

I would expect the read numbers in the sam files to be the same regardless
of whether the alignment was sorted or not. Instead, the sam file generated
from the sorted alignment only contains the unique reads, as shown by
the Mosaik output:

- num reads in sam file made from unsorted alignment: 121386
- num reads in sam file made from sorted alignment: 81756

- Mosaik output:
Single-end read statistics:
======================================================
                     reads              alignments
------------------------------------------------------
# non-unique:      9326 (10.2 %)       39630 (32.6 %)
# unique:         81756 (89.8 %)       81756 (67.4 %)
------------------------------------------------------
total:            91082               121386

This suggests to me that the sorting process in MosaikSort eliminates all
non-unique reads, so they are excluded from the remainder of the Mosaik
pipeline. This affects downstream components such as SNP detection very
badly: you get a large number of false negatives because the reads present
in the gig files do not reflect the *actual* coverage.
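As a cross-check of the figures reported above, the arithmetic can be written out in a few lines of Python. This only restates the numbers from the Mosaik output and the two sam files; the variable names are made up for illustration.

```python
# Figures copied from the MosaikAligner statistics above.
reads = {"non_unique": 9326, "unique": 81756}
alignments = {"non_unique": 39630, "unique": 81756}

# Record counts of the two sam files.
unsorted_sam = 121386
sorted_sam = 81756

total_reads = sum(reads.values())
total_alignments = sum(alignments.values())
missing_after_sort = unsorted_sam - sorted_sam

# The unsorted sam file holds every alignment; the sorted one holds
# exactly as many records as there are unique alignments, and the
# difference equals the number of non-unique alignments.
print(total_reads, total_alignments, missing_after_sort)
print(unsorted_sam == total_alignments)
print(sorted_sam == alignments["unique"])
print(missing_after_sort == alignments["non_unique"])
```

All three comparisons hold for the reported numbers, which is consistent with the records lost during sorting being exactly the non-unique alignments.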

What version of the product are you using? On what operating system?

Mosaik 1.0.1388 on 64-bit Linux (CentOS release 5.4)

Please provide any additional information below.

I can make the files from the example above available on request -- they
are too large to attach. I used the C. elegans example data provided with
Mosaik to compute the example shown above.  

I have tried this out with 3 completely separate datasets now and the
problem occurs in all cases.

I tried all combinations of parameter values for -m and -a, and the
resulting matrix looks as follows:

-m all     -a all      fail
-m all     -a multi    fail
-m all     -a single   fail
-m all     -a fast     pass

-m unique  -a all      fail
-m unique  -a multi    fail
-m unique  -a single   fail
-m unique  -a fast     pass

This suggests to me that while the problem itself probably occurs in
MosaikSort, it is at least correlated with the parameter settings in
MosaikAligner, and the culprit seems to be the -a parameter (the problem
is triggered by all values for this parameter except 'fast').
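The reading of the matrix above can be checked mechanically. The sketch below is only an illustration: it encodes the eight reported outcomes by hand and tests whether the result is determined by the -a value alone or by the -m value alone. It does not run Mosaik, and the names are hypothetical.

```python
from collections import defaultdict

# (value of -m, value of -a) -> outcome, copied from the matrix above.
results = {
    ("all", "all"): "fail",
    ("all", "multi"): "fail",
    ("all", "single"): "fail",
    ("all", "fast"): "pass",
    ("unique", "all"): "fail",
    ("unique", "multi"): "fail",
    ("unique", "single"): "fail",
    ("unique", "fast"): "pass",
}

# Collect the set of outcomes seen for each individual parameter value.
outcomes_by_m = defaultdict(set)
outcomes_by_a = defaultdict(set)
for (m_value, a_value), outcome in results.items():
    outcomes_by_m[m_value].add(outcome)
    outcomes_by_a[a_value].add(outcome)

# A parameter "determines" the outcome if each of its values always
# leads to the same result, whatever the other parameter is set to.
a_determines = all(len(seen) == 1 for seen in outcomes_by_a.values())
m_determines = all(len(seen) == 1 for seen in outcomes_by_m.values())
failing_a_values = sorted(a for a, seen in outcomes_by_a.items() if seen == {"fail"})

print(a_determines, m_determines, failing_a_values)
```

For the reported matrix, the outcome is fixed by -a alone and not by -m, and the failing -a values are all, multi and single.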

Original issue reported on code.google.com by micha.ba...@hutton.ac.uk on 17 Mar 2010 at 3:55

GoogleCodeExporter commented 9 years ago
Has this error been corrected?  I am using Mosaik version 1.0.1388, and I am 
seeing the same results.  Although MosaikAligner retains the unique AND 
non-unique reads, MosaikSort discards the non-unique reads.  This could be a 
serious problem downstream for me, because I would like to use DupSnoop to 
remove duplicates, but I also want to retain non-unique reads.

Laura

Original comment by laura.w...@gmail.com on 15 Feb 2011 at 5:18