duncanca / mosaik-aligner

Automatically exported from code.google.com/p/mosaik-aligner

MosaikBuild deletes 20% of reads randomly #77

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1.  generate a dummy reads file test.fa containing 5 reads:
>D_1
CTAGCTAGCTAGCTAGCTACTAGCTGAGCTAGCAGTGCTATATATCTAGCTGCTAAGCTACAG
>D_2
CTCACTACTACAGAATA
>D_3
ACTATTATGTATGTATAATT
>D_4
AGCTACTAGCTACAGTACTATGTTACTAGCTGTGCTCT
>D_5
AGCTAGCTAGCTAGCAGCTGCTACTAGCTAGCTAGCTACTCCTATTACAGTATA
>D_6
CTACTGCTACTACAACTT
>D_7
AGCTAGCTAGCTATCTACTAGCGTACTATATATTATT
>D_8
ATCTGCTAATAATATA
>D_9
AGCTAGCTAGTAGCTAGCTAGCTAGCTTCACTACTGCTCA
>D_10
AGCTAGCTAGCTACTGCTAGCTATGCTCTACTAGCTACTACTACTACTAGCTATACTACTACTAAGCT

2. MosaikBuild -fr test.fa  -out test.dat -assignQual 99 -st "Helicos"

What is the expected output? What do you see instead?
All reads should be processed because there are no Ns. Instead, I saw
Filtering statistics:
============================================
# reads deleted:                 1 ( 20.0 %)
--------------------------------------------
# reads written:                 4
# bases written:               175

What version of the product are you using? On what operating system?
MosaikBuild 1.1.0018   Linux CentOS release 5 

Please provide any additional information below.

Original issue reported on code.google.com by piconano...@gmail.com on 10 Nov 2010 at 5:11

GoogleCodeExporter commented 8 years ago
Hi there,

The current MOSAIK only considers reads longer than 20bp.
In this case, D_2, D_6, and D_8 would be deleted by MosaikBuild.

I also tested lowering the length threshold to 8bp, and that works well for 
your case: no unexpected errors came up in the subsequent steps.
Consequently, I'll consider decreasing the threshold in the next release.
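
The length filter described above can be sketched as follows. This is a hypothetical illustration, not MosaikBuild's code: parse_fasta and split_by_length are invented names, and the sketch assumes that reads strictly shorter than the threshold are dropped (consistent with D_3, which is exactly 20bp, not being listed as deleted).

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_by_length(records, min_read_length=20):
    """Split read names into (kept, dropped) using a minimum-length cutoff.

    Assumption: a read is dropped only if it is strictly shorter than
    min_read_length. The real comparison in MosaikBuild is not shown in
    this thread.
    """
    kept = [name for name, seq in records if len(seq) >= min_read_length]
    dropped = [name for name, seq in records if len(seq) < min_read_length]
    return kept, dropped


# The four shortest reads from the report above.
SHORT_READS = """>D_2
CTCACTACTACAGAATA
>D_3
ACTATTATGTATGTATAATT
>D_6
CTACTGCTACTACAACTT
>D_8
ATCTGCTAATAATATA
"""

if __name__ == "__main__":
    records = parse_fasta(SHORT_READS)
    print(split_by_length(records, min_read_length=20))
    print(split_by_length(records, min_read_length=8))
```

With a cutoff of 20 this drops D_2, D_6, and D_8; with a cutoff of 8 all four reads are kept.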

Original comment by WanPing....@gmail.com on 10 Nov 2010 at 6:56

GoogleCodeExporter commented 8 years ago
Thank you very much. I think including reads below 20 nt is very important 
because in small RNA cloning and sequencing projects, people usually go as low 
as 15 nt or even lower.

I'm looking forward to the next release with the lower threshold. In the 
meantime, I do need to analyze some reads below 20 nt in the current project. 
Since mosaik is open source, is it possible to tell me how to do a "quick fix" 
of the threshold with the current release?

Thanks a lot!

Original comment by piconano...@gmail.com on 10 Nov 2010 at 8:40

GoogleCodeExporter commented 8 years ago
Hi,

Please check out src/MosaikBuild/MosaikBuild.h line 47.
#define MIN_READ_LENGTH     20

Please also note the hash size in MosaikAligner; the default value is 15.
If the length of a read is shorter than the hash size, then MosaikAligner won't 
align it. 
So, adjusting the hash size may be necessary for your case. The parameter is 
-hs #.
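
To illustrate why a read shorter than the hash size cannot be aligned, here is a minimal sketch based on generic k-mer seeding. It is an assumption about how hash-based aligners work in general, not MosaikAligner's actual code, and the function names are made up for this example.

```python
def seed_count(read_length, hash_size=15):
    """Number of hash_size-long substrings (seeds) a read can provide.

    A read shorter than hash_size yields no seeds, so a hash-based
    aligner has nothing to look up for it.
    """
    return max(0, read_length - hash_size + 1)


def largest_usable_hash_size(read_lengths, default=15):
    """Largest hash size, capped at default, that still gives every read a seed.

    Raises ValueError if read_lengths is empty.
    """
    return min(default, min(read_lengths))


if __name__ == "__main__":
    for length in (17, 15, 14):
        print(length, seed_count(length))
    print(largest_usable_hash_size([12, 18]))
```

Under these assumptions, a 14bp read provides no seeds at the default hash size of 15, so -hs would need to be 14 or lower for it to be considered.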

Smaller hash sizes slightly degrade performance and alignment quality. In 
my tests on the human genome, -hs 15, 14, and 13 showed negligible differences.

The last thing is "-assignQual 99". Could you please set it to 60, which should 
be the highest quality in SAM/BAM? MosaikText will refuse to work with any 
quality higher than 60.
(I should put another sanity checker in MosaikBuild, but I haven't done that.)
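
A sanity check of this kind could look roughly like the sketch below. It is purely hypothetical: the function name and the error message are invented here, and only the limit of 60 comes from this thread.

```python
MAX_ASSIGNED_QUALITY = 60  # upper limit stated in this thread


def check_assigned_quality(value, maximum=MAX_ASSIGNED_QUALITY):
    """Return value unchanged if it lies in [0, maximum], else raise ValueError."""
    if not 0 <= value <= maximum:
        raise ValueError(
            f"assigned quality {value} is outside the range 0..{maximum}"
        )
    return value


if __name__ == "__main__":
    print(check_assigned_quality(60))
    try:
        check_assigned_quality(99)
    except ValueError as err:
        print("rejected:", err)
```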

Please let me know how it works. Any feedback would be greatly appreciated.

Original comment by WanPing....@gmail.com on 10 Nov 2010 at 10:27

GoogleCodeExporter commented 8 years ago
Thanks a lot.

So I changed MIN_READ_LENGTH and started compiling in my RedHat box (CentOS 
release 5).
According to README all I need to do is to type "make" and "make utils". 
When I typed "make" I got a lot of errors. See attached error log.

If possible, please let me know what I need to do to compile it.

Original comment by piconano...@gmail.com on 10 Nov 2010 at 10:51

GoogleCodeExporter commented 8 years ago
It seems that gcc cannot find c++ libraries.
Do you get the same problem without changing MIN_READ_LENGTH?

Original comment by WanPing....@gmail.com on 10 Nov 2010 at 11:02

GoogleCodeExporter commented 8 years ago
Yes. Even without changing the parameter, gcc still reports errors.

Original comment by piconano...@gmail.com on 10 Nov 2010 at 11:53

GoogleCodeExporter commented 8 years ago
It seems to be a system-dependent issue, because I changed the parameter and 
compiled the suite successfully on my openSUSE 11.3 machine.
The problem on my SUSE machine is that I cannot compile the C++ MosaikTools. I 
raised this question in Issue 73 and kindly got your answer. I did what you 
suggested by adding
#include <stdint.h>
#include <stdio.h>
to MosaikAlignment.h.
But I still got errors:
MosaikAlignment.cpp: In static member function ‘static void Mosaik::CSequenceUtilities::Pack(std::string&, const std::string&, const std::string&)’:
MosaikAlignment.cpp:2026:53: warning: array subscript has type ‘char’
MosaikAlignment.cpp:2026:84: warning: array subscript has type ‘char’
cc  -Wall -O3  -c -o fastlz.o fastlz.c
- linking MosaikConversion
g++  -Wall -O3  -c -o MosaikReaderMain.o MosaikReaderMain.cpp

In short, the main MOSAIK programs compile on openSUSE 11.3 but not on 
CentOS 5, while MosaikTools compiles on CentOS but not on openSUSE 11.3.

Original comment by piconano...@gmail.com on 11 Nov 2010 at 12:25

GoogleCodeExporter commented 8 years ago
Can you see two executable files, MosaikConversion and MosaikReaderTest, in 
MosaikTools/c++? If so, I think you compiled it successfully, with two 
warnings.

I'll check MOSAIK on CentOS with our system administrator, who is much more 
familiar with gcc than I am.

Original comment by WanPing....@gmail.com on 11 Nov 2010 at 2:56

GoogleCodeExporter commented 8 years ago
Yes I can see the two executable files. But when I tried to run 
MosaikReaderTest,
I got this error:

ERROR: Found reference sequence tags, but support for reference sequence tags 
has not been implemented yet.

Thank you.

Original comment by piconano...@gmail.com on 11 Nov 2010 at 3:04

GoogleCodeExporter commented 8 years ago
Hi there,

It seems that the input MOSAIK archive is incomplete.
Did the previous step complete without errors?

Best,
Wan-Ping

Original comment by WanPing....@gmail.com on 11 Nov 2010 at 3:30

GoogleCodeExporter commented 8 years ago
My apologies. It turns out that the Mosaik archive was what caused the problem.
So now Mosaik (the main programs and the tools) compiles and runs fine on SUSE.

Original comment by piconano...@gmail.com on 11 Nov 2010 at 3:52

GoogleCodeExporter commented 8 years ago

Original comment by WanPing....@gmail.com on 16 Nov 2010 at 9:38