What steps will reproduce the problem?
1. Download Homo_sapiens.GRCh37.60.dna.toplevel.fa.gz from
ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
2. Run:
MosaikBuild -fr Homo_sapiens.GRCh37.60.dna.toplevel.fa.gz -oa Homo_sapiens.GRCh37.60.dna.toplevel.dat
On x86 (32-bit) builds it fails with std::bad_alloc. It would be wise to catch
this exception in the program and explain the reason for it (a 32-bit process
can address at most 4 GB of RAM).
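To illustrate what I mean by catching the exception, here is a minimal sketch. TryAllocate and its message text are hypothetical names made up for this example and are not part of the MOSAIK source; the real allocation sites in MosaikBuild would need the same kind of handler.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <string>

// Hypothetical helper (not MOSAIK code): tries to allocate numBytes bytes
// into `buffer`. Returns an empty string on success, or a readable error
// message if the allocation throws std::bad_alloc.
std::string TryAllocate(std::size_t numBytes, std::unique_ptr<char[]>& buffer) {
    try {
        buffer.reset(new char[numBytes]);
        return "";
    } catch (const std::bad_alloc&) {
        // Also catches std::bad_array_new_length, which derives from bad_alloc.
        return "ERROR: Unable to allocate " + std::to_string(numBytes) +
               " bytes. A 32-bit build cannot address more than 4 GB of RAM;"
               " please use a 64-bit build for a reference of this size.";
    }
}
```

A caller would print the returned message and exit cleanly instead of letting the exception terminate the program.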
On x64 it fails with:
ERROR: Found duplicate reference sequence names:
- HSCHR4_1
- Y
The problem is that the file contains multiple records for chromosome Y from
different locations. Their header lines differ only in the text that follows >Y
in the FASTA file, but the program does not compare the whole line, only the
part before the first space.
This can be solved (on Windows x64) by changing mSequenceNameRegex in
mosaik-aligner/src/CommonSource/Utilities/RegexUtilities.cpp:
// specifies our sequence name regular expression
regex CRegexUtilities::mSequenceNameRegex("^[+>@]\\s*(.*$)");
I don't know whether this solution will produce correct .bed files after the
sequences are matched. I also don't know how to solve it on Linux.
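The effect of the regex change can be sketched with std::regex as below. This is only an illustration under assumptions: kFirstTokenRegex is my guess at a pattern that stops at the first space (it is not copied from the MOSAIK source), ExtractSequenceName is a hypothetical helper, and the header lines in the usage note are made up rather than taken from the Ensembl file. Only kWholeLineRegex is the pattern proposed above.

```cpp
#include <regex>
#include <string>

// Assumed pattern: captures only the first whitespace-delimited token
// after the leading '+', '>' or '@' marker.
const std::regex kFirstTokenRegex("^[+>@]\\s*(\\S+)");

// Proposed pattern from this report: captures the rest of the line.
const std::regex kWholeLineRegex("^[+>@]\\s*(.*$)");

// Hypothetical helper: returns the captured sequence name, or an empty
// string if the line is not a header line.
std::string ExtractSequenceName(const std::string& headerLine,
                                const std::regex& nameRegex) {
    std::smatch match;
    if (std::regex_search(headerLine, match, nameRegex) && match.size() > 1) {
        return match[1].str();
    }
    return "";
}
```

With two made-up headers such as ">Y region one" and ">Y region two", the first-token pattern yields "Y" for both, which is what triggers the duplicate name error, while the whole-line pattern yields two distinct names.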
Another problem, on Windows 7 x64, is a crash in pthread_join that seems to be
triggered by the progress bar. I think this bug is very similar to
http://bugs.mysql.com/bug.php?id=26564, but I don't know yet how to solve it.
So after one week of trying I was not able to build the desired .dat file.
Original issue reported on code.google.com by kulv...@gmail.com on 24 Nov 2010 at 9:30