MosaikBuild: fails to build dat file from Homo_sapiens.GRCh37.60.dna.toplevel.fa.gz

What steps will reproduce the problem?
1. Download Homo_sapiens.GRCh37.60.dna.toplevel.fa.gz from
   ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
2. Run:
   MosaikBuild -fr Homo_sapiens.GRCh37.60.dna.toplevel.fa.gz -oa Homo_sapiens.GRCh37.60.dna.toplevel.dat

On x86 (32-bit) builds it fails with std::bad_alloc. It would be wise to catch 
this exception and explain its cause in the program (a 32-bit process can 
address at most 4 GB of memory).
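
A minimal sketch of what catching the exception could look like. The helper name allocateOrExplain and the wording of the message are assumptions made for illustration; this is not code from the MosaikBuild sources, and the real allocation site may look quite different.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <string>

// Hypothetical helper (not part of the MosaikBuild sources): tries to allocate
// `bytes` bytes into `out`. Returns an empty string on success, or a
// human-readable explanation if the allocation fails.
std::string allocateOrExplain(std::size_t bytes, std::unique_ptr<char[]>& out) {
    try {
        out.reset(new char[bytes]);
        return std::string();
    } catch (const std::bad_alloc&) {
        std::string message =
            "ERROR: unable to allocate " + std::to_string(bytes) + " bytes.";
        // On a 32-bit build a pointer is 4 bytes wide, so the process cannot
        // address more than 4 GB no matter how much RAM is installed.
        if (sizeof(void*) == 4) {
            message += " This is a 32-bit build, which can address at most 4 GB;"
                       " try a 64-bit build for large references.";
        }
        return message;
    }
}
```

A caller would print the returned message and exit cleanly instead of letting the exception terminate the program.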

On x64 it fails with:
ERROR: Found duplicate reference sequence names:
- HSCHR4_1
- Y

The problem is that the file contains multiple records for the Y chromosome 
from different locations. Their FASTA header lines differ in the text after 
">Y", but the program compares only the part before the first space, not the 
whole line.

This can be solved (on Windows x64) by changing mSequenceNameRegex in 
mosaik-aligner/src/CommonSource/Utilities/RegexUtilities.cpp:

// specifies our sequence name regular expression
regex CRegexUtilities::mSequenceNameRegex("^[+>@]\\s*(.*$)");

I don't know whether this solution will produce correct .bed files after the 
sequences are matched. I also don't know how to solve it on Linux.
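
To illustrate the effect of the regex change, here is a small sketch using std::regex. Several things in it are assumptions: kFirstTokenRegex only stands in for the original pattern (which is not quoted in this report), the two header lines are made up rather than taken from the Ensembl file, and the project's own regex class may not behave exactly like std::regex.

```cpp
#include <regex>
#include <string>

// ASSUMPTION: stand-in for the original behaviour, capturing only the text
// before the first whitespace. The real original pattern is not shown here.
const std::regex kFirstTokenRegex("^[+>@]\\s*(\\S+)");

// The pattern proposed in this report: captures the rest of the header line.
const std::regex kWholeLineRegex("^[+>@]\\s*(.*$)");

// Returns the captured sequence name, or an empty string if the header line
// does not match the pattern.
std::string extractName(const std::string& header, const std::regex& pattern) {
    std::smatch match;
    if (std::regex_search(header, match, pattern) && match.size() > 1) {
        return match[1].str();
    }
    return std::string();
}
```

With two hypothetical headers such as ">Y region_one" and ">Y region_two", the first pattern yields "Y" for both (hence the duplicate-name error), while the second yields "Y region_one" and "Y region_two". Note that names produced this way contain spaces, which may matter for downstream tools.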

Another problem on Windows 7 x64 is a crash in pthread_join, apparently 
triggered by the progress bar. I think this bug is very similar to 
http://bugs.mysql.com/bug.php?id=26564, but I don't yet know how to solve it.

So after one week of trying, I was not able to build the desired dat file.

Original issue reported on code.google.com by kulv...@gmail.com on 24 Nov 2010 at 9:30
