BenLangmead / bowtie

An ultrafast memory-efficient short read aligner

Bowtie hangs when running on very large genomes #124

Closed ferayd closed 2 years ago

ferayd commented 3 years ago

I downloaded the "bread wheat" genome from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/220/415/GCA_002220415.3_Triticum_4.0/GCA_002220415.3_Triticum_4.0_genomic.fna.gz

Then I ran bowtie-build: bowtie-build GCA_002220415.3_Triticum_4.0_genomic.fna WHEAT_JHU4_genome

Then I created a query input file (raw) called reads.txt, with a single sequence in it. For example: AAAAAAAAAAAAAAAAAAAA

Then I ran bowtie: bowtie -x WHEAT_JHU4_genome -r reads.txt

It hangs. If I split the genome into two pieces, it works well for each piece, so I think the problem is caused by the size of the genome.
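A quick way to check the "size of the genome" hypothesis is to count the bases in the FASTA file and compare the total with what a 32-bit offset can address. The sketch below is only illustrative: the helper names are hypothetical, and the exact cutoff at which bowtie-build switches to its large (`.ebwtl`) index format is an assumption here, not something stated in this thread. The value 15397713314 is the `len` reported in the index headers posted later in this thread.

```python
import gzip
import io

# Largest position a 32-bit unsigned offset can address.
# Assumption: this approximates the small-index limit; the real cutoff may differ.
MAX_32BIT_OFFSET = 2**32 - 1


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_bases(fasta_handle):
    """Count sequence characters in a FASTA stream, skipping '>' header lines."""
    total = 0
    for line in fasta_handle:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def needs_64bit_offsets(n_bases):
    """Return True when the sequence is too long for 32-bit offsets."""
    return n_bases > MAX_32BIT_OFFSET


# Tiny in-memory example so the sketch runs without the wheat genome.
demo = io.StringIO(">chr1\nACGT\nAC\n>chr2\nGG\n")
n = count_bases(demo)
print(n, needs_64bit_offsets(n))          # 8 False
print(needs_64bit_offsets(15397713314))   # True
```

For the real file, `count_bases(open_fasta("GCA_002220415.3_Triticum_4.0_genomic.fna.gz"))` would give the total, at the cost of reading the whole genome once.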

I am attaching the verbose output. It shows where bowtie hangs: bowtie_verbose.txt

Thanks

ch4rr0 commented 3 years ago

I am not able to get bowtie-build to successfully build the index, but if I use a bowtie2 index then the alignment runs successfully.

I am investigating why bowtie-build is not able to build the index.

./bowtie-align-l -x ../bowtie2/triticum -r reads.raw
0   +   CM022213.1  49156482    AAAAAAAAAAAAAAAAAAAA    IIIIIIIIIIIIIIIIIIII    21146   
# reads processed: 1
# reads with at least one alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments

blakemeyers commented 2 years ago

I am also wondering about the status of the fix to this problem, as we have multiple large genomes that we need to use - the same scale as wheat (i.e. rye, barley, oat). Any updates on either the original problem (hanging during mapping on the indexed genome) or the second problem (failure to build the index)?

Thank you!

ch4rr0 commented 2 years ago

I am still actively looking into this one, but have yet to figure out the underlying cause of this bug. As a temporary workaround, you can build those indexes with bowtie2 and use them for alignment in bowtie.
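The workaround described above amounts to two commands: build the index with `bowtie2-build`, then point `bowtie -x` at that index. The sketch below only assembles and prints the command lines; it does not run them. The file and index names are placeholders, and `workaround_commands` is a hypothetical helper, not part of either tool.

```python
import shlex


def workaround_commands(fasta, index_base, reads):
    """Return (build, align) argument lists for the bowtie2-index workaround."""
    # bowtie2-build takes the reference FASTA followed by the index basename.
    build = ["bowtie2-build", fasta, index_base]
    # -x names the index and -r marks raw reads, as in the commands in this thread.
    align = ["bowtie", "-x", index_base, "-r", reads]
    return build, align


build, align = workaround_commands("genome.fna", "WHEAT_bt2", "reads.txt")
print(shlex.join(build))
print(shlex.join(align))
```

Passing each list to `subprocess.run(..., check=True)` would execute the steps, assuming both tools are on `PATH`.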

blakemeyers commented 2 years ago

This workaround would only address the secondary issue that you encountered with indexing, but not the original problem of bowtie hanging during the mapping/aligning stage, right? I don't think the problem originally reported was with indexing.

ch4rr0 commented 2 years ago

I am quite confident it is an index issue.

ch4rr0 commented 2 years ago

We have finally tracked down and pushed a fix for this bug to the bug_fixes branch. We thank all of you who have been impacted by this issue for your patience, and are in the process of putting together an official release which will include this change.

./bowtie-build-l GCA_002220415.3_Triticum_4.0_genomic.fna triticum --threads 12 --packed
...
Wrote 4422321688 bytes to primary EBWT file: triticum.1.ebwtl
Wrote 3849428340 bytes to secondary EBWT file: triticum.2.ebwtl
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 15397713314
    bwtLen: 15397713315
    sz: 3849428329
    bwtSz: 3849428329
    lineRate: 7
    linesPerSide: 1
    offRate: 5
    offMask: 0xffffffffffffffe0
    isaRate: -1
    isaMask: 0xffffffff
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 481178542
    offsSz: 3849428336
    isaLen: 0
    isaSz: 0
    lineSz: 128
    sideSz: 128
    sideBwtSz: 112
    sideBwtLen: 448
    numSidePairs: 17184948
    numSides: 34369896
    numLines: 34369896
    ebwtTotLen: 4399346688
    ebwtTotSz: 4399346688
    reverse: 0
...
Wrote 4422321688 bytes to primary EBWT file: triticum.rev.1.ebwtl
Wrote 3849428340 bytes to secondary EBWT file: triticum.rev.2.ebwtl
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 15397713314
    bwtLen: 15397713315
    sz: 3849428329
    bwtSz: 3849428329
    lineRate: 7
    linesPerSide: 1
    offRate: 5
    offMask: 0xffffffffffffffe0
    isaRate: -1
    isaMask: 0xffffffff
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 481178542
    offsSz: 3849428336
    isaLen: 0
    isaSz: 0
    lineSz: 128
    sideSz: 128
    sideBwtSz: 112
    sideBwtLen: 448
    numSidePairs: 17184948
    numSides: 34369896
    numLines: 34369896
    ebwtTotLen: 4399346688
    ebwtTotSz: 4399346688
    reverse: 0
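The header values printed above are internally consistent, which can be checked with a little arithmetic. The sketch below recomputes the derived fields from `len`, `lineRate`, `offRate` and `ftabChars`. The formulas are inferred from the printed numbers rather than taken from the bowtie source, so treat them as assumptions: 2 bits per base, 8-byte table entries (from `ftabSz / ftabLen`), and 16 bytes of per-side overhead (from `sideSz - sideBwtSz`). The function name is hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def derived_headers(length, line_rate=7, off_rate=5, ftab_chars=10,
                    entry_bytes=8, side_overhead=16):
    """Recompute index header fields from the basic parameters (inferred formulas)."""
    bwt_len = length + 1                      # BWT has one extra terminator position
    bwt_sz = ceil_div(bwt_len, 4)             # 2 bits per base, so 4 bases per byte
    side_sz = 1 << line_rate                  # linesPerSide is 1, so sideSz == lineSz
    side_bwt_sz = side_sz - side_overhead     # bytes of BWT payload per side
    num_side_pairs = ceil_div(bwt_sz, 2 * side_bwt_sz)
    num_sides = 2 * num_side_pairs
    ftab_len = 4 ** ftab_chars + 1
    offs_len = ceil_div(bwt_len, 1 << off_rate)
    return {
        "bwtLen": bwt_len,
        "sz": ceil_div(length, 4),
        "bwtSz": bwt_sz,
        "offMask": ~((1 << off_rate) - 1) & 0xFFFFFFFFFFFFFFFF,
        "eftabLen": 2 * ftab_chars,
        "ftabLen": ftab_len,
        "ftabSz": ftab_len * entry_bytes,
        "offsLen": offs_len,
        "offsSz": offs_len * entry_bytes,
        "sideBwtSz": side_bwt_sz,
        "sideBwtLen": side_bwt_sz * 4,
        "numSidePairs": num_side_pairs,
        "numSides": num_sides,
        "ebwtTotLen": num_sides * side_sz,
    }


headers = derived_headers(15397713314)
for name, value in headers.items():
    print(f"{name}: {hex(value) if name == 'offMask' else value}")
```

With `len: 15397713314`, every value this prints matches the corresponding line in the output above.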

ls triticum*.ebwtl
triticum.1.ebwtl  triticum.3.ebwtl  triticum.rev.1.ebwtl
triticum.2.ebwtl  triticum.4.ebwtl  triticum.rev.2.ebwtl

./bowtie-align-l -x triticum -c AAAAAAAAAAAAAAAAAAAA
0   +   CM022213.1  49156482    AAAAAAAAAAAAAAAAAAAA    IIIIIIIIIIIIIIIIIIII    21146   
# reads processed: 1
# reads with at least one alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments

blakemeyers commented 2 years ago

Thank you so much! I really appreciate this. We'll test it out and let you know if we encounter any issues.

ch4rr0 commented 2 years ago

This change is now available in v1.3.1. Thank you for providing sample files and for helping test.