Building index of NCBI's Refseq bacterial genomes

BenLangmead / bowtie2

A fast and sensitive gapped read aligner

GNU General Public License v3.0


Building index of NCBI's Refseq bacterial genomes #431

Open bheimbu opened 1 year ago

bheimbu commented 1 year ago

Hi there,

I'm trying to build a huge index of NCBI's RefSeq bacterial genomes, which is about 97 GB (in fna.gz format). I'm working on an HPC with 512 GB RAM, but the build always dies with an "out-of-memory" error. Is it possible to split the compressed FASTA file into smaller chunks, index them separately, and then concatenate the resulting index files at the end? Or is there another solution (e.g. using more RAM)?

Cheers Bastian

ch4rr0 commented 1 year ago

Hello,

There are a few options available to you:

1. bowtie2-build has a --packed mode that should reduce the memory footprint but is slower than the standard build.
2. Split the FASTA, build indexes with the resulting files, and run separate alignments against each index. N.B. indexes cannot be merged.
3. Use a node with more memory.
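
For the "split the FASTA" option, here is a minimal sketch of one way to cut a large (optionally gzip-compressed) FASTA into chunks without breaking any record apart. The function names, the `chunk_NNNN.fna` naming scheme, and the `max_bases_per_chunk` threshold are assumptions made for illustration; none of this is part of bowtie2.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def split_fasta(in_path, out_dir, max_bases_per_chunk):
    """Split a FASTA file into chunks, keeping every record whole.

    A new chunk is started at a record header once the current chunk
    already holds at least max_bases_per_chunk bases, so a chunk can
    exceed the threshold by up to one record.
    Returns the list of chunk paths written (hypothetical naming scheme).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    bases_in_chunk = 0
    with open_text(in_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Record boundary: decide whether to roll over to a new chunk.
                if out is None or bases_in_chunk >= max_bases_per_chunk:
                    if out is not None:
                        out.close()
                    chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fna"
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "wt")
                    bases_in_chunk = 0
            elif out is None:
                continue  # ignore anything before the first header
            else:
                bases_in_chunk += len(line.strip())
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk could then be indexed on its own (for example `bowtie2-build chunk_0000.fna chunk_0000`) and the reads aligned against every resulting index in a separate run, since the indexes cannot be merged.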

bheimbu commented 1 year ago

Thanks for your reply. I'll try to use --packed and see how it goes.

Cheers Bastian

JSSaini commented 4 months ago

Hello, I guess there may be many genomes in the NCBI collection that are very similar or even identical. How does bowtie2 perform read assignment in this case? Does it randomly assign each read to one sequence from the pool of identical sequences, or does it distribute the reads equally across all identical sequences? Thank you. I know that in an ideal scenario it is good to dereplicate the genomes first.

ch4rr0 commented 4 months ago

Hello,

bowtie2 will choose the alignment with the highest alignment score. If several alignments share that highest score, it will choose one of them at random. I hope this helps.
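
To make the rule described above concrete, here is a small illustrative sketch in Python. It is not bowtie2's actual implementation; the function name `pick_alignment` and the `(reference_name, score)` tuples are hypothetical and only model "keep the best score, break ties at random".

```python
import random


def pick_alignment(candidates, rng=None):
    """Pick one alignment from a list of (reference_name, score) tuples.

    The highest score wins; if several candidates share that score,
    one of them is drawn uniformly at random. Returns None for an
    empty candidate list.
    """
    if not candidates:
        return None
    rng = rng or random.Random()
    best_score = max(score for _, score in candidates)
    tied = [c for c in candidates if c[1] == best_score]
    return rng.choice(tied)


# Hypothetical example: two identical genomes give the same top score,
# so a read hitting both is reported against only one of them per call.
hits = [("genome_A", -6), ("genome_B", 0), ("genome_C", 0)]
chosen = pick_alignment(hits, random.Random(0))
```

Under this model, each read with tied best hits is reported at one randomly chosen location, so over many reads the counts tend to spread across identical sequences rather than being deliberately split among them.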