DaehwanKimLab / centrifuge

Classifier for metagenomic sequences

What is the memory usage for building a custom database? #129

Open. Confurious opened this issue 6 years ago

Confurious commented 6 years ago

I read that one of the advantages of Centrifuge is that it requires less space and memory than Kraken. I am wondering whether this is also true for the index-building step. What is the rough ratio between the size of the FASTA database (assuming no compression) and the amount of memory required? How much memory was required to build an index on the NCBI nt database? Thanks!

mourisl commented 6 years ago

From my experience, you need a little more than 3 times as much memory as the FASTA file's size.
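As a quick back-of-the-envelope check (a sketch only, assuming the ~3x rule above and GNU coreutils' `stat`; `db.fa` is a placeholder path):

```sh
# Rough peak-RAM estimate for centrifuge-build, based on the ~3x observation above.
# Assumes GNU stat (-c %s prints file size in bytes); db.fa is a placeholder path.
FASTA=db.fa
BYTES=$(stat -c %s "$FASTA")
echo "Estimated peak RAM: $(( BYTES * 3 / 1024 / 1024 / 1024 )) GiB"
```

For a 500 GB FASTA, that works out to roughly 1.5 TB of RAM.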

Confurious commented 6 years ago

Thanks! I am trying to index a rather large database (>500 GB); has anyone had experience with this type of task? I tried with Kraken, but it did not finish after many weeks.


shlomobl commented 6 years ago

Hi, so this explains why I got stuck when trying to index the bacterial WGS database, including draft genomes (~470 GB), with 125 GB of memory? It would be cool to be able to download the indexes instead... I work in veterinary microbiology, and there are many bacterial species with only draft genomes.

Confurious commented 6 years ago

I am trying to index something >500 GB with 3 TB of memory; so far it has been 7 days and it is not done yet (maybe not even halfway?). Yes, if someone has done it, it would be great to share, although I suspect it would not be easy to share something that big.


mourisl commented 6 years ago

Did you run it with multiple threads?
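For reference, a multi-threaded build looks like this (a hedged sketch; the file names are placeholders, and `-p` sets the thread count per the Centrifuge manual):

```sh
# Sketch of a multi-threaded custom-index build; all input paths are placeholders.
# -p sets the number of build threads; the taxonomy files come from the NCBI dumps.
centrifuge-build -p 32 \
    --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp \
    input-sequences.fna custom_index
```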

Confurious commented 6 years ago

Yes, I used 32. I didn't know the node had 700 CPUs; I would probably use more if it fails this time. It has been 10 days now.
