bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Creating a database that contains ALL genomes of RefSeq #81

Closed ilnamkang closed 6 years ago

ilnamkang commented 6 years ago

Hi,

Is it possible to create a database that contains ALL NCBI RefSeq genomes, irrespective of completeness?

It seems that makeDB.sh is designed to download only complete genomes from RefSeq when the '-r' option is set.

Can I modify makeDB.sh to make it download all RefSeq genomes including draft genomes? If possible, how can I do that?

Or is it impractical to use all RefSeq genomes for some reason, such as the memory requirement?

Thanks.

fbucchini commented 6 years ago

Hi,

Not the developer, but I think I can help. The list of RefSeq genomes (when using the -r option) is indeed filtered. Here is what happens:

  1. Download the assembly summaries for archaea and bacteria,
  2. Filter them to keep only complete genomes (latest version) and write the list of URLs to download from to a file (downloadlist.txt),
  3. Use the retrieved URLs to download the genomes...

The filtering is done on line 271 of makeDB.sh using awk, as follows:

awk 'BEGIN{FS="\t";OFS="/"}$12=="Complete Genome" && $11=="latest"{l=split($20,a,"/");print $20,a[l]"_genomic.gbff.gz"}' assembly_summary.bacteria.txt assembly_summary.archaea.txt > downloadlist.txt

So, simply modifying this line to suit your needs should do the trick!
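As a minimal sketch of one such modification, the version below drops the $12=="Complete Genome" test, so every assembly whose version status ($11) is "latest" is kept regardless of assembly level. The mock_row helper, the mock file names, and the example.invalid paths are made up for illustration, so the filter can be tried without downloading anything. In makeDB.sh the input would still be the two assembly_summary files and the output downloadlist.txt.

```shell
# Hypothetical helper: print one tab-separated, 20-column row shaped like an
# assembly summary line. Only the three columns the filter reads are filled:
# $11 (version status), $12 (assembly level), $20 (FTP path).
mock_row() {
  awk -v status="$1" -v level="$2" -v path="$3" \
    'BEGIN { OFS = "\t"; $20 = path; $11 = status; $12 = level; print }'
}

# Mock input (made-up accessions and host), standing in for
# assembly_summary.bacteria.txt and assembly_summary.archaea.txt.
{
  mock_row latest   "Complete Genome" "ftp://example.invalid/genomes/GCF_000000001.1_ASM1v1"
  mock_row latest   "Contig"          "ftp://example.invalid/genomes/GCF_000000002.1_ASM2v1"
  mock_row replaced "Scaffold"        "ftp://example.invalid/genomes/GCF_000000003.1_ASM3v1"
} > mock_summary.txt

# Same awk program as the original line, minus the $12=="Complete Genome" test.
awk 'BEGIN{FS="\t";OFS="/"}$11=="latest"{l=split($20,a,"/");print $20,a[l]"_genomic.gbff.gz"}' \
  mock_summary.txt > downloadlist.mock.txt

cat downloadlist.mock.txt
```

With the mock rows above, the two "latest" assemblies are listed and the "replaced" one is skipped. To keep only some draft levels instead, the $12 test could be widened rather than removed.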

Regarding memory usage, the more sequences in the index, the more memory you will need. I sometimes use the NCBI non-redundant protein database (including all eukaryotes -- so about 150M protein sequences), and I get Kaiju to run using ~85GB of memory.

Hope it helps!

pmenzel commented 6 years ago

Thanks @fbucchini, you described it perfectly. :)

I would add that the memory requirement will increase a lot, so it's probably easier to just use the NR database with makeDB.sh -n, which should contain all of the genes from RefSeq already. Or use the proGenomes database, which contains representative proteins after clustering all bacterial genomes.

ilnamkang commented 6 years ago

Thank you for your help.

It works nicely. After removing the condition for $12 from line 271, 'makeDB.sh -r' starts to download nearly all RefSeq genomes.
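To see what removing the $12 condition lets through, one option is to tally the "latest" entries by assembly level before downloading. This is only a sketch on made-up rows: mock_levels.txt and level_counts.txt are hypothetical names, and on a real run the input would be the assembly_summary files that makeDB.sh downloads.

```shell
# Build a mock summary: 20 tab-separated columns per row, with only
# $11 (version status) and $12 (assembly level) filled in.
awk 'BEGIN {
  OFS = "\t"
  n = split("latest|Complete Genome,latest|Contig,latest|Contig,replaced|Scaffold", rows, ",")
  for (i = 1; i <= n; i++) {
    split(rows[i], f, "|")
    $0 = ""; $20 = "-"; $11 = f[1]; $12 = f[2]
    print
  }
}' > mock_levels.txt

# Count "latest" assemblies per assembly level, most frequent first.
awk 'BEGIN{FS="\t"} $11=="latest"{c[$12]++} END{for (k in c) print c[k], k}' \
  mock_levels.txt | sort -rn > level_counts.txt

cat level_counts.txt
```

The "replaced" row is not counted, which matches the remaining $11=="latest" condition in the filter.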

Unfortunately, I couldn't check the memory requirement because I don't have enough HDD space to store all the genomes temporarily. By a rough calculation, the 'genomes' directory would be 400-500 GB if all RefSeq genomes were downloaded. If I can buy a new HDD, I'll try it.
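A back-of-the-envelope estimate like this can be scripted by multiplying the number of URLs in the download list by an assumed average file size. The sketch below is illustrative only: estimate_gb is a hypothetical helper, mock_list.txt is made up, and the 3 MB average is an arbitrary placeholder rather than a measured value.

```shell
# Hypothetical helper: estimate_gb LIST_FILE AVG_MB
# Multiplies the number of non-empty lines in LIST_FILE (one URL per line)
# by an assumed average compressed file size in MB, and prints the total in GB.
estimate_gb() {
  n=$(grep -c . "$1")
  awk -v n="$n" -v mb="$2" 'BEGIN { printf "%.1f\n", n * mb / 1024 }'
}

# Mock list of 2048 placeholder URLs standing in for downloadlist.txt.
awk 'BEGIN { for (i = 1; i <= 2048; i++) print "ftp://example.invalid/genome_" i }' > mock_list.txt

# 2048 files at an assumed 3 MB each
estimate_gb mock_list.txt 3
```

On a real run the first argument would be downloadlist.txt, and a more realistic average could be taken from a handful of files that have already been downloaded.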