Closed by ilnamkang 6 years ago
Hi,
Not the developer, but I think I can help. The list of RefSeq genomes (when using the -r option) is indeed filtered. Here is what happens: the list of genome files to download is written to a file (downloadlist.txt). The filtering is done on line 271 of makeDB.sh using awk, as follows:
awk 'BEGIN{FS="\t";OFS="/"}$12=="Complete Genome" && $11=="latest"{l=split($20,a,"/");print $20,a[l]"_genomic.gbff.gz"}' assembly_summary.bacteria.txt assembly_summary.archaea.txt > downloadlist.txt
So, simply modifying this line to suit your needs should do the trick!
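To make the effect of the filter concrete, here is a minimal sketch that runs the quoted awk command and a relaxed variant on a tiny made-up table. The sample rows, the example.org paths and the make_row helper are hypothetical; only columns 11, 12 and 20 matter to the filter. Dropping the $12 test is just one possible modification, not necessarily the one the maintainers would recommend.

```shell
# Hypothetical helper: emit one tab-separated row with 20 columns.
# $1 = column 11, $2 = column 12, $3 = column 20 (the fields the filter reads).
make_row() {
  printf 'x\tx\tx\tx\tx\tx\tx\tx\tx\tx\t%s\t%s\tx\tx\tx\tx\tx\tx\tx\t%s\n' "$1" "$2" "$3"
}

tmp=$(mktemp -d)
{
  make_row latest   "Complete Genome" "ftp://example.org/genomes/GCF_000000001.1_ASM1v1"
  make_row latest   "Scaffold"        "ftp://example.org/genomes/GCF_000000002.1_ASM2v1"
  make_row replaced "Contig"          "ftp://example.org/genomes/GCF_000000003.1_ASM3v1"
} > "$tmp/assembly_summary.sample.txt"

# Filter as quoted in the thread: $12 must be "Complete Genome" and $11 must be "latest".
complete_only=$(awk 'BEGIN{FS="\t";OFS="/"}$12=="Complete Genome" && $11=="latest"{l=split($20,a,"/");print $20,a[l]"_genomic.gbff.gz"}' "$tmp/assembly_summary.sample.txt")

# Relaxed variant (assumption): keep every row whose $11 is "latest", whatever $12 says.
all_latest=$(awk 'BEGIN{FS="\t";OFS="/"}$11=="latest"{l=split($20,a,"/");print $20,a[l]"_genomic.gbff.gz"}' "$tmp/assembly_summary.sample.txt")

printf 'complete only:\n%s\n\nall latest:\n%s\n' "$complete_only" "$all_latest"
rm -rf "$tmp"
```

On the sample, the original filter keeps only the first row, while the relaxed variant also keeps the "Scaffold" row; the row whose column 11 is not "latest" is skipped in both cases.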
Regarding memory usage, the more sequences in the index, the more memory you will need. I sometimes use the NCBI non-redundant protein database (including all eukaryotes -- so about 150M protein sequences), and I get Kaiju to run using ~85GB of memory.
Hope it helps!
Thanks @fbucchini, you described it perfectly. :)
I would add that the memory requirement will increase a lot, so it's probably easier to just use the NR database with makeDB.sh -n, which should contain all of the genes from RefSeq already.
Or use the proGenomes database, which contains representative proteins after clustering all bacterial genomes.
Thank you for your help.
It works nicely. After removing the condition for $12 from line 271, 'makeDB.sh -r' starts to download nearly all RefSeq genomes.
Unfortunately, I couldn't check the memory requirement because I don't have enough HDD space to store all the genomes temporarily. By a simple calculation, the size of the 'genomes' directory would be 400-500 GB if all RefSeq genomes were downloaded. If I can buy a new HDD, I'll try.
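A rough estimate like the one above can be reproduced before downloading anything by counting the entries in the download list and multiplying by an assumed average file size. The sketch below uses a hypothetical estimate_gb helper and a generated stand-in list; the average size per file is a guess you have to supply yourself, not a measured value.

```shell
# Hypothetical helper: estimate the total download size in GB.
# $1 = path to the download list, $2 = assumed average size per file in MB.
estimate_gb() {
  n=$(grep -c . "$1")   # number of non-empty lines, i.e. files to fetch
  awk -v n="$n" -v mb="$2" 'BEGIN{printf "%.1f\n", n * mb / 1024}'
}

# Stand-in list with 2048 entries (a real run would point at downloadlist.txt).
list=$(mktemp)
awk 'BEGIN{for(i=1;i<=2048;i++) print "file" i}' > "$list"

estimate=$(estimate_gb "$list" 5)   # assuming 5 MB per file
echo "estimated size: ${estimate} GB"
rm -f "$list"
```

With 2048 entries at an assumed 5 MB each, the sketch reports 10.0 GB; substituting the real list and a measured average from a few downloaded files would give a more meaningful figure.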
Hi,
Is it possible to create a database that contains ALL genomes in NCBI RefSeq, irrespective of completeness?
It seems that makeDB.sh is designed to download only complete genomes from RefSeq when the '-r' option is set.
Can I modify makeDB.sh to make it download all RefSeq genomes, including draft genomes? If so, how can I do that?
Is it impractical to use all RefSeq genomes for reasons such as the memory requirement?
Thanks.