fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

krakenuniq-build missing parameter #36

Closed AlessioMilanese closed 5 years ago

AlessioMilanese commented 5 years ago

Hi,

I installed the latest version of KrakenUniq (on 22nd November, 80ac242). In README.md, line 42, I see:

Use krakenuniq-build --generate-taxonomy-ids-for-sequences ... to add pseudo-taxonomy ...

But there is no --generate-taxonomy-ids-for-sequences option in krakenuniq-build.

Also, instead of (or in addition to) describing the differences from Kraken, it would be really helpful to describe the steps for going from installation to profiling a sample. For example:

  1. download the tool
    git clone https://github.com/fbreitwieser/krakenuniq
  2. install
    ./install_krakenuniq.sh
  3. download and build the database
    krakenuniq-build --threads 40 --standard --db DB
  4. run on samples ...

gexijin commented 5 years ago

I agree that documentation for KrakenUniq would be very helpful. Many users have never used the previous versions. I am also having trouble building the database. I first ran:

    krakenuniq-download --db DB --taxa "archaea,bacteria,viral,fungi,protozoa,helminths" --dust --exclude-environmental-taxa nt

And then tried to build the database with:

    krakenuniq-build --standard --threads 80 --download-taxonomy --db DB

My computer ran for over a day, but only produced a database.jdb file.

fbreitwieser commented 5 years ago

Dear @AlessioMilanese and @gexijin, thank you for your feedback. I am now working on updating the README and MANUAL. I hope it will be easier to follow very soon!

@gexijin, building the database takes quite a while, especially for the nt database.

fbreitwieser commented 5 years ago

@AlessioMilanese, to answer your original question: the parameters in question have been renamed to --taxids-for-genomes and --taxids-for-sequences. Let me know if you have further questions.
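
One way to confirm which option names an installed copy accepts is to search its help text. The sketch below assumes that `krakenuniq-build --help` prints the list of options (this is not confirmed anywhere in this thread); `has_option` is a hypothetical helper, not part of KrakenUniq.

```shell
# Hypothetical helper: succeed if the given help text mentions the given option.
#   $1 = help text, $2 = option name (for example --taxids-for-sequences)
has_option() {
  printf '%s\n' "$1" | grep -qF -e "$2"
}

# Example use (assumes krakenuniq-build is on PATH and that --help lists options):
#   help_text=$(krakenuniq-build --help 2>&1)
#   has_option "$help_text" --taxids-for-sequences && echo "renamed option is available"
```

The check is a plain substring match, so it only tells you that the name appears somewhere in the help output.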

gexijin commented 5 years ago

Alessio, it would be great if you could build a database for users to download. We can now download 100 GB files easily, and there are public repositories such as Zenodo (50 GB per file) and Figshare (20 GB per file). Thanks.

AlessioMilanese commented 5 years ago

Hi @gexijin, with the NCBI data from 23 November 2018, the database is 179 GB in size (you will need the same amount of RAM to run it). Note that I am not a developer of KrakenUniq, so I would leave the upload to a public repository to one of the developers. (It is possible to upload a 200 GB file to Zenodo by asking for an increased disk quota.)
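
The rule of thumb above (RAM at least as large as the database) can be checked before starting a run. This is a minimal sketch, assuming a Linux host with GNU `du` and `/proc/meminfo`; `db_fits_in_ram` is a hypothetical helper and `DB` is a placeholder directory name.

```shell
# Hypothetical helper: succeed if a database of $1 bytes fits in $2 bytes of RAM.
db_fits_in_ram() {
  [ "$1" -le "$2" ]
}

# Example use on Linux (assumes GNU du and /proc/meminfo; DB is a placeholder):
#   db_bytes=$(du -sb DB | cut -f1)
#   ram_bytes=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) * 1024 ))
#   db_fits_in_ram "$db_bytes" "$ram_bytes" || echo "database is larger than RAM"
```

`MemTotal` is reported in kB, hence the multiplication by 1024 to compare in bytes.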

fbreitwieser commented 5 years ago

Updated the README. If there are more questions, problems with the build or classification, or feature requests, please feel free to open another issue with a description of the problem.

We may provide databases on a regular basis in the future. For now, you may download the three databases described in the manuscript at ftp://ftp.ccb.jhu.edu/pub/software/krakenuniq/Databases/

gexijin commented 5 years ago

Hi Florian, the updated README is very helpful! Thank you. Can you double-check whether this line downloads the human genome as well as UniVec and EmVec? It looks like only the human genome is downloaded.

Contaminant sequences from UniVec and EmVec, plus the human reference genome

    krakenuniq-download --db DBDIR refseq/vertebrate_mammalian/Chromosome/species_taxid=9606

Also, for identifying microbes in human RNA-seq data, I found it very helpful to include human transcript sequences. Is there an easy way to do that?