DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Database download for Centrifuge #242

Open ramnageena11 opened 1 year ago

ramnageena11 commented 1 year ago

Hi, I ran the command below to download the Centrifuge database:

centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

The command has been running for the last 7 days. The current status is:

Progress : [##-----------------------] 5% 1525/27880environment: line 28: dustmasker: command not found
Progress : [##-----------------------] 5% 1526/27880environment: line 28: dustmasker: command not found
Progress : [##-----------------------] 5% 1527/27880environment: line 28: dustmasker: command not found
Progress : [##-----------------------] 5% 1528/27880

Please advise: do I need to kill the command, or is it fine?

Thanks RNS

ramnageena11 commented 1 year ago

Hi all, please comment and advise.

Thanks RNS

fanninpm commented 1 year ago

What is the output of command -v dustmasker?

ramnageena11 commented 1 year ago

Please find attached a screenshot of the status: [image: image.png]

Please advise: shall I kill the command?

Thanks and regards,
Ram
Ram Nageena Singh, Ph.D (Microbiology)

fanninpm commented 1 year ago

The attachment was scrubbed. Please log in to GitHub to attach it. Alternatively, you can copy and paste the text output from the terminal (the trick is to add the SHIFT key when copying/pasting from a terminal application).
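
Because the original command redirects standard output into seqid2taxid.map, the progress and error lines seen on screen are evidently written to the other stream, standard error. A minimal sketch of capturing them in a file so the text can be pasted into an issue follows. The names fake_download and run.log are stand-ins invented for this example and are not part of Centrifuge.

```shell
# Stand-in for a long-running tool: data goes to stdout, messages to stderr.
fake_download() {
    printf 'seq1\t12345\n'                        # would go into the map file
    echo "dustmasker: command not found" >&2      # would appear on the terminal
}

workdir="$(mktemp -d)"

# Keep stdout and stderr in separate files.
fake_download > "$workdir/seqid2taxid.map" 2> "$workdir/run.log"

# Show the last few messages as plain text, ready to copy.
tail -n 5 "$workdir/run.log"
```

With a real run, the same redirection (2> run.log) would be appended to the actual command line.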

ramnageena11 commented 1 year ago

Hi, I have opened an issue on the GitHub page (#242). Please see that.

Thanks and regards,
Ram
Ram Nageena Singh, Ph.D (Microbiology)

fanninpm commented 1 year ago

[image]

ramnageena11 commented 1 year ago

Hi, please see the terminal status below. It has now been running for more than 15 days.

Progress : [#######---------------------------------] 19% 5462/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5463/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5464/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5465/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5466/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5467/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5468/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5469/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5470/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5471/27880environment: line 28: dustmasker: command not found
Progress : [#######---------------------------------] 19% 5472/27880environment: line 28: dustmasker: command not found

fanninpm commented 1 year ago

Feel free to kill the process and add dustmasker.

ramnageena11 commented 1 year ago

Hi, the following command is running:

centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

Is there any other script to add dustmasker?

Thanks RNS

fanninpm commented 1 year ago

How did you install centrifuge?

ramnageena11 commented 1 year ago

Hi, I installed it using:

conda install -c bioconda centrifuge
Collecting package metadata (current_repodata.json): done
Solving environment: done

Package Plan

environment location: /home/majorram/anaconda3/envs/diversity

added / updated specs:

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
centrifuge-1.0.4_beta      |py36pl526he941832_2         3.9 MB  bioconda
------------------------------------------------------------
                                       Total:         3.9 MB

The following NEW packages will be INSTALLED:

centrifuge bioconda/linux-64::centrifuge-1.0.4_beta-py36pl526he941832_2
perl conda-forge/linux-64::perl-5.26.2-h36c2ea0_1008

Proceed ([y]/n)? y

Downloading and Extracting Packages
centrifuge-1.0.4_bet | 3.9 MB | ######################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

fanninpm commented 1 year ago

If you want to make a database from scratch, then you also need BLAST, Jellyfish, and MUMmer.

# after activating your conda virtual environment
conda install -c bioconda blast
conda install -c bioconda kmer-jellyfish
conda install -c bioconda mummer

However, if you want to use a pre-built database, then you don't need those three pieces of software.

(By the way, if you don't mind working with a database a few years out of date, Ben Langmead has a GitHub Pages website with links to pre-built Centrifuge databases.)

ramnageena11 commented 1 year ago

Hi, I have blast in the Conda base environment and mummer in a separate environment. I will install these in the current environment.

I would prefer an up-to-date database. Thanks for the suggestion.

RNS

ramnageena11 commented 1 year ago

Shall I run this command after installing all three packages?

centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

What was the issue with "dustmasker: command not found"?

thanks RNS

fanninpm commented 1 year ago

What was the issue with "dustmasker: command not found"?

In order to build a database from scratch, the dustmasker tool is necessary. The database builder couldn't find the dustmasker command in $PATH, so it printed that warning to the screen.

You may have noticed that "dustmasker" was not in the names of the three packages I mentioned. The dustmasker command is found in the blast package.
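
A quick pre-flight check along these lines can catch a missing tool before a multi-day download starts. This is only a sketch: missing_tools is a helper name made up for this example, and the tool list is taken from this thread rather than from Centrifuge's documentation.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Tool names taken from this thread; dustmasker ships with the blast package.
missing="$(missing_tools centrifuge-download dustmasker make)"
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
else
    echo "All tools found."
fi
```

If dustmasker shows up in the list, installing blast into the active environment should provide it.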

Shall I run this command after installing all three packages?

centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

You can. However, it may be slightly more convenient to use the Makefile. If make is not accessible from your environment, you can install it from conda-forge:

conda install -c conda-forge make

To see what the Makefile can do, you can invoke it without setting any of its options (note that the -C flag tells make where the Makefile is in your conda environment):

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices

You might be looking for something like this, which is similar to the p_compressed+h+v database:

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices \
    THREADS=0 IDX_NAME='p_compressed+v' ANY_LEVEL_GENOMES='viral' COMPLETE_GENOMES_COMPRESSED='archaea bacteria'

(Make sure to replace THREADS=0 with the number of threads you're working with.)
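
If you are unsure how many threads the machine offers, something like the following may help. It is a sketch that assumes getconf is available (it usually is on Linux and macOS) and falls back to a single thread otherwise. The variable name threads is chosen here for illustration.

```shell
# Ask the system how many processors are online; fall back to 1 if unknown.
threads="$(getconf _NPROCESSORS_ONLN 2> /dev/null || echo 1)"
echo "THREADS=$threads"
```

The value could then be passed to the Makefile as THREADS="$threads" in place of THREADS=0. On a shared machine you may prefer a smaller number.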

ramnageena11 commented 1 year ago

I am new to the environment. thanks for explaining it.

I have "make" in my environment.

make makeconv makembindex make-ssl-cert
makeblastdb make-first-existing-target makeprofiledb mako-render

Do I need to specify values for dirname and command in the following?

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices

Thanks RNS

fanninpm commented 1 year ago

dirname is a command that is part of the GNU coreutils. If you want to know what it does, try running man dirname.

With the power of command substitution, I use the dirname command several times to get the location of the conda environment. You can try it yourself:

command -v centrifuge
dirname $(command -v centrifuge)
dirname $(dirname $(command -v centrifuge))

I used this to help the make command find the appropriate Makefile. If the -C flag isn't specified and there isn't a Makefile in the current working directory, make lets you know that it can't do anything:

$ make
make: *** No targets specified and no makefile found.  Stop.
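
To see the nested dirname calls in isolation, here is a small sketch with a hypothetical path. Both the path and the helper name env_prefix are invented for illustration. On a real system the path would come from command -v centrifuge.

```shell
env_prefix() {
    # First dirname drops the file name, second drops the bin directory.
    dirname "$(dirname "$1")"
}

example='/home/user/anaconda3/envs/diversity/bin/centrifuge'   # hypothetical path

dirname "$example"        # prints /home/user/anaconda3/envs/diversity/bin
env_prefix "$example"     # prints /home/user/anaconda3/envs/diversity
```

Appending /share/centrifuge/indices to the second result gives the directory that make -C is pointed at.
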

ramnageena11 commented 1 year ago

Thanks for explaining. I appreciate it.

ramnageena11 commented 1 year ago

I have run the command

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices \
    THREADS=0 IDX_NAME='p_compressed+v' ANY_LEVEL_GENOMES='viral' COMPLETE_GENOMES_COMPRESSED='archaea bacteria'

with 20 threads.

fanninpm commented 1 year ago

Good luck. It may still take a long time.

ramnageena11 commented 1 year ago

I am getting this error:

Error downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/865/GCF_000839865.1_ViralProj14134 /GCF_000839865.1_ViralProj14134 _genomic.fna.gz!

Error downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/867/225/GCF_000867225.2_ViralMultiSegProj16738 /GCF_000867225.2_ViralMultiSegProj16738 _genomic.fna.gz!
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
basename: extra operand ‘_genomic.fna.gz

fanninpm commented 1 year ago

People have encountered this problem in the past (see #201).
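
One plausible reading of the "extra operand" message, offered as an assumption rather than something confirmed from the centrifuge-download source: the assembly name contains a space, and an unquoted expansion splits the path into several words before basename sees it. A small sketch with a made-up URL and a made-up helper count_args:

```shell
# Report how many separate arguments the shell passed in.
count_args() { echo "$#"; }

# Hypothetical URL with spaces in the same places as in the error above.
url='https://example.org/all/GCF_0000.1_Demo /GCF_0000.1_Demo _genomic.fna.gz'

count_args $url      # unquoted: the shell splits it into 3 words
count_args "$url"    # quoted: stays 1 word
basename "$url"      # quoted: prints GCF_0000.1_Demo _genomic.fna.gz
```

With three words instead of one, basename receives more operands than it accepts, which would match the message shown.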

Here's my attempt at fixing it (that may or may not work):

pushd "$(dirname $(command -v centrifuge-download))"
if command -v curl &> /dev/null; then
    curl https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff > patch.diff
elif command -v wget &> /dev/null; then
    wget -O patch.diff https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff
fi
if [[ -f patch.diff ]]; then
    patch -p0 <patch.diff
else
    echo "didn't download patch!"
fi
popd

A few notes:

  1. Here, pushd and popd are shell built-in commands that are a bit like cd but also manipulate the directory stack. (You can use the dirs command to see what's currently in the directory stack.)
  2. GitHub has a nice feature: you can append .patch or .diff to a URL to get Git's plaintext view. Here, I'm using it on the page that compares the master branch to the v1.0.4 release.
  3. I didn't know if you had curl or wget (or neither), so I used Bash's built-in control flow "commands" to prepare for every eventuality. If you are using another shell (such as zsh), feel free to adapt the control flow for that purpose. (Confused about bash? I'd recommend taking time to read through some of man bash. Be aware that Bash's manpage is really long, so you may want to use the / key to search for certain keywords.)

fanninpm commented 1 year ago

Also, you may want to use make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices clean to clean up dirty directories.

ramnageena11 commented 1 year ago

Shall I wait for the running script to finish, or kill it? Thanks

fanninpm commented 1 year ago

I'd recommend killing it, then cleaning up what it generated so far.

ramnageena11 commented 1 year ago

ok Thanks

ramnageena11 commented 1 year ago

will do it, and proceed as you suggested,

RNS

ramnageena11 commented 1 year ago

Hi, I have wget but not curl.

I ran the whole script but got another error:

pushd "$(dirname $(command -v centrifuge-download))" if command -v curl &> /dev/null; then curl https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff > patch.diff elif command -v wget &> /dev/null; then wget -O patch.diff http://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff fi if [[ -f patch.diff ]]; then patch -p0 <patch.diff else echo "didn't download patch!" fi popd

bash: syntax error near unexpected token `then'

fanninpm commented 1 year ago

Did the patch download? I'm trying to isolate where my attempt went wrong.

ramnageena11 commented 1 year ago

No, nothing happened. Is the script below a single script, or three or four separate ones?

pushd "$(dirname $(command -v centrifuge-download))" if command -v curl &> /dev/null; then curl https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff > patch.diff elif command -v wget &> /dev/null; then wget -O patch.diff https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff fi if [[ -f patch.diff ]]; then patch -p0 <patch.diff else echo "didn't download patch!" fi popd

fanninpm commented 1 year ago

What's the output of dirs?

ramnageena11 commented 1 year ago

I did not understand your question.

I ran the script as a single command and got this error:

bash: syntax error near unexpected token `then'

fanninpm commented 1 year ago

Run dirs. What is printed to the screen? (This is so I can determine the current working directory and the directory stack.)

ramnageena11 commented 1 year ago

This is the output:

(diversity) majorram@majorram-gilbert:~$ dirs
~

fanninpm commented 1 year ago

I think I know what went wrong. When you copied and pasted into your terminal, the newlines were somehow lost. Let's run the commands one at a time.

Command 1: Let's switch to the directory that contains centrifuge-download.

pushd "$(dirname $(command -v centrifuge-download))"

Command 2 (simplified from last time): Let's download the patch from GitHub.

wget -O patch.diff https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff

Command 3 (simplified from last time): Let's apply the patch that we just downloaded.

patch -p0 <patch.diff

Command 4: Let's get back to where you were before.

popd
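
To illustrate why lost newlines matter, here is a self-contained sketch, assuming bash is installed. The file names steps.sh and collapsed.sh are invented for the example. It writes a short multi-line script to a file, makes a copy with every newline turned into a space, and asks bash to parse both without running them.

```shell
workdir="$(mktemp -d)"

# A small multi-line script, written to a file instead of pasted.
cat > "$workdir/steps.sh" <<'EOF'
echo "start"
if true; then
    echo "yes"
fi
EOF

# Same text with every newline turned into a space, as a bad paste would do.
tr '\n' ' ' < "$workdir/steps.sh" > "$workdir/collapsed.sh"

# bash -n parses a script without executing it.
bash -n "$workdir/steps.sh" && echo "steps.sh parses"
bash -n "$workdir/collapsed.sh" 2> /dev/null || echo "collapsed.sh has a syntax error"
```

Saving a multi-line snippet to a file and running it with bash is one way to sidestep paste problems altogether.
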

ramnageena11 commented 1 year ago

okay Thanks RNS

ramnageena11 commented 1 year ago

Output for command 1:

$ pushd "$(dirname $(command -v centrifuge-download))"
~/anaconda3/envs/diversity/bin ~

Output for command 2:

wget -O patch.diff https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff
--2022-08-04 13:05:03-- https://github.com/DaehwanKimLab/centrifuge/compare/v1.0.4...master.diff
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1951 (1.9K) [text/plain]
Saving to: ‘patch.diff’

patch.diff 100%[===============================================================================================================>] 1.91K --.-KB/s in 0s

2022-08-04 13:05:03 (25.7 MB/s) - ‘patch.diff’ saved [1951/1951]

ramnageena11 commented 1 year ago

Output for the third command (still running...):

patch -p0 <patch.diff
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:

|diff --git a/centrifuge-download b/centrifuge-download
|index cae1bcd..aaa72cf 100755
|--- a/centrifuge-download
|+++ b/centrifuge-download

File to patch:

fanninpm commented 1 year ago

I was afraid of that. Type

centrifuge-download

at that prompt.
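
The prompt appears because the diff header names the file as a/centrifuge-download and b/centrifuge-download, and -p0 keeps those prefixes. The sketch below only mimics how the -p option strips leading path components. strip_components is a helper written for this illustration and is not how patch itself is implemented.

```shell
# Drop the first N slash-separated components from a path, like patch -pN.
strip_components() {
    n="$1"
    path="$2"
    while [ "$n" -gt 0 ]; do
        case "$path" in
            */*) path="${path#*/}" ;;   # remove everything up to the first slash
            *) break ;;                 # nothing left to strip
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

strip_components 0 a/centrifuge-download   # prints a/centrifuge-download
strip_components 1 a/centrifuge-download   # prints centrifuge-download
```

By this logic, patch -p1 <patch.diff run from the same directory would probably have found centrifuge-download without prompting.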

ramnageena11 commented 1 year ago

After typing centrifuge-download:

File to patch: centrifuge-download
patching file centrifuge-download
Hunk #1 succeeded at 363 (offset 1 line).

Finished.

fanninpm commented 1 year ago

Then you can use popd to get back to where you were before.

ramnageena11 commented 1 year ago

done

ramnageena11 commented 1 year ago

what should i do next?

fanninpm commented 1 year ago

Try that whole make command again.

ramnageena11 commented 1 year ago

ok

ramnageena11 commented 1 year ago

Hi, please see this:

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices THREADS=24 IDX_NAME='p_compressed+v' ANY_LEVEL_GENOMES='viral' COMPLETE_GENOMES_COMPRESSED='archaea bacteria fungi'
make: Entering directory '/home/majorram/anaconda3/envs/diversity/share/centrifuge/indices'
mkdir -p reference-sequences
[[ -d tmp_p_compressed+v ]] && rm -rf tmp_p_compressed+v; mkdir -p tmp_p_compressed+v
Downloading and dust-masking viral
centrifuge-download -o tmp_p_compressed+v -m -a "Any" -d "viral" -P 24 refseq > \
tmp_p_compressed+v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
Downloading 11699 viral genomes at assembly level Any ... (will take a while)
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 2/11699dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 7/11699dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

ramnageena11 commented 1 year ago

make -C "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices THREADS=24 IDX_NAME='p_compressed+v' ANY_LEVEL_GENOMES='viral' COMPLETE_GENOMES_COMPRESSED='archaea bacteria fungi'

Does this have anything to do with "archaea bacteria fungi"? I added fungi here as well.

fanninpm commented 1 year ago

What happens when you kill the previous invocation, and you only make a viral database?

make -f "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices/Makefile THREADS=0 v

fanninpm commented 1 year ago

I have a hunch as to why you might be getting the error with libssl.so.1.0.0.

I find that the simplest way to solve this kind of problem is by re-making your Conda environment from scratch using a YAML file. Here is an example for that YAML file:

name: rename-me-with-whatever-you-want
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - centrifuge
  - blast
  - kmer-jellyfish
  - mummer
  - make

(Please note that the order of channels matters. conda-forge needs to be specified first in order to avoid specific cryptic error messages.)
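
A sketch of how such a file is typically used, assuming it has been saved as environment.yml (a file name chosen here for illustration). conda env create -f is the standard subcommand for building an environment from a YAML file. The wrapper function recreate_env is invented for this example and is only defined, not called.

```shell
# Create the environment described in the YAML file given as $1.
recreate_env() {
    conda env create -f "$1"
}

# Not run here; with the file saved as environment.yml one would call:
#   recreate_env environment.yml
#   conda activate rename-me-with-whatever-you-want
```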

Installing mamba may also help with some dependency resolution issues, as mamba has a faster and more robust dependency resolver than conda.

CAUTION: after this, you will have to redo those patching steps I guided you through earlier.
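
To check whether the libssl.so.1.0.0 error is gone after rebuilding, one option on Linux is to ask ldd which shared libraries a binary cannot resolve. This is a rough diagnostic sketch that assumes ldd is available. The helper name unresolved_libs is made up for this example.

```shell
# List shared libraries that the dynamic loader cannot resolve for a binary.
unresolved_libs() {
    ldd "$1" 2> /dev/null | grep 'not found'
}

# Not run here; inside the affected environment one could try:
#   unresolved_libs "$(command -v dustmasker)"
```

No output would suggest that every library dustmasker links against can be located.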

ramnageena11 commented 1 year ago

What happens when you kill the previous invocation, and you only make a viral database?

make -f "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices/Makefile THREADS=0 v

Please see the output:

make -f "$(dirname $(dirname $(command -v centrifuge)))"/share/centrifuge/indices/Makefile THREADS=20 v
Making: v: v
make -f /home/majorram/anaconda3/envs/diversity/share/centrifuge/indices/Makefile IDX_NAME=v
make[1]: Entering directory '/home/majorram'
mkdir -p reference-sequences
[[ -d tmp_v ]] && rm -rf tmp_v; mkdir -p tmp_v
Downloading and dust-masking viral
centrifuge-download -o tmp_v -m -a "Any" -d "viral" -P 20 refseq > \
tmp_v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
Downloading 11699 viral genomes at assembly level Any ... (will take a while)
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 1/11699dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 2/11699dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 3/11699dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
dustmasker: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Progress : [----------------------------------------] 0% 4/11699dustmasker: error while loading shared libraries: