eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

temp file problem #286

Closed Wanli-HE closed 3 years ago

Wanli-HE commented 3 years ago

Hi!

I am using the new version of eggnog-mapper. With -m mmseqs it raises an error, so I ran mmseqs separately with a command like this:

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/bin/mmseqs search -a true /home/projects/ku_00041/archive/gut_sample_result/circular_non_readundant_gene-0.55/emappertmp_mmseqs_wb_b19ph/51e406339f9b4c65a36c2cc8f64af1cd /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/mmseqs/mmseqs.db /home/projects/ku_00041/archive/gut_sample_result/circular_non_readundant_gene-0.55/emappertmp_mmseqs_wb_b19ph/5b11e19723d04baa9a2576192978d7dc /home/projects/ku_00041/archive/gut_sample_result/circular_non_readundant_gene-0.55/emappertmp_mmseqs_wb_b19ph --start-sens 3 --sens-steps 3 -s 7 --threads 35

and i get the error:

Input /home/projects/ku_00041/archive/gut_sample_result/circular_non_readundant_gene-0.55/emappertmp_mmseqs_wb_b19ph/51e406339f9b4c65a36c2cc8f64af1cd does not exist.

I think the temp file is what causes this problem.

Is that true? And how can I solve it?

Wanli-HE commented 3 years ago

also with "-m diamond"

command line:

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/bin/diamond blastx -d /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/eggnog_proteins.dmnd -q /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/linear_non_redundant_genes.fa --threads 35 -o /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/emappertmp_dmdn_i5tdga_a/528f134e919d4ee9b9b0bd7f9950e557 --sensitive -e 0.001 --max-target-seqs 0 --max-hsps 0 --outfmt 6

result:

diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

CPU threads: 35
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: No such file or directory
Error: Error opening temporary file /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/emappertmp_dmdn_i5tdga_a/diamond-tmp-h2Ox1T

Cantalapiedra commented 3 years ago

Hi @Wanli-HE ,

you should send the output to a directory which actually exists. I am not sure whether those "emappertmp_" directories exist. They are usually created by emapper and removed afterwards. Although it is true that when emapper crashes sometimes they remain in place.

What error do you get when running diamond and/or mmseqs from emapper.py?

Best, Carlos
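The point about sending output to a directory that actually exists can be checked before launching a search by hand. The following is only a minimal sketch, not part of emapper: the helper name `missing_output_dir` and the example path are hypothetical, chosen for illustration.

```python
from pathlib import Path


def missing_output_dir(output_path):
    """Return the parent directory of output_path if it is missing, else None."""
    parent = Path(output_path).expanduser().resolve().parent
    return None if parent.is_dir() else parent


if __name__ == "__main__":
    # Hypothetical output path, used only for illustration.
    out = "results/test_diamond_out"
    missing = missing_output_dir(out)
    if missing is not None:
        print(f"Output directory does not exist: {missing}")
    else:
        print("Output directory exists; it can be passed to -o")
```

Running a check like this on the value given to -o would catch a leftover reference to an "emappertmp_" directory that has already been removed.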

Wanli-HE commented 3 years ago

Hi! Thanks for your answer,

here is the error

emapper-2.0.8-2

emapper.py -m diamond -i linear_non_redundant_genes.fa --itype metagenome -o all-gene --cpu 35 --data_dir /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2 --dmnd_db /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/eggnog_proteins.dmnd --temp_dir .

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/bin/diamond blastx -d /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/eggnog_proteins.dmnd -q /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/linear_non_redundant_genes.fa --threads 35 -o /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/emappertmp_dmdn_0qin8h8k/adbd0fc5432342659f1d09d8ada3197f --sensitive -e 0.001 --max-target-seqs 0 --max-hsps 0 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp

Error running diamond: Computing alignments...

Wanli-HE commented 3 years ago

Deallocating buffers... [0.435s]
Clearing query masking... [0.47s]
Opening temporary output file... [0.006s]
Computing alignments...
/var/spool/torque/mom_priv/jobs/30997971.SC: line 33: 10280 Killed diamond blastp -d /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/eggnog_proteins.dmnd -q linear_gene_prodigal_protein_seq.faa --threads 35 -o diamondres --sensitive -e 0.001 --max-target-seqs 0 --max-hsps 0 --outfmt 6 --no-unlink

This is the problem I get when running diamond directly.

Cantalapiedra commented 3 years ago

Honestly, I have no idea what is going on. Could there be some memory limit on your computer/nodes?

You could try running:

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/bin/diamond blastx -d /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/eggnog_proteins.dmnd -q /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/linear_non_redundant_genes.fa --threads 35 -o /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/test_diamond_out_dir --sensitive -e 0.001 --max-target-seqs 0 --max-hsps 0 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp

and report what happens.
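A "Killed" line like the one in the job log means the process was terminated by a signal rather than exiting on its own, which fits the memory-limit suspicion. When a command is launched from Python on a POSIX system, this shows up as a negative return code. The sketch below illustrates that check; it is not emapper code, and the function name `run_and_explain` is an assumption made for this example.

```python
import signal
import subprocess
import sys


def run_and_explain(cmd):
    """Run cmd (a list of arguments) and describe how the process ended."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    code = completed.returncode
    if code < 0:
        # On POSIX, a return code of -N means the child was ended by signal N.
        name = signal.Signals(-code).name
        hint = ""
        if -code == signal.SIGKILL:
            hint = " (possibly a memory limit enforced by the kernel or scheduler)"
        return f"terminated by {name}{hint}"
    if code != 0:
        return f"failed with exit status {code}"
    return "finished normally"


if __name__ == "__main__":
    # A child process that kills itself, standing in for a job killed by the system.
    demo = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_explain(demo))
```

SIGKILL alone does not prove a memory problem; the scheduler or kernel logs are the place to confirm it.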

Wanli-HE commented 3 years ago

Hi! It was a memory problem! I have solved it.

Thanks!

By the way, do you have any idea how long diamond annotation normally takes for a 250 MB nucleotide sequence file?

Cantalapiedra commented 3 years ago

Hi @Wanli-HE ,

glad that you solved it!

I see that you are using --itype metagenome with diamond. I would not recommend using diamond blastx for large assembled contigs or for genomes, since it could take a long time to complete. See: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.0.2-v2.0.8#Gene_Prediction_Options

A possibly better, and likely much faster, approach would be to use -m diamond --itype metagenome --genepred prodigal. That way, diamond would perform the search using the proteins predicted by prodigal as queries.

Another approach would be using MMseqs2 instead of diamond when using --itype metagenome.

Diamond blastx could be good to search CDS on small contigs (expected to bear a single CDS, for example), out of frame CDS, or a few contigs for which you wish to confirm the CDS detected by prodigal, for instance.

I hope this makes sense.

Best, Carlos
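For anyone scripting these runs, the suggested invocation can be assembled as an argument list. This is only a sketch that reuses options quoted in this thread (-m, -i, --itype, -o, --cpu, --genepred); the helper name `emapper_command` is hypothetical, and the available options may differ between emapper versions.

```python
import shlex


def emapper_command(input_fasta, output_prefix, cpus, genepred=None):
    """Build an emapper.py argument list from options quoted in this thread."""
    cmd = [
        "emapper.py", "-m", "diamond",
        "-i", input_fasta,
        "--itype", "metagenome",
        "-o", output_prefix,
        "--cpu", str(cpus),
    ]
    if genepred is not None:
        # With gene prediction enabled, the predicted proteins become the queries.
        cmd += ["--genepred", genepred]
    return cmd


if __name__ == "__main__":
    cmd = emapper_command("linear_non_redundant_genes.fa", "all-gene", 35,
                          genepred="prodigal")
    # shlex.join gives a copy-pasteable shell line (Python 3.8+).
    print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems if it is later passed to subprocess.run without shell=True.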

Wanli-HE commented 3 years ago

ok! thanks! i will try to do that!

Wanli-HE commented 3 years ago

hi Carlos!

When I use mmseqs to annotate genes with a command line like the one below:

emapper.py -m mmseqs -i linear_non_redundant_genes.part-001.fa --itype CDS --translate -o genes.part-001 --cpu 35 --data_dir /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2 --mmseqs_db /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/mmseqs --temp_dir .

it raises an error:

OSError: [Errno 39] Directory not empty: '/home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_6_zhui6x'

What happened?

Cantalapiedra commented 3 years ago

Hi @Wanli-HE ,

could you paste the whole output from emapper, please? To try to understand in which step the error is produced.

Thank you.

Best, Carlos

Wanli-HE commented 3 years ago

here is the output:

Working directory is /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96

emapper-2.0.8-2

emapper.py -m mmseqs -i linear_non_redundant_genes.part-003.fa --itype CDS --translate -o genes.part-003 --cpu 35 --data_dir /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2 --mmseqs_db /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/mmseqs --temp_dir .

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/bin/mmseqs createdb /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/tmpish70mtq /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/24b1df109a5f4d28b0cae926745f1559 --dbtype 1

/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/bin/mmseqs search -a true /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/24b1df109a5f4d28b0cae926745f1559 /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/mmseqs /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/d67f1a13dc4f419cb1eb66e75cc22616 /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y --start-sens 3 --sens-steps 3 -s 7 --threads 35

here is the error:

Traceback (most recent call last):
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 207, in search_step
    completed_process = subprocess.run(cmd, capture_output=True, check=True, shell=True)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/bin/mmseqs search -a true /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/24b1df109a5f4d28b0cae926745f1559 /home/projects/ku_00041/data/test-pplaa/Plaspline/db/EggNOGV2/mmseqs /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y/d67f1a13dc4f419cb1eb66e75cc22616 /home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y --start-sens 3 --sens-steps 3 -s 7 --threads 35' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 145, in _search
    raise e
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 140, in _search
    alignmentsdb, cmds = self.run_mmseqs(in_file, tempdir, querydb, self.targetdb, resultdb, bestresultdb)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 163, in run_mmseqs
    cmd = self.search_step(querydb, targetdb, resultdb, tempdir)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 209, in search_step
    raise EmapperException("Error running 'mmseqs search': "+cpe.stderr.decode("utf-8").strip().split("\n")[-1])
eggnogmapper.emapperException.EmapperException: Error running 'mmseqs search': Current input: Generic. Allowed input: Index, Nucleotide, Profile, Aminoacid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/bin/emapper.py", line 664, in <module>
    emapper.run(args, args.input, args.annotate_hits_table, args.cache_file)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/emapper.py", line 205, in run
    searcher = self.search(args, infile, predictor)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/emapper.py", line 131, in search
    searcher.search(queries_file,
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 123, in search
    return self._search(in_file, seed_orthologs_file)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/site-packages/eggnogmapper/search/mmseqs/mmseqs.py", line 147, in _search
    shutil.rmtree(tempdir)
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/projects/ku_00041/data/test-pplaa/Plaspline/conda_envs/759e9edb/lib/python3.9/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/home/projects/ku_00041/archive/gut_sample_result/bacteri-gene/linear_non_redundant_gene/split-96/emappertmp_mmseqs_2kng9e8y'

I searched the web, and it may be a problem with the "rm -rf" command.
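Reading the chained tracebacks from the top, the first failure is the 'mmseqs search' error, and the OSError from shutil.rmtree only appears while that error is being handled, so the "Directory not empty" message hides the more informative one. The sketch below shows one general way to clean up a temporary directory without replacing the original exception. It is not emapper's actual code; `run_with_tempdir` and `failing_step` are hypothetical names used for illustration.

```python
import shutil
import tempfile


def run_with_tempdir(step):
    """Run step(tempdir) and clean up without hiding the step's own exception."""
    tempdir = tempfile.mkdtemp(prefix="emappertmp_sketch_")
    try:
        return step(tempdir)
    finally:
        # ignore_errors=True keeps a failed cleanup from replacing the real error.
        shutil.rmtree(tempdir, ignore_errors=True)


def failing_step(tempdir):
    """Stand-in for a search step that fails, as 'mmseqs search' did above."""
    raise RuntimeError("Error running 'mmseqs search': Current input: Generic.")


if __name__ == "__main__":
    try:
        run_with_tempdir(failing_step)
    except RuntimeError as error:
        # The search error is what reaches the caller, not a cleanup error.
        print(error)
```

The trade-off is that a directory that cannot be removed is left behind silently, so a real tool might prefer to log a warning at that point.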

Wanli-HE commented 3 years ago

One more problem: I split my CDS genes.fa file into smaller parts. The original is about 6 GB, and I split it into 400 sub-files of about 16 MB each. I tried one of them with diamond blastp, but it still takes a long time: with 35 CPUs it has used over 200 CPU hours and has still not finished. What is going on behind this command? Is this normal?

Cantalapiedra commented 3 years ago

Hi @Wanli-HE ,

regarding the MMseqs2 error, it is showing this:

raise EmapperException("Error running 'mmseqs search': "+cpe.stderr.decode("utf-8").strip().split("\n")[-1])
eggnogmapper.emapperException.EmapperException: Error running 'mmseqs search': Current input: Generic. Allowed input: Index, Nucleotide, Profile, Aminoacid

It is detecting the input as "Generic". One reason could be that your FASTA files are not correctly formatted. Or there could be some bug affecting your input when emapper translates it. Please check that your files are correct so that we can rule that out.
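A quick way to rule out badly formatted input is to scan each sub-file for basic FASTA problems. The sketch below is an assumption about what "correctly formatted" might mean here (a header before any sequence, non-empty records, nucleotide characters only); it does not reproduce the checks MMseqs2 or emapper perform, and `check_fasta` is a hypothetical helper.

```python
def check_fasta(lines, alphabet="ACGTN"):
    """Return (record_count, problems) for an iterable of FASTA text lines."""
    allowed = set(alphabet)
    problems = []
    records = 0
    header = None   # header of the record currently being read
    length = 0      # sequence characters seen for that record
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None and length == 0:
                problems.append(f"record {header!r} has no sequence")
            header = line[1:]
            length = 0
            records += 1
            if not header:
                problems.append(f"line {number}: empty header")
        elif header is None:
            problems.append(f"line {number}: sequence data before any '>' header")
        else:
            bad = set(line.upper()) - allowed
            if bad:
                problems.append(f"line {number}: unexpected characters {sorted(bad)}")
            length += len(line)
    if header is not None and length == 0:
        problems.append(f"record {header!r} has no sequence")
    return records, problems
```

It could be applied to one of the sub-files with `with open("linear_non_redundant_genes.part-003.fa") as handle: print(check_fasta(handle))`. The record count it returns also answers the question of how many sequences each sub-file holds.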

Regarding the timings using diamond blastp, how many sequences do you have in each sub-file?

Cantalapiedra commented 3 years ago

Closing this issue. Feel free to re-open or re-issue.

Best, Carlos