continue from where is break up when run emapper.py

li0604 commented 3 years ago

Hi everyone,

My input file is large, so is there a parameter I could add to continue a run from the point where it was interrupted?

Cantalapiedra commented 3 years ago

Hi @li0604 ,

From which step would you like to resume? If the search finished entirely you could run only the annotation step with -m no_search --annotate_hits_table seed_orthologs.file

li0604 commented 3 years ago

Dear professor, I am glad to hear from you. The annotation step was interrupted. Could it be resumed? Thanks a lot! Yours, Qingmei


Cantalapiedra commented 3 years ago

Hi Qingmei,

what are the contents of your output folder?

li0604 commented 3 years ago

It looks like this. The output file is named output_file.emapper.annotations

Part of its contents:

k10_contig_1_1 | 335992.SAR11_0510 | 5.40E-19 | 100.1 | unclassified | Alphaproteobacteria | glc | 2.3.3.9 | ko:K0163ko00620,ko00630,ko01100,ko01110,ko01120,ko01200,map00620,map00630,map01100,map01110,map01120,map01200 | M00012 | R00472 | RC00004,RC00308,RC02747 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MVEV@1224,2TS9R@28211,4 | PUK@82117,COG2225@1,COG2225@2 | NA|NA|NA | C | Involved in the glycolate utilization. Catalyzes the condensation and subsequent hydrolysis of acetyl-coenzyme A (acetyl-CoA) and glyoxylate to form malate and CoA

k10_contig_1_2 | 1400524.KL370779_gene592 | 9.60E-26 | 122.5 | unclassified | Alphaproteobacteria | accA | 2.1.3.15,6.4.1.2 | ko:K01962,ko:K01963 | ko00061,ko00620,ko00640,ko00720,ko01100,ko01110,ko01120,ko01130,ko01200,ko01212,map00061,map00620,map00640,map00720,map01100,map01110,map01120,map01130,map01200,map01212 | M00082,M00376 | R00742,R04386 | RC00040,RC00253,RC00367 | ko00000,ko00001,ko00002,ko01000 | acteria | 1MURN@1224,2TR6V@28211,4 | P8Q@82117,COG0825@1,COG0825@2 | NA|NA|NA | I | Component of the acetyl coenzyme A carboxylase (ACC) complex. First, biotin carboxylase catalyzes the carboxylation of biotin on its carrier protein

k10_contig_2_1 | 857087.Metme_0730 | 7.50E-84 | 316.6 | Methylococcales | tldD | GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO:0005829,GO:0006508,GO:0006807,GO:0008150,GO:0008152,GO:0019538,GO:0043170,GO:0044238,GO:0044424,GO:0044444,GO:0044464,GO:0071704,GO:1901564 | ko:K03568 | ko00000,ko01002 | acteria | 1MUSK@1224,1RMA5@1236,1XE3Y@135618,COG0312@1,COG0312@2 | NA|NA|NA | S | modulator of DNA gyrase

Cantalapiedra commented 3 years ago

Hi,

unfortunately there is no way to directly resume the annotation step, although it would be a nice feature to implement. With current versions you could either:

1. Just re-run the whole annotation step using "-m no_search --annotate_hits_table seed_orthologs_file".
2. Create a new seed orthologs file without the entries that are already in output_file.emapper.annotations, and run the previous command with just those entries. You could try something like:

join -v 1 -t $'\t' <(grep -v "^#" seed_orthologs_file | sort) <(grep -v "^#" annotations_file | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations
rm remaining_seed_orthologs remaining.emapper.annotations output_file.emapper.annotations
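The filtering step can also be sketched with awk, which avoids the need for both inputs to be sorted consistently. This is only a minimal illustration: the file contents and query names below are made up, and it assumes both files are tab-separated with the query name in the first column and comment lines starting with "#".

```shell
#!/bin/sh
# Sketch: list seed orthologs whose query is not yet in the annotations file.
# All file contents below are made up for illustration.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

seeds="$workdir/seed_orthologs_file"
annots="$workdir/annotations_file"
remaining="$workdir/remaining_seed_orthologs"

# Toy seed orthologs table: query, seed ortholog, e-value, score.
printf '# comment line\n' > "$seeds"
printf 'q1\t111.A\t1e-10\t100.0\n' >> "$seeds"
printf 'q2\t222.B\t1e-20\t200.0\n' >> "$seeds"
printf 'q3\t333.C\t1e-30\t300.0\n' >> "$seeds"

# Toy annotations table from an interrupted run: only q1 was annotated.
printf '# comment line\n' > "$annots"
printf 'q1\t111.A\t1e-10\t100.0\tsome annotation\n' >> "$annots"

# While reading the first file, remember every annotated query name.
# While reading the second file, print the first four columns of every
# seed ortholog whose query was not seen in the first file.
awk -F '\t' -v OFS='\t' '
    /^#/ { next }
    FILENAME == ARGV[1] { done[$1] = 1; next }
    !($1 in done) { print $1, $2, $3, $4 }
' "$annots" "$seeds" > "$remaining"

remaining_queries=$(cut -f 1 "$remaining" | paste -s -d ' ' -)
echo "remaining queries: $remaining_queries"
```

The resulting remaining_seed_orthologs file would then be passed to --annotate_hits_table as in the commands above.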

I hope this helps.

Best, Carlos

li0604 commented 3 years ago

Dear professor, that's very kind of you. I think this suggestion will be helpful for me. Thanks a lot!

Yours sincerely, Qingmei.


Cantalapiedra commented 3 years ago

Glad to hear that. Let's see if we can implement the --resume option for annotations anytime soon.

Best, Carlos

li0604 commented 3 years ago

If the --resume option can be implemented, that would be wonderful.

Best, Qingmei.


Sofie8 commented 3 years ago

Hi, I am also in favour of a --resume option for the annotation step. What does the --resume option currently resume? I thought it resumed the annotation step, but after 72 hours of run time I exceeded my walltime, and when I restarted, it just erased everything :-( (and goodbye to my computing credits...). I am using emapper.py as implemented in atlas. We split into subsets of 500,000, but on a single machine with 36 threads and 198 GB of memory, 72 h is not enough to finish one subset. Does it scale well with more threads, or do you have a suggestion for how best to specify my jobs? I also have one big-memory node available (36 threads, 760 GB RAM), or an AMD one (64 nodes, 256 GB RAM). Thanks!

Cantalapiedra commented 3 years ago

Hi @Sofie8 ,

sorry to hear about your computing credits. I am not sure what atlas is. The --resume option is a somewhat old option used for hmmer searches. There is no actual resume option for the diamond, mmseqs or annotation steps.

Besides that, I would recommend not only splitting the dataset, but also the emapper steps, when running large datasets. Not sure if you are doing it already. It would be something like (depending on emapper version):

emapper.py -m diamond -i input.fasta -o test --output_dir outdir
emapper.py -m no_search --annotate_hits_table outdir/test.emapper.seed_orthologs -o test --output_dir outdir
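Splitting the dataset itself could be sketched with awk as below. This is only an illustration under assumptions: the toy FASTA content, the chunk size of 2 sequences and the chunk_N.fasta naming are all made up, and each chunk would then take the place of input.fasta in the search command.

```shell
#!/bin/sh
# Sketch: split a FASTA file into chunks of a fixed number of sequences.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Toy FASTA with 5 sequences (made-up content).
cat > input.fasta <<'EOF'
>seq1
MKV
>seq2
MAA
LLG
>seq3
MTT
>seq4
MGG
>seq5
MPP
EOF

seqs_per_chunk=2   # hypothetical; real runs would use a much larger number

# Open a new output file every seqs_per_chunk header lines, and write
# each line (header or sequence) to the currently open chunk.
awk -v n="$seqs_per_chunk" '
    /^>/ {
        if (count % n == 0) {
            if (out != "") close(out)
            out = sprintf("chunk_%d.fasta", count / n + 1)
        }
        count++
    }
    { print > (out) }
' input.fasta

chunk_count=$(ls chunk_*.fasta | wc -l | tr -d ' ')
echo "chunks written: $chunk_count"
for f in chunk_*.fasta; do
    echo "$f: $(grep -c '^>' "$f") sequences"
done
```

The sketch assumes the file starts with a header line; a file with leading non-header lines would need extra handling.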

Usually, the more threads, the faster it runs. Also, on the nodes with 256GB or more you could use -m mmseqs instead of -m diamond, which should be faster, if your emapper version includes the mmseqs option.

Also, in the latest versions there is an option to load the annotation DB into memory (--dbmem), which should speed up the annotation step quite a bit. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*#Annotation_Options

Best, Carlos

Sofie8 commented 3 years ago

Hi @Cantalapiedra ,

Thanks for your answer!

Yes, atlas is the metagenome analyses pipeline from Silas https://github.com/metagenome-atlas/atlas/issues/351

I have now run the eggNOG annotation step outside of atlas:

emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining_seed_orthologs \
    --no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 \
    --data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset2/eggNOG_annotate_hits_table.log

I followed this to annotate only the remaining entries and join the results back together:

join -v 1 -t $'\t' <(grep -v "^#" subset2.emapper.seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining_seed_orthologs
emapper.py -m no_search --annotate_hits_table remaining_seed_orthologs -o remaining ...
cat output_file.emapper.annotations remaining.emapper.annotations > all.emapper.annotations

But after cat, combine_egg_nogg_annotations says: Error Expected 22 fields in line 320861, saw 64

Is cat doing something with a blank line between the two files or putting things on 1 line? It is exactly there where the two files merge..
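One thing that could be checked here, as an assumption rather than a confirmed diagnosis: cat does not add or remove anything, but if the interrupted run stopped in the middle of a line, the first file ends without a newline and the first line of the second file gets glued onto it. A minimal sketch with made-up three-column files that reproduces and detects this (real annotation files have more columns):

```shell
#!/bin/sh
# Sketch: detect a truncated last line and odd field counts after cat.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Toy files with 3 tab-separated fields per line.
# first.tsv deliberately ends in a truncated line with no trailing newline.
printf 'q1\ta\tb\nq2\ta' > first.tsv
printf 'q3\ta\tb\nq4\ta\tb\n' > second.tsv

cat first.tsv second.tsv > merged.tsv

# Succeeds if the last byte of the file is a newline: command substitution
# strips a trailing newline, so the captured text is then empty.
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}

if ends_with_newline first.tsv; then
    first_status="complete"
else
    first_status="truncated"
fi
echo "first.tsv last line: $first_status"

# Report lines whose field count differs from the expected one.
expected_fields=3
bad_lines=$(awk -F '\t' -v n="$expected_fields" \
    'NF != n { print NR ": " NF " fields" }' merged.tsv)
echo "unexpected lines in merged.tsv:"
echo "$bad_lines"
```

If the first file does turn out to be truncated, its incomplete last line would probably need to be removed before computing the remaining seed orthologs, so that the affected query is annotated again.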

Ok for the other suggestions! @SilasK, could this be useful for further improving the genecatalog step? Note also that --resume is not actually resuming: if the annotation step is interrupted or not completed in atlas, it starts the annotation all over again. So in my case I have to split into subsets smaller than 500,000 to finish within 3 days (36 threads, 198 GB RAM). In the latest release there are:

the option --mmseqs
the option --dbmem

Best, Sofie

SilasK commented 3 years ago

See my response at: https://github.com/metagenome-atlas/atlas/issues/351

Cantalapiedra commented 3 years ago

Hi,

just a reminder that the --dbmem option needs around 40GB of free memory, that using mmseqs requires downloading the corresponding eggnog-mapper mmseqs database (using the download script), and that this option (--mmseqs) requires a lot of memory to run.

Therefore, in both cases it is recommended to run fewer jobs with more sequences per job, and to set the number of jobs per computer or cluster node according to the available memory.
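As a rough illustration of that rule of thumb, the sketch below divides node memory by the approximate 40GB needed for --dbmem. It assumes each job loads its own copy of the database and ignores memory used by the search step and the operating system, so the results are upper bounds at best; the jobs_per_node helper is hypothetical and the node sizes are the ones mentioned earlier in this thread.

```shell
#!/bin/sh
# Sketch: upper bound on concurrent annotation jobs per node by memory.
set -eu

# Hypothetical helper: integer number of jobs that fit in node memory.
jobs_per_node() {
    node_mem_gb=$1
    job_mem_gb=$2
    echo $(( node_mem_gb / job_mem_gb ))
}

dbmem_gb=40   # approximate figure quoted above for --dbmem

for node_mem in 198 256 760; do
    echo "${node_mem} GB node: at most $(jobs_per_node "$node_mem" "$dbmem_gb") jobs"
done
```

In practice the number of jobs would also be limited by the threads available on each node.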

Best, Carlos

SilasK commented 3 years ago

Is the mmseqs version already officially released? I didn't know that. How much memory does mmseqs use? Do you use profiles or search mode?

If I'm not mistaken, you don't use mmseqs during emapper.py --annotate_hits_table, do you?

Cantalapiedra commented 3 years ago

Hi,

MMseqs can be used for the search step in the "refactor" branch, which we hope to merge soon with the "master" one. https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2-*refactor*

So far we are using search mode. To be honest, I don't know exactly how much (peak) memory it uses. I am currently running some jobs on nodes with 236GB, and that seems to be enough. With less than 200GB I would use diamond or hmmer in server mode.

You are right, --annotate_hits_table (along with -m no_search in the refactor version) is used to run the annotation step without running the previous search step, so no MMseqs (nor diamond) is involved there. The "--dbmem" option is used during the annotation step though: using a little less than 40GB, it loads the sqlite3 DB into memory before annotating (which could be a convenient replacement for /dev/shm).

Best, Carlos

Cantalapiedra commented 3 years ago

Since version 2.1.0, --resume resumes most emapper stages.

eggnogdb / eggnog-mapper

continue from where is break up when run emapper.py #249