Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Error during pplacer #170

Closed morgvevans closed 4 years ago

morgvevans commented 5 years ago

Having issues with pplacer step - it seems everything else is working, and the check install step works just fine so I don't think it's an install issue...

I have tried giving the command different numbers of CPUs, so I don't think it's a memory thing? As you can see, it runs for about 38 minutes before the error message, and this doesn't change regardless of the number of CPUs I give it.

(gtdb_env_2) -bash-4.2$ gtdbtk classify_wf --genome_dir ./metawrap/Shale/Bin_classify_GTDB/bin_one/ --out_dir ./metawrap/Shale/binone_output_072319_2 -d --force --cpus 64 --force
[2019-07-23 21:49:26] INFO: GTDB-Tk v0.3.2
[2019-07-23 21:49:26] INFO: gtdbtk classify_wf --genome_dir ./metawrap/Shale/Bin_classify_GTDB/bin_one/ --out_dir ./metawrap/Shale/binone_output_072319_2 -d --force --cpus 64 --force
[2019-07-23 21:49:26] INFO: Using GTDB-Tk reference data version r89: /users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/share/gtdbtk-0.3.2/db/
[2019-07-23 21:49:26] INFO: Identifying markers in 1 genomes with 64 threads.
[2019-07-23 21:49:26] INFO: Running Prodigal to identify genes.
==> Finished processing 1 of 1 (100.0%) genomes.
[2019-07-23 21:49:37] INFO: Identifying TIGRFAM protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2019-07-23 21:49:43] INFO: Identifying Pfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2019-07-23 21:49:44] INFO: Done.
[2019-07-23 21:49:48] INFO: Aligning markers in 1 genomes with 64 threads.
[2019-07-23 21:49:48] INFO: Processing 1 genomes identified as bacterial.
[2019-07-23 21:49:57] INFO: Read concatenated alignment for 23458 GTDB genomes.
[2019-07-23 21:50:15] INFO: Masking columns of multiple sequence alignment using canonical mask.
[2019-07-23 21:51:18] INFO: Masked alignment from 41155 to 5040 AAs.
[2019-07-23 21:51:18] INFO: 0 user genomes have amino acids in <10.0% of columns in filtered MSA.
[2019-07-23 21:51:18] INFO: Creating concatenated alignment for 23459 GTDB and user genomes.
[2019-07-23 21:51:19] INFO: Creating concatenated alignment for 1 user genomes.
[2019-07-23 21:51:19] INFO: Done.
[2019-07-23 21:51:19] INFO: Placing 1 bacterial genomes into reference tree with pplacer (be patient).
[2019-07-23 22:29:25] ERROR: An error was encountered while running pplacer.
[2019-07-23 22:29:25] ERROR: Controlled exit resulting from an unrecoverable error or warning.

================================================================================
EXCEPTION: PplacerException
  MESSAGE:

Traceback (most recent call last):
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/bin/gtdbtk", line 452, in
    gt_parser.parse_options(args)
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/lib/python2.7/site-packages/gtdbtk/main.py", line 602, in parse_options
    self.classify(options)
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/lib/python2.7/site-packages/gtdbtk/main.py", line 415, in classify
    options.debug)
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/lib/python2.7/site-packages/gtdbtk/classify.py", line 320, in run
    scratch_dir)
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/lib/python2.7/site-packages/gtdbtk/classify.py", line 146, in place_genomes
    pplacer.run(self.cpus, 'WAG', pplacer_ref_pkg, pplacer_json_out, user_msa_file, pplacer_out, pplacer_mmap_file)
  File "/users/PAS1331/osu7930/miniconda3/envs/gtdb_env_2/lib/python2.7/site-packages/gtdbtk/external/pplacer.py", line 61, in run
    raise PplacerException(proc_err)
PplacerException

aaronmussig commented 5 years ago

Hello,

There is a known issue with pplacer where using a high number of threads can cause problems, mainly because the host thinks more memory is being used than actually is. I'd recommend running it again with a smaller number of threads (roughly 10-30).

There may be more information available in the pplacer log file which you can find in the output directory under: classify/intermediate_results/pplacer/pplacer.[marker_set].out
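
A minimal sketch for locating that log and printing its last lines, assuming the output directory layout described above. The helper names find_pplacer_logs and tail are hypothetical and are not part of GTDB-Tk.

```python
from pathlib import Path


def find_pplacer_logs(out_dir):
    """Return pplacer log files found under a GTDB-Tk output directory."""
    log_dir = Path(out_dir) / "classify" / "intermediate_results" / "pplacer"
    # The marker set part of the file name varies (such as bac120), so glob for it.
    return sorted(log_dir.glob("pplacer.*.out"))


def tail(path, n=5):
    """Return the last n non-empty lines of a text file."""
    text = Path(path).read_text(errors="replace")
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[-n:]
```

For example, looping over find_pplacer_logs("./metawrap/Shale/binone_output_072319_2") and printing tail(log) for each result shows where pplacer stopped.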

morgvevans commented 5 years ago

Thank you for the quick reply.

I'll try to run with fewer threads and get back to you on if it worked or not.

Here's the output of the file you listed in the meantime -

Running pplacer v1.1.alpha19-0-g807f6f3 analysis on ./metawrap/Shale/binone_output_072319_2/align/gtdbtk.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
Pre-masking sequences... sequence length cut from 5040 to 4646.
Determining figs... figs disabled.
Allocating memory for internal nodes... done.
Caching likelihood information on reference tree...

morgvevans commented 5 years ago

I am still getting the same error when I use --cpus 1

morgvevans commented 5 years ago

OK, I ran a test genome that is in a slightly different format (it is an annotated .fna file) and it worked with no problem! The genomes I have been trying to run are genomic bins in .fna format, and I can't quite figure out why I'm having a hard time with them. I have included one of the files here if anyone can take a look and help. The only thing I can think of is that there is either a weird character or some kind of formatting issue. bin.1.orig.zip

morgvevans commented 5 years ago

More helpful info to add to the last message as well: the bins were generated using MetaWRAP and annotated using PROKKA. I was originally using the raw, unannotated file (see above). I've also tried the annotated files with no luck... bin.1.orig.annotated.zip

Thanks !

morgvevans commented 5 years ago

I was able to get all these bins to run in GTDB in KBase, so it must be an issue w/ my installation (which was done through bioconda). I will try and get the command line version to work for future jobs that require it maybe by installing from pip instead of bioconda? If anyone has any ideas please let me know.

vinisalazar commented 5 years ago

I'm having this same problem with 1 thread. So far I couldn't get it to work.

The gtdbtk test command runs smoothly, though.

aaronmussig commented 5 years ago

Thanks for the detailed feedback and the genomes, it was very useful for testing. Unfortunately, I haven't been able to replicate the issue trying either: pip install, conda environment, or a manual build.

What I've observed is:

Given pplacer is failing on the caching step, I would be inclined to say that the server has insufficient memory. Can I ask how much RAM the server has?

If you do have sufficient memory then I'll do a bit of digging into if conda has any sort of upper memory limit.

aaronmussig commented 5 years ago

I tried this again, capping the maximum amount of memory using ulimit, and the exception thrown did contain the following:

EXCEPTION: PplacerException
  MESSAGE: Uncaught exception: Out of memory
Fatal error: exception Out_of_memory

Perhaps ulimit works differently and I get a more detailed output... or perhaps memory isn't the issue, it's still worth testing.
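
A rough Python analogue of that test, for anyone who wants to repeat it. It assumes the cap is applied as an address-space limit (RLIMIT_AS); the comment above does not say which ulimit option was used, and run_with_memory_cap is a hypothetical helper name. This is Unix only.

```python
import resource
import subprocess


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its address space soft-limited to max_bytes (Unix only)."""

    def limit_memory():
        # Runs in the child process just before exec; the hard limit is left as-is.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = max_bytes
        else:
            soft = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    # Capture both streams so any message from the failing program is kept.
    return subprocess.run(cmd, preexec_fn=limit_memory,
                          capture_output=True, text=True)
```

Calling it with the pplacer argument list and a cap such as 64 * 1024**3 returns a result whose returncode, stdout and stderr can be inspected for an out-of-memory message.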

I would also try running the command manually, you will need to change it a bit but it should be something like this:

pplacer -m WAG -j 1 -c $GTDBTK_DATA_PATH/pplacer/gtdb_r89_bac120.refpkg -o /tmp/pplacer.bac120.json ~/classify_wf/align/gtdbtk.bac120.user_msa.fasta

That way we can at least validate that it's pplacer which is the issue and maybe capture any additional output which is being missed.
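
One way to capture everything pplacer prints during such a manual run is sketched below. It only reuses the options from the command above (-m, -j, -c, -o); build_pplacer_cmd and run_and_capture are hypothetical helper names, and the paths are whatever applies on your system.

```python
import subprocess


def build_pplacer_cmd(refpkg, msa, json_out, threads=1, model="WAG"):
    """Assemble the pplacer command line shown above as an argument list."""
    return ["pplacer", "-m", model, "-j", str(threads),
            "-c", str(refpkg), "-o", str(json_out), str(msa)]


def run_and_capture(cmd, log_path):
    """Run cmd, writing combined stdout and stderr to log_path; return the exit code."""
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

A negative exit code from subprocess means the process was terminated by a signal (for example -9 for SIGKILL), which is worth noting alongside the captured log.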

Handymanalan commented 5 years ago

I am having the same problem, the test genomes worked but not my own MAGs.

Edit: It seems to work only with archaeal MAGs but not with bacterial MAGs.

aaronmussig commented 5 years ago

@Handymanalan are you able to get this to run manually using the pplacer command listed above? Additionally, how much memory is on the server?

xwu35 commented 5 years ago

I also got the same error with both the pip and bioconda installed versions. I then tried to run pplacer manually as indicated above, and it also did not work.

aaronmussig commented 5 years ago

Hi @xwu35, what did pplacer output when it failed? It sounds like this is a problem specific with pplacer, I just need to confirm that the input isn't malformed. Additionally, how much memory is on the server?

xwu35 commented 5 years ago

Hi, the code I used was:

pplacer -m WAG -j 6 -c /lustre/haven/user/xwu35/database/release89/pplacer/gtdb_r89_bac120.refpkg -o /tmp/pplacer.bac120.json gtdbtk_good_bins_output/align/gtdbtk.bac120.user_msa.fasta

and the output was like this:

Pre-masking sequences... sequence length cut from 5040 to 5040.
Determining figs... figs disabled.
Allocating memory for internal nodes... done
Caching likelihood information on reference tree... kill

But the weird thing is that I just ran pplacer and gtdbtk from both versions again and they all worked... We have 192 GB of RAM per node on the server. Thanks

aaronmussig commented 5 years ago

@xwu35 Given that the caching likelihood information step is when pplacer loads 100+GB of data into memory it sounds like it ran out of memory. Unfortunately, pplacer isn't awfully descriptive with its error messages.
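
Since the messages are sparse, a small heuristic can help triage a captured pplacer log. The sketch below only encodes the two patterns seen in this thread (an explicit out-of-memory message, or output that stops at the caching step); diagnose_pplacer_output is a hypothetical helper and its verdicts are hints, not a definitive diagnosis.

```python
CACHING_STEP = "Caching likelihood information on reference tree"


def diagnose_pplacer_output(text):
    """Return a short hint based on where the pplacer output stopped."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "no output captured"
    if any("Out of memory" in line or "Out_of_memory" in line for line in lines):
        return "pplacer reported out of memory"
    # Output ending at the caching step matches the failures reported above.
    if CACHING_STEP in lines[-1]:
        return "stopped during the caching step; check available memory"
    return "no memory-related pattern recognised"
```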

192 GB of RAM is sufficient to run GTDB-Tk, however, there may be some insights into how pplacer operates on HPCs which you can read about in issue #124

Thanks, Aaron

micro-phia commented 5 years ago

I'd like to record an instance of this same issue:

I used an HPC Slurm system to run the following gtdbtk command within a for-loop, in which ${COM} represents a different community, hosting the necessary genome folder ('MAG-genomes') as an input. (There are 11 ${COM} folders, each with ~30 MAGs in .fa format. So, essentially, gtdbtk should be operating on ~30 MAGs at a time.)

gtdbtk classify_wf --genome_dir ./${COM}/CHECKM/MAG-genomes/ --out_dir ./${COM}/GTDB -x fa

I tried several iterations of this code with the following HPC parameters to accommodate the memory requirements (after finding this forum):

I also tried varying the assigned CPUs using the --cpus flag in the gtdbtk command.

In all instances, the gtdbtk command was able to complete everything except the 'classify' step (before moving on to the next community in the for-loop). At the classify step, pplacer was only able to place and classify the archaeal genomes. When pplacer attempted to run on the bacterial genomes, it crashed with the following output messages (which demonstrate the ability to place archaeal genomes but not bacterial).

[2019-10-23 14:05:44] INFO: Placing 2 archaeal genomes into reference tree with pplacer (be patient).
[2019-10-23 14:06:39] INFO: Calculating average nucleotide identity using FastANI.
[2019-10-23 14:06:43] INFO: 0 genomes have been classify using FastANI and pplacer.
[2019-10-23 14:06:43] INFO: Calculating RED values based on reference tree.
[2019-10-23 14:06:43] INFO: Placing 16 bacterial genomes into reference tree with pplacer (be patient).
[2019-10-23 14:16:05] ERROR: An error was encountered while running pplacer.
[2019-10-23 14:16:06] ERROR: Controlled exit resulting from an unrecoverable error or warning.

================================================================================
EXCEPTION: PplacerException
  MESSAGE:

This error was corroborated by the presence of only the archaeal final output files, but not the bacterial ones.

I have not yet found a valid solution, since my HPC allotment is capped at 4 nodes. Running this in KBase is hardly an option, since the KBase app seems to require a separate run for each genome and I have almost 300 MAGs to work with.

Any updates or bug-fixes would be HUGELY appreciated.

Best, Phia

donovan-h-parks commented 5 years ago

Hi Phia. Best we can tell this is an issue with how pplacer interacts with some HPC environments. The end result being that it appears pplacer is requesting increasing memory for each additional CPU though in reality this isn't the case. Unfortunately, we do not have a solution to this. I think your options are either to run GTDB-Tk with only a few CPUs in order to keep the apparent memory request within reason or to use KBase. You can create a genome set at KBase and process multiple genomes at once through GTDB-Tk.

A major part of the time required by GTDB-Tk is loading in the reference tree. As such, it is far more efficient to process all MAGs at once (i.e. combining the MAGs from all your communities into a single job for processing).
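
A minimal sketch of that consolidation, assuming the per-community layout from the command earlier in the thread (each community folder containing CHECKM/MAG-genomes/*.fa). The helper name gather_mags and the community-name prefix are illustrative choices. Files are copied rather than symlinked so the result does not depend on the filesystem supporting symlinks.

```python
import shutil
from pathlib import Path


def gather_mags(root, dest, pattern="*/CHECKM/MAG-genomes/*.fa"):
    """Copy every MAG matching pattern under root into one flat directory."""
    root, dest = Path(root), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for mag in sorted(root.glob(pattern)):
        community = mag.relative_to(root).parts[0]
        # Prefix with the community name so identically named bins do not collide.
        target = dest / f"{community}__{mag.name}"
        shutil.copy2(mag, target)
        copied.append(target)
    return copied
```

The combined directory can then be passed to a single run, for example gtdbtk classify_wf --genome_dir all_mags --out_dir GTDB_all -x fa, where all_mags and GTDB_all are placeholder names.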

micro-phia commented 5 years ago

Thank you for your quick response, dparks. I have two follow-up questions:

1) When running this remotely on an HPC system, I am still unable to get past the pplacer memory issue when I run with only 1 CPU on 1 node (64 GB), and I run into the same issue with 1 assigned CPU on 2-4 nodes (128-256 GB). I now understand that the program requires 100 GB but cannot use multiple nodes/CPUs to obtain that extra memory, so the additional nodes don't really help. I plan to request access to two alternative HPC systems where I can run on one 256 GB node or a different, single 512 GB node. Do you expect that either of these single, big-memory nodes will solve the problem?

2) Thank you for the tip on creating a genome set at KBase - however, it appears there is no method to transfer genomes in batch from the "staging area" to the "narrative" of KBase. This means that each of my 300 genomes would still need to be imported individually, even if the "upload" and "gtdbtk" apps can be performed on batch files. Do you know of a work-around?

Thank you in advance, Phia

donovan-h-parks commented 5 years ago

Hi Phia. GTDB-Tk requires a machine with access to 128 GB of RAM. Sounds like each of your nodes only has 64 GB of RAM which is not sufficient. I am not familiar with KBase. Perhaps you can send a help request to them regarding how to upload multiple genomes.
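
A quick way to check a machine against that figure is sketched below, assuming a Linux or other POSIX system where os.sysconf exposes the page size and page count. The helper names are hypothetical. Note that this reports the RAM of the node the script runs on, which on an HPC system may be more than the scheduler actually grants the job.

```python
import os

REQUIRED_GIB = 128  # figure quoted in the comment above


def total_ram_gib():
    """Return physical RAM in GiB using POSIX sysconf values."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def has_enough_ram(total_gib, required_gib=REQUIRED_GIB):
    """Compare a RAM figure in GiB against the required amount."""
    return total_gib >= required_gib
```

A node sold as "128 GB" typically reports slightly less than 128 GiB here, so treat a near miss as borderline rather than as a clear failure.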

aaronmussig commented 4 years ago

Closing this issue as I believe it's been resolved, the FAQ page has been updated with a summary of what we've learned. Feel free to re-open if that's not the case.

There are two main issues which were identified in this issue:

  1. pplacer failed because it was run on a server with insufficient memory.
  2. pplacer failed because it was run on an HPC/queueing system with multiple threads - which misled the OS into thinking it was out of memory.

Issue 1 should be easily identifiable on GTDB-Tk 1.0.2 as a warning will be displayed if the server has insufficient memory.

Issue 2 is available for reference in the FAQ, additionally, overriding the number of threads pplacer can use will be available as a feature in the next release (#195).

micro-phia commented 4 years ago

Thank you Aaron - and I apologize for my lack of response. I consulted with your team at office hours, and was able to resolve the issue using a bigmem cluster.

Sincerely, Sophia

-- Sophia D. Ewens Ph.D Candidate in Microbiology

John D. Coates Laboratory | Website http://coateslab.berkeley.edu/ Department of Plant & Microbial Biology University of California, Berkeley

marcomeola commented 4 years ago

On a side note, it would be really helpful if the standard conda installation were set to version 1.2.0 of gtdbtk.

saad272 commented 2 years ago

Hi, I got the following error while running classify_wf and I don't understand why. Please help. Thank you!

[2022-10-14 19:20:59] INFO: GTDB-Tk v2.1.0
[2022-10-14 19:20:59] INFO: gtdbtk classify_wf --extension fa --cpus 14 --genome_dir . --out_dir gtdb/
[2022-10-14 19:20:59] INFO: Using GTDB-Tk reference data version r207: /home/IAME/db/gtdbtk-2.1.0/release207_v2
[2022-10-14 19:21:01] INFO: Identifying markers in 26 genomes with 14 threads.
[2022-10-14 19:21:01] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-10-14 19:21:36] INFO: Completed 26 genomes in 34.98 seconds (1.35 seconds/genome).
[2022-10-14 19:21:36] TASK: Identifying TIGRFAM protein families.
[2022-10-14 19:21:45] INFO: Completed 26 genomes in 9.09 seconds (2.86 genomes/second).
[2022-10-14 19:21:45] TASK: Identifying Pfam protein families.
[2022-10-14 19:21:46] INFO: Completed 26 genomes in 0.86 seconds (30.35 genomes/second).
[2022-10-14 19:21:46] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-10-14 19:21:46] TASK: Summarising identified marker genes.
[2022-10-14 19:21:47] INFO: Completed 26 genomes in 0.78 seconds (33.44 genomes/second).
[2022-10-14 19:21:47] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: OSError
  MESSAGE: [Errno 95] Operation not supported: 'identify/gtdbtk.failed_genomes.tsv' -> 'gtdb/gtdbtk.failed_genomes.tsv'
________________________________________________________________________________

Traceback (most recent call last):
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/__main__.py", line 98, in main
    gt_parser.parse_options(args)
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/main.py", line 816, in parse_options
    self.identify(options)
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/main.py", line 271, in identify
    markers.identify(genomes,
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/markers.py", line 243, in identify
    self._report_identified_marker_genes(genome_dictionary, out_dir, prefix,
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/markers.py", line 117, in _report_identified_marker_genes
    symlink_f(PATH_FAILS.format(prefix=prefix),
  File "/usr/bin/miniconda2/envs/gtdbtk-2.1.0/lib/python3.8/site-packages/gtdbtk/tools.py", line 245, in symlink_f
    os.symlink(src, dst)
OSError: [Errno 95] Operation not supported: 'identify/gtdbtk.failed_genomes.tsv' -> 'gtdb/gtdbtk.failed_genomes.tsv'
================================================================================
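
On Linux, Errno 95 is EOPNOTSUPP, and the traceback shows it being raised by os.symlink, which suggests the filesystem holding the output directory may not support symbolic links. The thread does not confirm this as the cause; the sketch below is just a way to test a directory, and supports_symlinks is a hypothetical helper name.

```python
import os
import tempfile


def supports_symlinks(directory):
    """Return True if a symlink can be created inside directory."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target")
        link = os.path.join(tmp, "link")
        open(target, "w").close()
        try:
            os.symlink(target, link)
        except OSError:
            # Raised, for example, on filesystems without symlink support.
            return False
        return os.path.islink(link)
```

If this returns False for the intended output location, pointing --out_dir at a directory on a different filesystem would be one thing to try.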

caizhangbin commented 1 year ago

Hi, I got the same issue. I tested with:

pplacer -m WAG -j 1 -c $GTDBTK_DATA_PATH/pplacer/gtdb_r89_bac120.refpkg -o /tmp/pplacer.bac120.json ~/classify_wf/align/gtdbtk.bac120.user_msa.fasta

and it showed:

Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
Pre-masking sequences... sequence length cut from 5035 to 4865.
Determining figs... figs disabled.
Allocating memory for internal nodes...
Uncaught exception: Out of memory
Fatal error: exception Out_of_memory