Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

classify_wf workflow stops during pplacer step #70

Closed watsonar closed 5 years ago

watsonar commented 5 years ago

Hello,

I am trying to run the following command: gtdbtk classify_wf --cpus 13 --batchfile MAG_paths.txt --out_dir gtdbtk_output to assign taxonomy to a group of MAGs.
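For readers hitting the same issue: GTDB-Tk's --batchfile expects a two-column, tab-separated file giving the path to each genome's FASTA file and an identifier for that genome. A minimal sketch of building one (the paths and bin names here are hypothetical):

```shell
# Build a hypothetical batchfile: <path-to-fasta><TAB><genome-id>
printf '%s\t%s\n' MAGs/bin_001.fa bin_001 >  MAG_paths.txt
printf '%s\t%s\n' MAGs/bin_002.fa bin_002 >> MAG_paths.txt

# Sanity check: every line must have exactly two tab-separated fields
awk -F'\t' 'NF != 2 { exit 1 }' MAG_paths.txt && echo "batchfile OK"
```

classify_wf can alternatively take --genome_dir with -x to pick up every FASTA file with a given extension from a directory, as used later in this thread.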

However, the process always stops during the pplacer step and does not give any errors. The only file in the pplacer folder which is generated is pplacer.bac120.out, which contains the following:

Running pplacer v1.1.alpha18-2-gcb55169 analysis on gtdbtk_output/gtdbtk.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
WARNING: your tree has zero pendant branch lengths. This can lead to zero likelihood values which will keep you from being able to place sequences. You can remove identical sequences with seqmagick.
Pre-masking sequences... sequence length cut from 5035 to 5035.
Warning: pplacer results make the most sense when the given tree is multifurcating at the root. See manual for details.
Determining figs... figs disabled.
Allocating memory for internal nodes... done.
Caching likelihood information on reference tree... done.
Pulling exponents... done.
Preparing the edges for baseball... done.

All the appropriate files for the identify and align steps are generated (though gtdbtk.bac120.filtered.tsv is empty) but none of the classify step files are produced (according to the expected output files described here).

gtdbtk.log contains the following:

[2018-12-18 14:42:12] INFO: GTDB-Tk v0.1.3
[2018-12-18 14:42:12] INFO: gtdbtk classify_wf --cpus 13 --batchfile MAG_paths.txt --out_dir gtdbtk_output
[2018-12-18 14:42:12] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-12-18 14:42:13] INFO: Identifying markers in 311 genomes with 13 threads.
[2018-12-18 14:42:13] INFO: Running Prodigal to identify genes.
[2018-12-18 14:50:54] INFO: Identifying TIGRFAM protein families.
[2018-12-18 14:56:52] INFO: Identifying Pfam protein families.
[2018-12-18 14:58:41] INFO: Done.
[2018-12-18 14:58:42] INFO: Aligning markers in 311 genomes with 13 threads.
[2018-12-18 14:58:42] INFO: Processing 311 genomes identified as bacterial.
[2018-12-18 14:58:44] INFO: Read concatenated alignment for 21263 GTDB genomes.
[2018-12-18 15:05:42] INFO: Masking columns of multiple sequence alignment.
[2018-12-18 15:07:02] INFO: Masked alignment from 41155 to 5035 AA.
[2018-12-18 15:07:02] INFO: Creating concatenated alignment for 21574 taxa.
[2018-12-18 15:07:04] INFO: Done.
[2018-12-18 15:07:04] INFO: Placing 311 bacterial genomes into reference tree with pplacer (be patient).
[2018-12-20 11:44:42] INFO: GTDB-Tk v0.1.3
[2018-12-20 11:44:42] INFO: gtdbtk classify_wf --cpus 13 --batchfile MAG_paths.txt --out_dir gtdbtk_output
[2018-12-20 11:44:42] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-12-20 11:44:42] INFO: Identifying markers in 311 genomes with 13 threads.
[2018-12-20 11:44:42] INFO: Running Prodigal to identify genes.
[2018-12-20 11:53:30] INFO: Identifying TIGRFAM protein families.
[2018-12-20 11:59:22] INFO: Identifying Pfam protein families.
[2018-12-20 12:00:16] INFO: Done.
[2018-12-20 12:00:16] INFO: Aligning markers in 311 genomes with 13 threads.
[2018-12-20 12:00:16] INFO: Processing 311 genomes identified as bacterial.
[2018-12-20 12:00:18] INFO: Read concatenated alignment for 21263 GTDB genomes.
[2018-12-20 12:06:25] INFO: Masking columns of multiple sequence alignment.
[2018-12-20 12:07:46] INFO: Masked alignment from 41155 to 5035 AA.
[2018-12-20 12:07:46] INFO: Creating concatenated alignment for 21574 taxa.
[2018-12-20 12:07:48] INFO: Done.
[2018-12-20 12:07:49] INFO: Placing 311 bacterial genomes into reference tree with pplacer (be patient).
[2019-01-07 14:58:45] INFO: GTDB-Tk v0.1.3
[2019-01-07 14:58:45] INFO: gtdbtk classify_wf --cpus 21 --batchfile MAG_paths.txt --out_dir gtdbtk_output --debug
[2019-01-07 14:58:45] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2019-01-07 14:58:45] INFO: Identifying markers in 311 genomes with 21 threads.
[2019-01-07 14:58:45] INFO: Running Prodigal to identify genes.
[2019-01-07 15:02:28] INFO: GTDB-Tk v0.1.3
[2019-01-07 15:02:28] INFO: gtdbtk classify_wf --cpus 21 --batchfile MAG_paths.txt --out_dir gtdbtk_output --debug
[2019-01-07 15:02:28] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2019-01-07 15:02:28] INFO: Identifying markers in 311 genomes with 21 threads.
[2019-01-07 15:02:28] INFO: Running Prodigal to identify genes.
[2019-01-07 15:07:44] INFO: Identifying TIGRFAM protein families.
[2019-01-07 15:11:08] INFO: Identifying Pfam protein families.
[2019-01-07 15:11:48] INFO: Done.
[2019-01-07 15:11:48] INFO: Aligning markers in 311 genomes with 21 threads.
[2019-01-07 15:11:48] INFO: Processing 311 genomes identified as bacterial.
[2019-01-07 15:11:52] INFO: Read concatenated alignment for 21263 GTDB genomes.
[2019-01-07 15:15:42] INFO: Masking columns of multiple sequence alignment.
[2019-01-07 15:17:04] INFO: Masked alignment from 41155 to 5035 AA.
[2019-01-07 15:17:04] INFO: Creating concatenated alignment for 21574 taxa.
[2019-01-07 15:17:06] INFO: Done.
[2019-01-07 15:17:06] INFO: Placing 311 bacterial genomes into reference tree with pplacer (be patient).

I would really appreciate any insight you could give me into why this might be happening.

Thank you so much, Andrea

donovan-h-parks commented 5 years ago

Hello. How long did you let this step run? It appears it simply has not finished running.

watsonar commented 5 years ago

Hi! I never stop the job; it stops on its own during this step. I've tried three times on separate days to rule out an issue with our server, and it has happened every time.

donovan-h-parks commented 5 years ago

In what sense does it "stop on its own"? Are you returned to the command line?

watsonar commented 5 years ago

I'm running it on a server with clusterize, and when I check the queue it is gone from the queue. When I check the current processes that are running, it is also not listed. The log file created does not give an error or an exit code, which is odd, and otherwise ends with the same messages as the gtdbtk.log file.

...

==> Finished processing 311 of 311 (100.0%) genomes.
[2019-01-07 15:11:48] INFO: Done.
[2019-01-07 15:11:48] INFO: Aligning markers in 311 genomes with 21 threads.
[2019-01-07 15:11:48] INFO: Processing 311 genomes identified as bacterial.
[2019-01-07 15:11:52] INFO: Read concatenated alignment for 21263 GTDB genomes.
[2019-01-07 15:15:42] INFO: Masking columns of multiple sequence alignment.
[2019-01-07 15:17:04] INFO: Masked alignment from 41155 to 5035 AA.
[2019-01-07 15:17:04] INFO: Creating concatenated alignment for 21574 taxa.
[2019-01-07 15:17:06] INFO: Done.
[2019-01-07 15:17:06] INFO: Placing 311 bacterial genomes into reference tree with pplacer (be patient).
donovan-h-parks commented 5 years ago

Can you check the standard error log for your process? There should be some way to access this via your cluster environment. I am guessing the program has crashed with an error message that was written to stderr rather than to the GTDB-Tk log.

watsonar commented 5 years ago

My post above is the last few lines of the standard error log, there was no error code. However, I re-tried running this today with more memory and it worked. Thank you for your help!

donovan-h-parks commented 5 years ago

Glad to hear you have resolved the issue. Thank you for letting me know.

Quicken-up commented 5 years ago

I am having the same problem at the pplacer step while trying to place around 5000 dereplicated MAGs into the tree. My latest job ran out of memory on a 2 TB RAM server; SGE qacct reports maxvmem 1941.981G. I am using GTDB-Tk 0.1.6.

The log file ends with:

[2019-02-07 21:27:00] INFO: GTDB-Tk v0.1.6
[2019-02-07 21:27:00] INFO: gtdbtk classify --genome_dir -x fa --align_dir gtdb_rug2_output --out_dir gtdb_rug2_output --prefix gtdb_rug2 --cpus 40
[2019-02-07 21:27:01] INFO: Placing 4815 bacterial genomes into reference tree with pplacer (be patient).

This is the content of pplacer.bac120.out:

Running pplacer v1.1.alpha19-0-g807f6f3 analysis on gtdb_rug2_output/gtdb_rug2.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
WARNING: your tree has zero pendant branch lengths. This can lead to zero likelihood values which will keep you from being able to place sequences. You can remove identical sequences with seqmagick.
Pre-masking sequences... sequence length cut from 5035 to 5035.
Warning: pplacer results make the most sense when the given tree is multifurcating at the root. See manual for details.
Determining figs... figs disabled.
Allocating memory for internal nodes... done.
Caching likelihood information on reference tree... done.
Pulling exponents... done.
Preparing the edges for baseball... done.

donovan-h-parks commented 5 years ago

Hello. I'm not sure why the memory requirements are so high. We have placed ~10,000 MAGs with GTDB-Tk and have not observed this level of memory requirement. We use a 256 GB machine. Are all your MAGs of a size typical for prokaryotes? Perhaps the high memory requirement is due to an extremely large bin?
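One quick way to screen for an oversized bin is to sort the FASTA files by size; a sketch, where the MAGs/ directory, file names, and sizes are all invented for illustration:

```shell
# Hypothetical setup: a directory of MAG FASTA files (names and sizes invented)
mkdir -p MAGs
head -c 6500000 /dev/zero > MAGs/bin_big.fa    # ~6.5 MB bin
head -c 2000000 /dev/zero > MAGs/bin_small.fa  # ~2 MB bin

# Report sizes in MB, largest first, so an unusually large bin stands out
du -m MAGs/*.fa | sort -rn
```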

Quicken-up commented 5 years ago

Hi Donovan, thanks for the reply. Our largest bin is 6.5M and all others are 6M or less. These are all good-quality MAGs according to CheckM (>80% complete, <10% contamination), and they have been dereplicated using dRep.

Is it possible pplacer goes crazy on memory usage when it has a lot of threads?

donovan-h-parks commented 5 years ago

I typically run GTDB-Tk with 64 CPUs and haven't had memory issues. As a sanity check, I'd try running GTDB-Tk with a single MAG using a single CPU.

Quicken-up commented 5 years ago

Reality check with one 2M MAG on one core with 256G RAM: 68G RAM was required during 'Allocating memory for internal nodes' and 83G from 'Caching likelihood information on reference tree' through 'Preparing the edges for baseball', rising to 162G by the end of pplacer (maxvmem 162.746G to completion). gtdb_test.bac120.summary.tsv is present and populated with data, but this did take a lot of RAM for one small MAG.

For further information, I originally installed using the bioconda recipe, but hit this bug https://github.com/Ecogenomics/GTDBTk/issues/83 and so updated to 0.1.6 as follows: pip install gtdbtk -U

EDIT: I am reinstalling from the latest bioconda recipe as I see it has been updated to 0.1.6. I will run again on the same MAG to compare.

donovan-h-parks commented 5 years ago

Hello. This is what I would expect. The memory usage for pplacer is related to loading the GTDB-Tk reference tree into memory. I do not expect that this should increase substantially with each MAG. If you are able to test this I would appreciate it. I have not explored this formally, but as I indicated I have run >10,000 MAGs through GTDB-Tk on a machine with only 256 GB.

Quicken-up commented 5 years ago

Hello. A clean bioconda install of 0.1.6 produced the same result with the same memory requirement as above. Adding a second MAG bumped the maxvmem to 190G (with --debug).

- With 50 MAGs (48 bacterial, 2 archaeal) on one core, maxvmem during 'Preparing the edges for baseball' was 104G. This jumped to 205G at the Working on (3/48) ... step (I think - possible qstat lag), stayed at that level through the remaining Working on... steps of pplacer.bac120.out, and dropped to <10G during FastANI. The process stopped with this error:

[2019-02-11 13:44:04] INFO: Placing 48 bacterial genomes into reference tree with pplacer (be patient).
[2019-02-11 14:24:02] INFO: Calculating Average Nucleotide Identity using FastANI.
GTDB-Tk has stopped before finishing
local variable 'pplacer_leafnode' referenced before assignment
('Unexpected error:', <type 'exceptions.UnboundLocalError'>)

- Ran the same 50 genomes with 16 cores. This run exceeded its memory limit (>512G) at 'Preparing the edges for baseball... done' (as on my initial run).

- Ran the same 50 genomes with 2 cores and got this potentially more informative error in pplacer:

Preparing the edges for baseball... done.
error (Cannot allocate memory) when trying to fork
error (Cannot allocate memory) when trying to fork
couldn't fork any children; falling back to a single process.

[Strangely, it carried on running after this and completed with maxvmem 103.931G]

So, there seem to be multiple problems:

1) Greater memory usage for the same set of genomes with more cores.
2) local variable 'pplacer_leafnode' referenced before assignment
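For context on the second problem: Python raises UnboundLocalError when a local name is read on a code path that never assigned it. A minimal sketch of the failure mode (the names only echo the bug report; this is not GTDB-Tk's actual code):

```python
def classify(hit_found):
    # The variable is only bound inside the 'if' branch...
    if hit_found:
        pplacer_leafnode = "g__Example"
    # ...so reading it on the other path raises UnboundLocalError,
    # the same class of bug reported above.
    return pplacer_leafnode

try:
    classify(False)
except UnboundLocalError as exc:
    print("caught:", exc)
```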

pplacer --version v1.1.alpha19-0-g807f6f3

FYI - Some of our MAGs were previously published in the 913 genomes paper. I assume this should not cause any problem.

pchaumeil commented 5 years ago

Hello. There is indeed a bug in the classify.py file.

line 675:

                else:
                    if len(sorted_dict) > 0:
                        summary_list[12] = '; '.join(self._formatnote(
                            sorted_dict, [fastani_matching_reference, pplacer_leafnode]))
                    unclassified_user_genomes[userleaf.taxon.label] = summary_list
        return classified_user_genomes, unclassified_user_genomes

should be changed to:

                else:
                    if len(sorted_dict) > 0:
                        summary_list[12] = '; '.join(self._formatnote(
                            sorted_dict, [fastani_matching_reference]))
                    unclassified_user_genomes[userleaf.taxon.label] = summary_list
        return classified_user_genomes, unclassified_user_genomes

This issue has been fixed in #85 and will be available in the next release of GTDB-Tk.

Quicken-up commented 5 years ago

Thanks. After manually editing that line and running with one CPU and 512G RAM on ~5000 MAGs, I hit the error described in https://github.com/Ecogenomics/GTDBTk/issues/78:

sh: line 1: 22950 Killed pplacer -j 1 -c /gtdb_data/release86/pplacer/gtdb_r86_bac120.refpkg -o gtdb_rug2_output/pplacer/pplacer.bac120.json gtdb_rug2_output/gtdb_rug2.bac120.user_msa.fasta > gtdb_rug2_output/pplacer/pplacer.bac120.out
Uncaught exception: Sys_error("gtdb_rug2_output/pplacer/pplacer.bac120.json: No such file or directory")
Fatal error: exception Sys_error("gtdb_rug2_output/pplacer/pplacer.bac120.json: No such file or directory")

Trying again with 768G

donovan-h-parks commented 5 years ago

Hello. Best I can tell, the memory requirement reported for pplacer is inaccurate. If I run time --verbose with 6 CPUs, the maximum resident set size is reported as >400 GB. However, if you watch GTDB-Tk (more accurately, pplacer) via top, the maximum memory usage never exceeds 128 GB. GTDB-Tk also runs to completion without issue on my machine with 256 GB of memory. I realize this isn't a very satisfying explanation, but it appears to be the case from what I am observing.
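The discrepancy is typically virtual address space (what SGE's maxvmem tracks) versus resident set size (what top's RES column shows): a process can reserve large mappings without ever touching most of those pages. A small sketch of reading a process's own peak RSS with Python's stdlib, noting that ru_maxrss is in kilobytes on Linux and bytes on macOS:

```python
import resource

def peak_rss():
    # Peak resident set size of the current process:
    # kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
blob = bytearray(50 * 1024 * 1024)  # allocate and zero-fill ~50 MB; pages become resident
after = peak_rss()
print(f"peak RSS grew from {before} to {after}")
```

Virtual reservations that are never touched do not move this number, which is one way the two accounting views can diverge so widely.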

Quicken-up commented 5 years ago

Hello. Update: pplacer on 1 thread with 768G specified in SGE qsub has finished running with maxvmem=204.824G. This crashed (killed) previously on the exact same submission given 512G RAM, and the most likely servers have 768G. This makes me suspect pplacer was trying to reference memory outside of what was specified in qsub, but I cannot be sure.

pplacer took about 40h on this server for 4815 bacterial genomes. We have gtdb_rug2.bac120.classify.tree and gtdb_rug2.bac120.red_dictionary.tsv. Now it is running fastani, and has been for around 9h. Sadly, I think it might not complete in the 120h I gave the job.

donovan-h-parks commented 5 years ago

Thank you for the update. Something is certainly odd about how pplacer allocates memory and how Linux reports it. Again, my best guess is that Linux/SGE thinks pplacer needs excessive amounts of memory, but that this isn't actually the case when you run it directly on the hardware. The ANI step may not take that long depending on how novel your genomes are.

Quicken-up commented 5 years ago

Classify completed in 54h total with 1 core; maxvmem remained 205G.

donovan-h-parks commented 5 years ago

Glad to hear GTDB-Tk ran to completion. We are looking into alternatives to pplacer to help with memory issues, but there aren't a lot of alternatives.