Hello. How long did you let this step run? It appears to simply not have finished running.
Hi! I never stop the job; it simply stops on its own during this step. I've tried three times on separate days to rule out an issue with our server, and it has happened every time.
In what sense does it "stop on its own"? Are you returned to the command line?
I'm running it on a server with clusterize, and when I check the queue the job is gone. It is also not listed among the running processes. Oddly, the log file gives no error or exit code, and otherwise ends with the same messages as the gtdbtk.log file.
...
==> Finished processing 311 of 311 (100.0%) genomes.
[2019-01-07 15:11:48] INFO: Done.
[2019-01-07 15:11:48] INFO: Aligning markers in 311 genomes with 21 threads.
[2019-01-07 15:11:48] INFO: Processing 311 genomes identified as bacterial.
[2019-01-07 15:11:52] INFO: Read concatenated alignment for 21263 GTDB genomes.
[2019-01-07 15:15:42] INFO: Masking columns of multiple sequence alignment.
[2019-01-07 15:17:04] INFO: Masked alignment from 41155 to 5035 AA.
[2019-01-07 15:17:04] INFO: Creating concatenated alignment for 21574 taxa.
[2019-01-07 15:17:06] INFO: Done.
[2019-01-07 15:17:06] INFO: Placing 311 bacterial genomes into reference tree with pplacer (be patient).
Can you check the standard error log for your process? There should be some way to access this via your cluster environment. I am guessing the program has crashed with an error message that was written to stderr and not the GTDB-Tk log.
My post above shows the last few lines of the standard error log; there was no error code. However, I re-ran this today with more memory and it worked. Thank you for your help!
Glad to hear you have resolved the issue. Thank you for letting me know.
I am having the same problem at the pplacer step while trying to place around 5000 dereplicated MAGs into the tree. My latest job ran out of memory on a 2 TB RAM server (SGE qacct reports maxvmem 1941.981G). I am using 0.1.6.
The log file ends with
[2019-02-07 21:27:00] INFO: GTDB-Tk v0.1.6
[2019-02-07 21:27:00] INFO: gtdbtk classify --genome_dir
This is the content of pplacer.bac120.out:
Running pplacer v1.1.alpha19-0-g807f6f3 analysis on gtdb_rug2_output/gtdb_rug2.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
WARNING: your tree has zero pendant branch lengths. This can lead to zero likelihood values which will keep you from being able to place sequences. You can remove identical sequences with seqmagick.
Pre-masking sequences... sequence length cut from 5035 to 5035.
Warning: pplacer results make the most sense when the given tree is multifurcating at the root. See manual for details.
Determining figs... figs disabled.
Allocating memory for internal nodes... done.
Caching likelihood information on reference tree... done.
Pulling exponents... done.
Preparing the edges for baseball... done.
Hello. I'm not sure why the memory requirements are so high. We have placed ~10,000 MAGs with GTDB-Tk and have not observed this level of memory requirement. We use a 256 GB machine. Are all your MAGs of a size typical for prokaryotes? Perhaps the high memory requirement is due to an extremely large bin?
Hi Donovan, Thanks for the reply. Our largest bin is 6.5M and all others are 6M or less. These are all good-quality MAGs according to CheckM (>80% complete, <10% contamination) and have been dereplicated with dRep.
Is it possible pplacer goes crazy on memory usage when it has a lot of threads?
I typically run GTDB-Tk with 64 CPUs and haven't had memory issues. As a sanity check, I'd try running GTDB-Tk with a single MAG using a single CPU.
Reality check: one 2M MAG on one core with 256G RAM. 'Allocating memory for internal nodes' required 68G RAM; this rose to 83G during 'Caching likelihood information on reference tree' and held through 'Preparing the edges for baseball', then up to 162G by the end of pplacer (maxvmem 162.746G to completion). gtdb_test.bac120.summary.tsv is present and populated with data, but this is a lot of RAM for one small MAG.
For further information, I originally installed using the bioconda recipe, but hit this bug (https://github.com/Ecogenomics/GTDBTk/issues/83) and so updated to 0.1.6 with pip install gtdbtk -U. EDIT: I am reinstalling from the latest bioconda recipe, as I see it has been updated to 0.1.6, and will run again on the same MAG to compare.
Hello. This is what I would expect. pplacer's memory usage is dominated by loading the GTDB-Tk reference tree, so I do not expect it to increase substantially with each additional MAG. If you are able to test this I would appreciate it. I have not explored this formally, but as I indicated, I have run >10,000 MAGs through GTDB-Tk on a machine with only 256 GB.
Hello. A clean bioconda install of 0.1.6 produced the same result with the same memory requirement as above. Adding a second MAG bumped the maxvmem to 190G (with --debug).
- With 50 MAGs (48 BAC, 2 ARCH) on one core, maxvmem during 'Preparing the edges for baseball' was 104G. This jumped to 205G during the 'Working on' stage.
- Ran the same 50 genomes with 16 cores. This run exceeded its memory (>512G) at 'Preparing the edges for baseball... done' (as on my initial run).
- Ran the same 50 genomes with 2 cores and got a potentially more informative error from pplacer: 'Preparing the edges for baseball... done. error (Cannot allocate memory) when trying to fork. error (Cannot allocate memory) when trying to fork. couldn't fork any children; falling back to a single process.' [Strangely, it carried on running after this and completed with maxvmem 103.931G.]
So there seem to be multiple problems: 1) greater memory usage for the same set of genomes with more cores; 2) local variable 'pplacer_leafnode' referenced before assignment.
pplacer --version v1.1.alpha19-0-g807f6f3
FYI - Some of our MAGs were previously published in the 913 genomes paper. I assume this should not cause any problem.
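A plausible mechanism for problem 1, sketched below (speculation on my part; the 512 MB buffer and four workers are arbitrary stand-ins, and I have not confirmed this is what pplacer actually does): SGE's vmem accounting sums virtual memory over every process in a job, so a tool that forks one worker per CPU after loading a large reference leaves each fork carrying the parent's full virtual size, even though the physical pages are shared copy-on-write.

import os
from multiprocessing import Pool

# Stand-in for a large in-memory reference (the real tree is far bigger).
reference = bytearray(512 * 1024**2)

def work(i):
    # Each forked worker shares `reference` copy-on-write, yet its own
    # /proc vmem still includes the full 512 MB, so a scheduler that sums
    # vmem over processes sees roughly (workers + 1) x the parent's size.
    return os.getpid()

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # assumes Linux's fork start method
        print(pool.map(work, range(4)))

Under that kind of accounting, doubling the cores roughly doubles the reported vmem without any extra physical memory actually being used.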
Hello, There is indeed a bug in the classify.py file.
line 675:
else:
    if len(sorted_dict) > 0:
        summary_list[12] = '; '.join(self._formatnote(
            sorted_dict, [fastani_matching_reference, pplacer_leafnode]))
    unclassified_user_genomes[userleaf.taxon.label] = summary_list
return classified_user_genomes, unclassified_user_genomes
should be changed to:
else:
    if len(sorted_dict) > 0:
        summary_list[12] = '; '.join(self._formatnote(
            sorted_dict, [fastani_matching_reference]))
    unclassified_user_genomes[userleaf.taxon.label] = summary_list
return classified_user_genomes, unclassified_user_genomes
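For context, the failure mode is plain Python scoping, shown in a minimal standalone sketch (hypothetical function and values, not the GTDB-Tk code itself): because pplacer_leafnode is assigned somewhere in the function, Python treats it as a local everywhere in that function, so any branch that never assigns it raises the error on reference.

def classify(placement_found):
    if placement_found:
        pplacer_leafnode = 'ref_genome_A'  # bound only on this path
        return pplacer_leafnode
    # Python still treats pplacer_leafnode as a local here, so this raises
    # UnboundLocalError: local variable 'pplacer_leafnode' referenced
    # before assignment.
    return pplacer_leafnode

classify(False)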
This issue has been fixed in #85 and will be available in the next release of GTDB-Tk.
Thanks. After manually editing that line and running with one CPU and 512G RAM on ~5000 MAGs, I hit the error described here:
https://github.com/Ecogenomics/GTDBTk/issues/78
sh: line 1: 22950 Killed pplacer -j 1 -c
Trying again with 768G
Hello. Best I can tell, the memory requirement reported by pplacer is inaccurate. If I run it under time --verbose with 6 CPUs, the Maximum resident set size is reported as >400 GB. However, if you watch GTDB-Tk (more accurately, pplacer) via top, the maximum memory usage never exceeds 128 GB. GTDB-Tk also runs to completion without issue on my machine with 256 GB of memory. I realize this isn't too satisfying an explanation, but it appears to be the case from what I am observing.
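One way these measurements can diverge, shown as a minimal Linux-only sketch (the 4 GB reservation is an arbitrary illustration, not pplacer's actual allocation pattern): scheduler-style vmem figures track peak virtual size (VmPeak in /proc), while top's RES column is resident memory (VmRSS), and address space that is reserved but never written inflates only the former.

import mmap

def vm_stats():
    # Read peak virtual size (VmPeak) and current resident size (VmRSS),
    # both in kB, from /proc for the current process.
    stats = {}
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith(('VmPeak', 'VmRSS')):
                key, value = line.split(':')
                stats[key] = int(value.split()[0])
    return stats

print('before:', vm_stats())
region = mmap.mmap(-1, 4 * 1024**3)  # reserve 4 GB without touching the pages
print('after: ', vm_stats())  # VmPeak jumps by ~4 GB; VmRSS barely moves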
Hello. Update: pplacer on 1 thread, with 768G specified in the SGE qsub, has finished running with maxvmem=204.824G. The exact same submission previously crashed (killed) when given 512G RAM, and the servers it most likely landed on have 768G. This makes me suspect pplacer was trying to reference memory beyond what was specified in qsub, but I cannot be sure.
pplacer took about 40h on this server for 4815 bacterial genomes. We have gtdb_rug2.bac120.classify.tree and gtdb_rug2.bac120.red_dictionary.tsv. It is now running fastani, and has been for around 9h; sadly, I think it might not complete in the 120h I gave the job.
Thank you for the update. Something is certainly weird about how pplacer allocates memory and how Linux reports this. Again, my best guess is that Linux/SGE thinks pplacer needs excessive amounts of memory, but that this isn't actually the case if you run it directly on the hardware. The ANI step may not take that long, depending on how novel your genomes are.
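If your site enforces the qsub request as an address-space limit, which is a common way SGE implements h_vmem (an assumption about your cluster, as this is site-configurable), the sketch below shows why a job can be refused memory, or killed, once its virtual size crosses the cap, even when its resident usage is far smaller.

import resource

cap = 1 * 1024**3  # pretend the scheduler granted 1 GB
resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

try:
    buf = bytearray(2 * 1024**3)  # ask for 2 GB of address space
except MemoryError:
    print('allocation refused: virtual size would exceed the cap,')
    print('regardless of how much of it would ever become resident')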
Classify completed in 54h total with 1 core; maxvmem remained 205G.
Glad to hear GTDB-Tk ran to completion. We are looking into alternatives to pplacer to help with memory issues, but there aren't a lot of alternatives.
Hello,
I am trying to run the following command:
gtdbtk classify_wf --cpus 13 --batchfile MAG_paths.txt --out_dir gtdbtk_output
to assign taxonomy to a group of MAGs. However, the process always stops during the pplacer step and does not give any errors. The only file generated in the pplacer folder is pplacer.bac120.out. All the appropriate files for the identify and align steps are generated (though gtdbtk.bac120.filtered.tsv is empty), but none of the classify-step files are produced (according to the expected output files described here). I would really appreciate any insight you could give me into why this might be happening.
Thank you so much, Andrea