geronimp / graftM

GraftM - Rapid community profiles from metagenomes
http://geronimp.github.io/graftM/
GNU General Public License v3.0

graftM create_why error in building tree? #249

Open snmrna opened 6 years ago

snmrna commented 6 years ago

Hi, I ran graftM create with the intention of creating a gpkg for a list of protein sequences and got the following error:

04/15/2018 03:46:24 PM INFO: Building gpkg for GUS923.gpkg
04/15/2018 03:46:24 PM INFO: Building seqinfo and taxonomy file from input taxonomy
04/15/2018 03:46:24 PM INFO: Checking for duplicate sequences
04/15/2018 03:46:24 PM INFO: Aligning sequences to create aligned FASTA file
04/15/2018 03:46:48 PM INFO: Building HMM from alignment
04/15/2018 03:46:56 PM INFO: Filtered 0 short sequences from the alignment
04/15/2018 03:46:56 PM INFO: 923 sequences remaining
04/15/2018 03:46:56 PM INFO: Checking for incorrect or fragmented reads
04/15/2018 03:47:23 PM INFO: Building HMM from alignment
04/15/2018 03:47:32 PM INFO: Filtered 0 short sequences from the alignment
04/15/2018 03:47:32 PM INFO: 923 sequences remaining
04/15/2018 03:47:33 PM INFO: Deduplicating sequences
04/15/2018 03:47:33 PM INFO: Removed 47 sequences as duplicates, leaving 876 non-identical sequences
04/15/2018 03:47:33 PM INFO: Building tree
Traceback (most recent call last):
  File "/usr/local/bin/graftM", line 4, in <module>
    __import__('pkg_resources').run_script('graftm==0.11.1', 'graftM')
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 1527, in run_script
    exec(code, namespace, namespace)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/EGG-INFO/scripts/graftM", line 410, in <module>
    Run(args).main()
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/run.py", line 657, in main
    threads = self.args.threads
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/create.py", line 730, in main
    self.fasttree)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/create.py", line 220, in _build_tree
    extern.run(cmd)
  File "build/bdist.macosx-10.13-intel/egg/extern/__init__.py", line 46, in run
extern.ExternCalledProcessError: Command fasttree -quiet -log /var/folders/yn/9h2_d_7556970rsv05lmd6p80000gn/T/tmpP9qa3Y/GUS923.tre.log -out /var/folders/yn/9h2_d_7556970rsv05lmd6p80000gn/T/tmpP9qa3Y/GUS923.tre /var/folders/yn/9h2_d_7556970rsv05lmd6p80000gn/T/tmpP9qa3Y/GUS923_deduplicated_aligned.fasta returned non-zero exit status -11.
STDERR was: Ignored unknown character X (seen 12 times)
STDOUT was:

Does anyone know the reason and how to fix it? Thanks in advance.

wwood commented 6 years ago

Hi,

GraftM isn't regularly tested on OSX, so there is a possibility it is that.

But, the problem does seem to be fasttree specific. Would you mind running something like this to test please?

mafft input_proteins.faa >aligned.faa
fasttree -log fasttree.log -out fasttree.tree aligned.faa
echo $?

Thanks, ben

snmrna commented 6 years ago

Hi wwood, thanks very much for your quick reply! I tested according to your suggestions and got the following output:

weibintekiMacBook-Air:GraftM weibin$ mafft GUS.fasta >aligned.fasta

nthread = 0 nthreadpair = 0 nthreadtb = 0 stacksize: 8192 kb Gap Penalty = -1.53, +0.00, +0.00

Making a distance matrix ..

There are 170 ambiguous characters. 901 / 923 done.

Constructing a UPGMA tree (efffree=0) ... 920 / 923 done.

Progressive alignment 1/2... STEP 801 / 922 f Reallocating..done. alloclen = 6277 STEP 901 / 922 h Reallocating..done. alloclen = 7428

done.

Making a distance matrix from msa.. 900 / 923 done.

Constructing a UPGMA tree (efffree=1) ... 920 / 923 done.

Progressive alignment 2/2... STEP 901 / 922 h Reallocating..done. *alloclen = 5961

Reallocating..done. *alloclen = 8199

done.

disttbfast (aa) Version 7.394 alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0 0 thread(s)

Strategy: FFT-NS-2 (Fast but rough) Progressive method (guide trees were built 2 times.)

If unsure which option to use, try 'mafft --auto input > output'. For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct). It tends to insert more gaps into gap-rich regions than previous versions. To disable this change, add the --leavegappyregion option.

weibintekiMacBook-Air:GraftM weibin$ fasttree -log fasttree.log -out fasttree.tree aligned.fasta
FastTree Version 2.1.10 SSE3
Alignment: aligned.fasta
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character X (seen 170 times)
Segmentation fault: 11
weibintekiMacBook-Air:GraftM weibin$ echo $?
139

Do you know how to fix it? Sorry for my lack of experience. @wwood I did succeed in creating a gpkg when I replaced the amino acid sequences in the FASTA file with gene sequences.

wwood commented 6 years ago

Thanks for running that. It indeed points to an issue with fasttree, rather than the GraftM code itself not working on OSX for some reason.

I'm not sure why, but fasttree is crashing on your tree, here:

Ignored unknown character X (seen 170 times)
Segmentation fault: 11
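The two exit statuses reported in this thread (-11 in the GraftM traceback, 139 from `echo $?`) most likely describe the same event: a process killed by signal 11 (SIGSEGV). Python's subprocess module reports a signal death as a negative number, while POSIX shells report 128 plus the signal number. The sketch below decodes both conventions. It is written for Python 3 (the thread itself uses Python 2.7), and `describe_exit_status` is a hypothetical helper, not part of GraftM or extern.

```python
import signal


def describe_exit_status(status):
    """Describe an exit status under two common conventions.

    Negative values follow Python's subprocess convention (-N means
    "killed by signal N"); values above 128 follow the POSIX shell
    convention (128 + N). Hypothetical helper for illustration only.
    """
    if status < 0:
        signum = -status
    elif status > 128:
        signum = status - 128
    else:
        return "exited with code %d" % status
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = "signal %d" % signum
    return "killed by %s" % name


# The two statuses that appear in this thread.
print(describe_exit_status(-11))
print(describe_exit_status(139))
```

On macOS and Linux both calls report SIGSEGV, which is consistent with the "Segmentation fault: 11" line above.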

I've not seen this issue before. You are already running the newest version of FastTree, so updating isn't going to fix it.

I suspect then there is either something wrong with the way fasttree was compiled, or something particular to your sequences e.g. a sequence that is made up exclusively of X characters or perhaps something more subtle. Perhaps try making the tree on linux, or removing sequences from the alignment until it no longer segfaults.
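One way to check the "sequence made up exclusively of X characters" idea before removing sequences by hand is to scan the alignment for records with no informative residues. The following is a minimal sketch in Python 3, assuming a plain FASTA alignment; `read_fasta` and `uninformative_records` are hypothetical helper names, and treating `X`, `-` and `.` as uninformative is an assumption, not something FastTree documents as the cause of the crash.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uninformative_records(records, ignored="Xx-."):
    """Return headers whose sequence consists only of ignored characters.

    str.strip(ignored) removes ignored characters from both ends, so the
    result is empty exactly when nothing else is present in the sequence.
    """
    return [header for header, seq in records if not seq.strip(ignored)]


# Small made-up alignment for illustration (not data from this thread).
example = """>seq1
MK-XLV
>seq2
XX--XX
>seq3
------
""".splitlines()
print(uninformative_records(read_fasta(example)))  # ['seq2', 'seq3']
```

To run it on a real file, pass an open file handle (for example the aligned.fasta produced by mafft above) to `read_fasta` instead of the example lines. An empty result would suggest the crash has a subtler cause.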

Good luck. ben

snmrna commented 6 years ago

Thanks for your suggestions! It was a problem with fasttree. I fixed it by recompiling FastTree with a different command:

gcc -DNO_SSE -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm

But I still have another question. I have successfully installed graftm on a Windows notebook (checked in Python) and added its path to the environment variables, but I cannot run graftm in cmd. Do you know how to set up my notebook to run graftm?

snmrna commented 6 years ago

Sorry to bother you again. @wwood When I created my own gpkg on a Mac (input file: ~900 amino acid sequences), the gpkg for my proteins seems to have been built, but an error is still reported at the step "Testing gpkg package works". Do you know what is wrong and how to fix it? Here is the report:

04/16/2018 04:48:39 PM INFO: Building gpkg for GUS923.gpkg
04/16/2018 04:48:39 PM INFO: Building seqinfo and taxonomy file from input taxonomy
04/16/2018 04:48:39 PM INFO: Checking for duplicate sequences
04/16/2018 04:48:39 PM INFO: Aligning sequences to create aligned FASTA file
04/16/2018 04:48:55 PM INFO: Building HMM from alignment
04/16/2018 04:49:02 PM INFO: Filtered 0 short sequences from the alignment
04/16/2018 04:49:02 PM INFO: 923 sequences remaining
04/16/2018 04:49:02 PM INFO: Checking for incorrect or fragmented reads
04/16/2018 04:49:17 PM INFO: Building HMM from alignment
04/16/2018 04:49:24 PM INFO: Filtered 0 short sequences from the alignment
04/16/2018 04:49:24 PM INFO: 923 sequences remaining
04/16/2018 04:49:24 PM INFO: Deduplicating sequences
04/16/2018 04:49:24 PM INFO: Removed 34 sequences as duplicates, leaving 889 non-identical sequences
04/16/2018 04:49:24 PM INFO: Building tree
04/16/2018 04:51:32 PM INFO: Building seqinfo and taxonomy file from input taxonomy
04/16/2018 04:51:32 PM INFO: Creating reference package
04/16/2018 04:51:32 PM INFO: Attempting to run taxit create with rerooting capabilities
04/16/2018 04:51:34 PM INFO: Creating diamond database
04/16/2018 04:51:34 PM INFO: Compiling gpkg
04/16/2018 04:51:34 PM INFO: Cleaning up
04/16/2018 04:51:34 PM INFO: Testing gpkg package works
Traceback (most recent call last):
  File "/usr/local/bin/graftM", line 4, in <module>
    __import__('pkg_resources').run_script('graftm==0.11.1', 'graftM')
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 1527, in run_script
    exec(code, namespace, namespace)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/EGG-INFO/scripts/graftM", line 410, in <module>
    Run(args).main()
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/run.py", line 657, in main
    threads = self.args.threads
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/create.py", line 856, in main
    self._test_package(output_gpkg_path)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/create.py", line 537, in _test_package
    extern.run(cmd)
  File "build/bdist.macosx-10.13-intel/egg/extern/__init__.py", line 46, in run
extern.ExternCalledProcessError: Command graftM graft --forward /var/folders/yn/9h2_d_7556970rsv05lmd6p80000gn/T/tmpv3rALm.fa --graftm_package GUS923.gpkg --output_directory /var/folders/yn/9h2_d_7556970rsv05lmd6p80000gn/T/tmpEh0HcE --force returned non-zero exit status 1.
STDERR was: 04/16/2018 04:51:37 PM INFO: Working on tmpv3rALm
Traceback (most recent call last):
  File "/usr/local/bin/graftM", line 4, in <module>
    __import__('pkg_resources').run_script('graftm==0.11.1', 'graftM')
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 750, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/weibin/Library/Python/2.7/lib/python/site-packages/pkg_resources/__init__.py", line 1527, in run_script
    exec(code, namespace, namespace)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/EGG-INFO/scripts/graftM", line 410, in <module>
    Run(args).main()
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/run.py", line 588, in main
    self.graft()
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/run.py", line 377, in graft
    diamond_db
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/sequence_searcher.py", line 822, in aa_db_search
    hit_reads_orfs_fasta)
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/sequence_searcher.py", line 881, in search_and_extract_orfs_matching_protein_database
    unpack.sequence_type(),
  File "/Library/Python/2.7/site-packages/graftm-0.11.1-py2.7.egg/graftm/unpack_sequences.py", line 91, in sequence_type
    _, seq = tuple(first_seq.strip().split('\n'))
ValueError: need more than 1 value to unpack
STDOUT was:

wwood commented 6 years ago

Hi,

That seems like it might be a proper bug with GraftM. Are you able to send me the sequences you are trying to work with and the taxonomy file you used please? Just to my email which you can see at http://ecogenomic.org/personnel/dr-ben-woodcroft

Thanks, ben
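For anyone reading the last traceback: the failing expression expects the first record to be exactly two lines, a header and a sequence. The sketch below isolates that one expression in Python 3 to show which inputs make it raise. `split_record` is a hypothetical stand-in that only mirrors the line shown in the traceback; how GraftM obtains `first_seq`, and which of these cases actually occurred here, is not known from the thread.

```python
def split_record(first_seq):
    """Mirror the unpacking from the traceback's last frame.

    Hypothetical helper: it copies only the expression shown in the
    traceback, not the surrounding code in unpack_sequences.py.
    """
    _, seq = tuple(first_seq.strip().split("\n"))
    return seq


# A two-line record (header plus sequence) unpacks cleanly.
print(split_record(">read1\nMKLV\n"))

# Made-up inputs that do not split into exactly two lines raise ValueError:
# a header with no sequence, an empty string, and a wrapped sequence.
for bad in (">read1\n", "", ">read1\nMK\nLV\n"):
    try:
        split_record(bad)
    except ValueError as err:
        print(repr(bad), "->", err)
```

The "need more than 1 value to unpack" wording in the thread is the Python 2.7 form of the error raised by the first two cases (Python 3 words it differently). That would be consistent with the temporary test file being empty or containing a header without a sequence, though this is only a guess.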