Closed mdieser closed 5 years ago
So it seems to be a problem with the codeml
output files. Which files do you see in the temporary directory created by wgd ksd
?
That's what i thought. I was also surprised by the low number of gene families. I only have GF files between GF_000001-GF_000080. For each one a .ctrl, .fasta.msa.nuc, .fasta.msa, and a .fasta is generated.
codeml
can be quite a finicky program, I see that your file paths contain a space, and I would not be surprised if codeml would break on that, so maybe try to rename Markus wgd
to Markus_wgd
? (BTW: The tmp files for each family are only generated when they are analyzed, so I think the program simply crashed before the analysis started for gene families after GF_000080)
That did resolve the current problem. Now I'm running into an issue that has been discussed previously. ISSUE #14 NewickError: Unexisting tree file or Malformed newick tree structure. The last few lines of the log output are attached below this text. In the previous issue you indicated that the problem was cased by characters in gene (transcript) names present in a specific GF.fasta.msa.nw. file. Based in the error message shown below, I assumed that there is an issue with GF_000001. However, my temp folder (ks_tmp.378ae285507cac) does not contain any .nw files. Any recommendations?
101 # Average linkage clustering based on Ks
102 logging.debug('Performing average linkage clustering on Ks values.')
...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/phy.py in phylogenetic_tree_to_cluster_format(tree='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', pairwise_estimates= 1469 1545 1752 1764 ....6732 5.7719 0.0000
[74 rows x 74 columns])
118 (only the index is used)
119 :return: clustering data structure, pairwise distances dictionary
120 """
121 id_map = {
122 pairwise_estimates.index[i]: i for i in range(len(pairwise_estimates))}
--> 123 t = Tree(tree)
t = undefined
tree = '/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw'
124
125 # midpoint rooting
126 midpoint = t.get_midpoint_outgroup()
127 if not midpoint: # midpoint = None when their are only two leaves
...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/ete3/coretype/tree.py in __init__(self=Tree node '' (-0x7ffff80fce7afc38), newick='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', format=0, dist=None, support=None, name=None, quoted_node_names=False)
206
207 # Initialize tree
208 if newick is not None:
209 self._dist = 0.0
210 read_newick(newick, root_node = self, format=format,
--> 211 quoted_names=quoted_node_names)
quoted_node_names = False
212
213
214 def __nonzero__(self):
215 return True
...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/ete3/parser/newick.py in read_newick(newick='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', root_node=Tree node '' (-0x7ffff80fce7afc38), format=0, quoted_names=False)
244 nw = nw.strip()
245 if not nw.startswith('(') and nw.endswith(';'):
246 #return _read_node_data(nw[:-1], root_node, "single", matcher, format)
247 return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)
248 elif not nw.startswith('(') or not nw.endswith(';'):
--> 249 raise NewickError('Unexisting tree file or Malformed newick tree structure.')
250 else:
251 return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)
252
253 else:
NewickError: Unexisting tree file or Malformed newick tree structure.
You may want to check other newick loading flags like 'format' or 'quoted_node_names'.
Hi, I could not reproduce the issue, so I'm not sure what is causing this. When running your command on the two largest gene families I get:
λ arzwa-Aspire-S7-392 tmp → wgd ksd --n_threads 2 ./test2.mcl ./Input.fasta.txt
2019-07-17 20:51:02: INFO
2019-07-17 20:51:02: INFO codeml found
2019-07-17 20:51:02: INFO MUSCLE v3.8.31 by Robert C. Edgar
2019-07-17 20:51:02: INFO
2019-07-17 20:51:02: WARNING Output directory exists, will possibly overwrite
2019-07-17 20:51:02: INFO Translating CDS file
100% (6427 of 6427) |############################################| Elapsed Time: 0:00:01 Time: 0:00:01
2019-07-17 20:51:04: WARNING There were 0 warnings during translation
2019-07-17 20:51:04: INFO Started whole paranome Ks analysis
2019-07-17 20:51:04: WARNING Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2019-07-17 20:51:04: WARNING If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter.
2019-07-17 20:51:04: INFO Started analysis in parallel (n_threads = 2)
2019-07-17 20:51:04: INFO Performing analysis on gene family GF_000001
2019-07-17 20:51:04: INFO Performing analysis on gene family GF_000002
2019-07-17 20:54:38: INFO Analysis done
2019-07-17 20:54:38: INFO Making results data frame
2019-07-17 20:54:38: INFO Removing tmp directory
2019-07-17 20:54:38: INFO Computing weights, outlier cut-off at Ks > 5
2019-07-17 20:54:39: INFO Generating plots
2019-07-17 20:54:39: INFO Will plot **node-weighted** histograms
2019-07-17 20:54:41: INFO Done
So I suspect it has something to do with the external software called by wgd
. Could you try a different weighting method, like average linkage clustering (using --wm alc
)? (This does not rely on an external tree inference program).
BTW: while trying to reproduce you're issue I caught a different bug more downstream, due to the numerical gene IDs you use. I just pushed a new version that fixes the bug to github, so you might want to update your wgd version.
I truly appreciate all your help and advice! Your suggestion to include --wm alc definitely increased the length of processed GF before I run into another error. My IT administrator will update the wgd version. I will be field sampling for the next two weeks without access to my data. Can we please leave this ticket open until I get back for further testing?
Sure!
Hi - I think that my previous error message was partially related to the lack of space on my server to create temp files. That could explain why you were unable to reproduce it. General question,- how many gene families are there? As you can see below analysis stops at 877. The new error message is 'FloatingPointError: NaN dissimilarity value'. Any advice?
2019-08-01 11:14:44: INFO Performing analysis on gene family GF_000876
2019-08-01 11:14:44: INFO Performing analysis on gene family GF_000877
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
return self.func(*args, **kwargs)
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 303, in analyse_family
results_dict, msa=msa_path_protein, method=method)
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 103, in _weighting
clustering = average_linkage_clustering(pairwise_estimates['Ks'])
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/phy.py", line 171, in average_linkage_clustering
clustering = fastcluster.average(pairwise_estimates)
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/fastcluster.py", line 52, in average
return linkage(D, method='average')
File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/fastcluster.py", line 247, in linkage
linkage_wrap(N, X, Z, mthidx[method])
FloatingPointError: NaN dissimilarity value.
Sorry, I was traveling. This looks like a bug. I think some family has some invalid Ks estimate, breaking the average linkage clustering using fastcluster. I'll have a look and get back to you ASAP.
Hi, I just pushed a bug fix to github, so please install the latest commit. The problem was in some NaN Ks estimates for gene family 28 (see GF_000028.codeml
in your ks tmp directory).
Thanks for fixing the bug. The script finishes without an error. After reading several of your latest manuscripts it appears that wgd has mainly been tested on plant genomes. I am running wgd on bacterial genomes and was wondering about its applicability? I am particularly interested in gene duplication as an evolutionary response in bacteria.
Hi, I'd like to keep this github section for issue tracking. I'll close this issue and I suggest we can further communicate via email?
Hi - Thank you for making this analysis tool available. I am running a analysis on a genome of a bacterial environmental isolate. Similar to another issue the ksd command results in multiple errors. I used the .mcl file and the fasta file containing nucleotide sequences as an input for this command. I also tried the --verbosity debug option. Any idea what might cause the problem? I have also attached my input files.
Thanks for your time,
Markus
Here are the commands I used:
These are some of the errors I get which ultimately result in a 'TypeError: 'NoneType' object is not iterable' error
Input.fasta.blast.tsv.mcl.txt Input.fasta.txt