arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0
81 stars 41 forks source link

Problem with ksd #18

Closed mdieser closed 5 years ago

mdieser commented 5 years ago

Hi - Thank you for making this analysis tool available. I am running a analysis on a genome of a bacterial environmental isolate. Similar to another issue the ksd command results in multiple errors. I used the .mcl file and the fasta file containing nucleotide sequences as an input for this command. I also tried the --verbosity debug option. Any idea what might cause the problem? I have also attached my input files.

Thanks for your time,

Markus

Here are the commands I used:

wgd mcl --cds --mcl -s Input.fasta -o Input.wgd -n 40
wgd ksd --n_threads 40 Input.fasta.blast.tsv.mcl Input.fasta
wgd --verbosity debug ksd --n_threads 40 Input.fasta.blast.tsv.mcl Input.fastata

These are some of the errors I get which ultimately result in a 'TypeError: 'NoneType' object is not iterable' error

2019-07-15 22:20:53: INFO   codeml found
2019-07-15 22:20:53: INFO   MUSCLE v3.8.1551 by Robert C. Edgar
2019-07-15 22:20:53: INFO   
2019-07-15 22:20:53: WARNING    Output directory exists, will possibly overwrite
2019-07-15 22:20:53: INFO   Translating CDS file
100% (6427 of 6427) |######################################################################################| Elapsed Time: 0:00:01 Time:  0:00:01
2019-07-15 22:20:55: WARNING    There were 0 warnings during translation
2019-07-15 22:20:55: INFO   Started whole paranome Ks analysis
2019-07-15 22:20:55: WARNING    Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2019-07-15 22:20:55: WARNING    If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter. 
2019-07-15 22:20:55: INFO   Started analysis in parallel (n_threads = 40)
2019-07-15 22:20:55: INFO   Performing analysis on gene family GF_000001
2019-07-15 22:20:55: INFO   Performing analysis on gene family GF_000002
2019-07-15 22:20:55: INFO   Performing analysis on gene family GF_000003
2019-07-15 22:20:55: INFO   Performing analysis on gene family GF_000004
........
........
2019-07-15 22:20:56: INFO   Performing analysis on gene family GF_000040
2019-07-15 22:20:57: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000028.codeml not found
2019-07-15 22:20:57: INFO   Performing analysis on gene family GF_000041
2019-07-15 22:20:57: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000020.codeml not found
2019-07-15 22:20:57: INFO   Performing analysis on gene family GF_000042
2019-07-15 22:20:58: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000038.codeml not found
2019-07-15 22:20:58: INFO   Performing analysis on gene family GF_000043
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000032.codeml not found
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000017.codeml not found
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000044
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000045
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000025.codeml not found
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000029.codeml not found
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000035.codeml not found
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000046
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000040.codeml not found
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000047
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000030.codeml not found
2019-07-15 22:20:59: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/TEXT/GF_000041.codeml not found
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000048
2019-07-15 22:20:59: INFO   Performing analysis on gene family GF_000049
........
........
2019-07-15 21:24:26: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000008.codeml not found
2019-07-15 21:24:29: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000010.codeml not found
2019-07-15 21:24:30: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000054.codeml not found
2019-07-15 21:24:35: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000005.codeml not found
2019-07-15 21:24:47: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000007.codeml not found
2019-07-15 21:25:12: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000004.codeml not found
2019-07-15 21:26:00: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000002.codeml not found
2019-07-15 21:26:26: WARNING    Codeml output file /media/LargeStorage/Markus wgd/CG23_2.wgd/ks_tmp.3789a71badfa0c/GF_000001.codeml not found
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 287, in analyse_family
    os.path.basename(msa_path), preserve=preserve, times=times)
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/software/miniconda2/envs/wgd/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 359, in __call__
    raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
.
.
.

Input.fasta.blast.tsv.mcl.txt Input.fasta.txt

arzwa commented 5 years ago

So it seems to be a problem with the codeml output files. Which files do you see in the temporary directory created by wgd ksd?

mdieser commented 5 years ago

That's what i thought. I was also surprised by the low number of gene families. I only have GF files between GF_000001-GF_000080. For each one a .ctrl, .fasta.msa.nuc, .fasta.msa, and a .fasta is generated.

arzwa commented 5 years ago

codeml can be quite a finicky program, I see that your file paths contain a space, and I would not be surprised if codeml would break on that, so maybe try to rename Markus wgd to Markus_wgd? (BTW: The tmp files for each family are only generated when they are analyzed, so I think the program simply crashed before the analysis started for gene families after GF_000080)

mdieser commented 5 years ago

That did resolve the current problem. Now I'm running into an issue that has been discussed previously. ISSUE #14 NewickError: Unexisting tree file or Malformed newick tree structure. The last few lines of the log output are attached below this text. In the previous issue you indicated that the problem was cased by characters in gene (transcript) names present in a specific GF.fasta.msa.nw. file. Based in the error message shown below, I assumed that there is an issue with GF_000001. However, my temp folder (ks_tmp.378ae285507cac) does not contain any .nw files. Any recommendations?

    101         # Average linkage clustering based on Ks
    102         logging.debug('Performing average linkage clustering on Ks values.')

...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/phy.py in phylogenetic_tree_to_cluster_format(tree='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', pairwise_estimates=          1469      1545      1752      1764    ....6732    5.7719    0.0000

[74 rows x 74 columns])
    118         (only the index is used)
    119     :return: clustering data structure, pairwise distances dictionary
    120     """
    121     id_map = {
    122         pairwise_estimates.index[i]: i for i in range(len(pairwise_estimates))}
--> 123     t = Tree(tree)
        t = undefined
        tree = '/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw'
    124 
    125     # midpoint rooting
    126     midpoint = t.get_midpoint_outgroup()
    127     if not midpoint:  # midpoint = None when their are only two leaves

...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/ete3/coretype/tree.py in __init__(self=Tree node '' (-0x7ffff80fce7afc38), newick='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', format=0, dist=None, support=None, name=None, quoted_node_names=False)
    206 
    207         # Initialize tree
    208         if newick is not None:
    209             self._dist = 0.0
    210             read_newick(newick, root_node = self, format=format,
--> 211                         quoted_names=quoted_node_names)
        quoted_node_names = False
    212 
    213 
    214     def __nonzero__(self):
    215         return True

...........................................................................
/software/miniconda2/envs/wgd/lib/python3.6/site-packages/ete3/parser/newick.py in read_newick(newick='/media/LargeStorage/Markus_wgd/CG23_2.wgd/ks_tmp.378ae285507cac/GF_000001.fasta.msa.nw', root_node=Tree node '' (-0x7ffff80fce7afc38), format=0, quoted_names=False)
    244         nw = nw.strip()
    245         if not nw.startswith('(') and nw.endswith(';'):
    246             #return _read_node_data(nw[:-1], root_node, "single", matcher, format)
    247             return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)
    248         elif not nw.startswith('(') or not nw.endswith(';'):
--> 249             raise NewickError('Unexisting tree file or Malformed newick tree structure.')
    250         else:
    251             return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)
    252 
    253     else:

NewickError: Unexisting tree file or Malformed newick tree structure.
You may want to check other newick loading flags like 'format' or 'quoted_node_names'.
arzwa commented 5 years ago

Hi, I could not reproduce the issue, so I'm not sure what is causing this. When running your command on the two largest gene families I get:

λ arzwa-Aspire-S7-392 tmp → wgd ksd --n_threads 2 ./test2.mcl ./Input.fasta.txt 
2019-07-17 20:51:02: INFO   
2019-07-17 20:51:02: INFO   codeml found
2019-07-17 20:51:02: INFO   MUSCLE v3.8.31 by Robert C. Edgar
2019-07-17 20:51:02: INFO   
2019-07-17 20:51:02: WARNING    Output directory exists, will possibly overwrite
2019-07-17 20:51:02: INFO   Translating CDS file
100% (6427 of 6427) |############################################| Elapsed Time: 0:00:01 Time:  0:00:01
2019-07-17 20:51:04: WARNING    There were 0 warnings during translation
2019-07-17 20:51:04: INFO   Started whole paranome Ks analysis
2019-07-17 20:51:04: WARNING    Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2019-07-17 20:51:04: WARNING    If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter. 
2019-07-17 20:51:04: INFO   Started analysis in parallel (n_threads = 2)
2019-07-17 20:51:04: INFO   Performing analysis on gene family GF_000001
2019-07-17 20:51:04: INFO   Performing analysis on gene family GF_000002
2019-07-17 20:54:38: INFO   Analysis done
2019-07-17 20:54:38: INFO   Making results data frame
2019-07-17 20:54:38: INFO   Removing tmp directory
2019-07-17 20:54:38: INFO   Computing weights, outlier cut-off at Ks > 5
2019-07-17 20:54:39: INFO   Generating plots
2019-07-17 20:54:39: INFO   Will plot **node-weighted** histograms
2019-07-17 20:54:41: INFO   Done

So I suspect it has something to do with the external software called by wgd. Could you try a different weighting method, like average linkage clustering (using --wm alc)? (This does not rely on an external tree inference program).

BTW: while trying to reproduce you're issue I caught a different bug more downstream, due to the numerical gene IDs you use. I just pushed a new version that fixes the bug to github, so you might want to update your wgd version.

mdieser commented 5 years ago

I truly appreciate all your help and advice! Your suggestion to include --wm alc definitely increased the length of processed GF before I run into another error. My IT administrator will update the wgd version. I will be field sampling for the next two weeks without access to my data. Can we please leave this ticket open until I get back for further testing?

arzwa commented 5 years ago

Sure!

mdieser commented 5 years ago

Hi - I think that my previous error message was partially related to the lack of space on my server to create temp files. That could explain why you were unable to reproduce it. General question,- how many gene families are there? As you can see below analysis stops at 877. The new error message is 'FloatingPointError: NaN dissimilarity value'. Any advice?

2019-08-01 11:14:44: INFO   Performing analysis on gene family GF_000876
2019-08-01 11:14:44: INFO   Performing analysis on gene family GF_000877
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 303, in analyse_family
    results_dict, msa=msa_path_protein, method=method)
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 103, in _weighting
    clustering = average_linkage_clustering(pairwise_estimates['Ks'])
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/wgd/phy.py", line 171, in average_linkage_clustering
    clustering = fastcluster.average(pairwise_estimates)
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/fastcluster.py", line 52, in average
    return linkage(D, method='average')
  File "/software/miniconda2/envs/wgd/lib/python3.6/site-packages/fastcluster.py", line 247, in linkage
    linkage_wrap(N, X, Z, mthidx[method])
FloatingPointError: NaN dissimilarity value.
arzwa commented 5 years ago

Sorry, I was traveling. This looks like a bug. I think some family has some invalid Ks estimate, breaking the average linkage clustering using fastcluster. I'll have a look and get back to you ASAP.

arzwa commented 5 years ago

Hi, I just pushed a bug fix to github, so please install the latest commit. The problem was in some NaN Ks estimates for gene family 28 (see GF_000028.codeml in your ks tmp directory).

mdieser commented 5 years ago

Thanks for fixing the bug. The script finishes without an error. After reading several of your latest manuscripts it appears that wgd has mainly been tested on plant genomes. I am running wgd on bacterial genomes and was wondering about its applicability? I am particularly interested in gene duplication as an evolutionary response in bacteria.

arzwa commented 5 years ago

Hi, I'd like to keep this github section for issue tracking. I'll close this issue and I suggest we can further communicate via email?