Problem in ksd step

mdgn15 commented 5 years ago

Hello,

First, I am aware that there is an open ticket about a similar problem, but I didn't want to hijack it because the problem here might not be with codeml. And thank you for this great pipeline! Running the ksd step on my data fails with the following error:

$ wgd --verbosity debug ksd cds.lactea_sample.mcl cds.lactea_sample.fa 
2019-08-01 15:25:06: DEBUG  CACHEDIR=/home/papaya/.cache/matplotlib
2019-08-01 15:25:06: DEBUG  Using fontManager instance from /home/papaya/.cache/matplotlib/fontlist-v310.json
2019-08-01 15:25:06: DEBUG  Loaded backend qt5agg version unknown.
2019-08-01 15:25:06: DEBUG  Loaded backend tkagg version unknown.
2019-08-01 15:25:06: DEBUG  Loaded backend TkAgg version unknown.
2019-08-01 15:25:06: INFO   
2019-08-01 15:25:06: INFO   codeml found
2019-08-01 15:25:06: INFO   MUSCLE v3.8.31 by Robert C. Edgar
2019-08-01 15:25:06: INFO   
2019-08-01 15:25:06: WARNING    Output directory exists, will possibly overwrite
2019-08-01 15:25:06: DEBUG  Reading CDS sequences
2019-08-01 15:25:06: INFO   Translating CDS file
2019-08-01 15:25:06: DEBUG  wrapping excepthook
100% (1001 of 1001) |##################################################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
2019-08-01 15:25:06: WARNING    There were 0 warnings during translation
2019-08-01 15:25:06: INFO   Started whole paranome Ks analysis
2019-08-01 15:25:06: WARNING    Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2019-08-01 15:25:06: WARNING    If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter. 
2019-08-01 15:25:06: INFO   Started analysis in parallel (n_threads = 4)
2019-08-01 15:25:06: INFO   Analysis done
2019-08-01 15:25:06: INFO   Making results data frame
2019-08-01 15:25:06: INFO   Removing tmp directory
2019-08-01 15:25:06: INFO   Computing weights, outlier cut-off at Ks > 5
Traceback (most recent call last):
  File "/home/papaya/anaconda2/envs/wgd/bin/wgd", line 10, in <module>
    sys.exit(cli())
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/wgd_cli.py", line 545, in ksd
    max_pairwise=max_pairwise
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/wgd_cli.py", line 686, in ksd_
    max_pairwise=max_pairwise,
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 665, in ks_analysis_paranome
    results_frame = compute_weights(results_frame)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/wgd/ks_distribution.py", line 709, in compute_weights
    df["WeightOutliersIncluded"] = 1 / df.groupby(['Family', 'Node'])[
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/pandas/core/generic.py", line 7632, in groupby
    observed=observed, **kwargs)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2110, in groupby
    return klass(obj, by, **kwds)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 360, in __init__
    mutated=self.mutated)
  File "/home/papaya/anaconda2/envs/wgd/lib/python3.6/site-packages/pandas/core/groupby/grouper.py", line 578, in _get_grouper
    raise KeyError(gpr)
KeyError: 'Node'

Note that I am using a small sample of my full CDS data, as in the supplementary methods example from the publication. Could this be related to the problem? I am not very strong in bioinformatics, so I would be really glad if you could help me out.

I have attached the input files. Thanks in advance.

Edit: Spelling.

cds.lactea_sample.mcl.txt

cds.lactea_sample.fa.txt

arzwa commented 5 years ago

Hi, you have no multi-copy gene families in your mcl output, so there are no families to analyze in the ksd step. This is because you are using the small test data set. If you use a larger data set (with complete CDS sequences, as it seems you truncated them?), it will probably work just fine. I should change the code so that it exits with a more relevant error message instead of crashing like this, sorry!
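
For anyone who hits the same KeyError, here is a rough sketch of how one could check an mcl file for multi-copy families before running ksd. It is not part of wgd. It assumes the mcl output has one gene family per line with whitespace-separated gene IDs, and the function name `multicopy_families` and the example gene IDs are made up for illustration.

```python
import io


def multicopy_families(handle, min_size=2):
    """Return the families (lists of gene IDs) with at least `min_size` members.

    Assumption: one family per line, gene IDs separated by whitespace.
    """
    families = []
    for line in handle:
        genes = line.split()
        if len(genes) >= min_size:
            families.append(genes)
    return families


# Hypothetical file content: two single-copy families and one family of three.
example = io.StringIO("geneA\ngeneB\ngeneC\tgeneD\tgeneE\n")
families = multicopy_families(example)

if not families:
    # With no multi-copy families there are no gene pairs to compute Ks for.
    print("No multi-copy gene families found; nothing to analyse in ksd.")
else:
    # A family of n genes contributes n*(n-1)/2 pairs (the quantity the log
    # compares against `max_pairwise`).
    n_pairs = sum(len(f) * (len(f) - 1) // 2 for f in families)
    print(f"multi-copy families: {len(families)}, gene pairs: {n_pairs}")
```

To check a real file, open it with `open("cds.lactea_sample.mcl")` and pass the file handle in place of the `io.StringIO` example. If the function returns an empty list, the ksd step has nothing to work on.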

mdgn15 commented 5 years ago

Ah yes, I started a run at night with my full data set and there were no errors. Thank you!

arzwa / wgd

Problem in ksd step #20