arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0
81 stars 41 forks source link

standard error of Ks from codeml #33

Closed igigkuo closed 2 years ago

igigkuo commented 4 years ago

Hello,

I really enjoy using this tool for my Ks analysis. For some reason I have to get the standard error of Ks, and it would be perfect for me to get SE during the Ks analysis with wgd. I have tried a stupid way to change the getSE = 1 in the codeml.py and have --preserve to get the codeml result, and then wgd ksd failed before generating Ks result for a major part of the gene families, although there are still a few gene families succeeded in producing Ks result. I attached the error report here. This error is triggered by getSE = 1 in codeml.py, and the process goes well when getSE = 0. Do you have any advise?

Thank you very much for your help!

nohup.txt

arzwa commented 4 years ago

I had a look, and it seems that when using getSE=1, sometimes codeml stops without crashing before all Ks values are estimated. It looks like this when you run codeml directly:

$ codeml GF_000002.ctrl  
CODONML in paml version 4.9h, March 2018

var(dS,dN) not calculable.
var(dS,dN) not calculable.
var(dS,dN) not calculable.
var(dS,dN) not calculable.
var(dS,dN) not calculable.
var(dS,dN) not calculable.

xmax = 0.0000e+00 close to zero at   1!

I modified the code to first check whether all pairwise estimates are in the codeml output file, raising a non-crashing error when this is not the case. I also implemented for your convenience a function to pass options for codeml from the command line in wgd ksd, so no need to hack into the codeml.py source code anymore. So now you can do:

wgd ksd <gene families> <CDS fasta> -tmp tmpdir  --preserve --codeml_options getSE=1

Instead of breaking when codeml fails it should now look like this:

2020-05-13 09:46:39: INFO   Started analysis in parallel (n_threads = 4)
2020-05-13 09:46:39: INFO   Performing analysis on gene family GF_000001
2020-05-13 09:46:39: INFO   Performing analysis on gene family GF_000002
2020-05-13 09:46:39: INFO   Performing analysis on gene family GF_000003
2020-05-13 09:46:39: INFO   Performing analysis on gene family GF_000004
2020-05-13 09:47:02: ERROR  Not all gene pairs present in /home/arzwa/dev/wgd/_test/tmpdir3/GF_000001.codeml
2020-05-13 09:47:02: WARNING    No codeml results for GF_000001!
2020-05-13 09:47:02: INFO   Performing analysis on gene family GF_000005

I haven't implemented parsing out the standard errors from the codeml output (I'm working on a new version of wgd and I'll try to remember implementing it there), so you'll still need to use --preserve and obtain the SEs yourself after wgd ksd.

Please install this updated version (latest commit on the master branch). This has not been thoroughly tested, so please let me know if you still encounter issues.

igigkuo commented 4 years ago

Hi,

Thanks for your quick response! It's great to have the --codeml_options to use, really appreciate for that! I have installed the new version and see the new --codeml_options, but there is another error occurred as below:

2020-05-14 02:36:36: INFO   AAML in paml version 4.9, March 2015
2020-05-14 02:36:36: INFO   codeml found
Traceback (most recent call last):
  File "/home/kuokuo/anaconda3/envs/python36/bin/wgd", line 11, in <module>
    load_entry_point('wgd==1.2', 'console_scripts', 'wgd')()
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/wgd-1.2-py3.6.egg/wgd_cli.py", line 639, in ksd
    min_msa_length=100, ignore_prefixes=False, one_v_one=False,
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/wgd-1.2-py3.6.egg/wgd_cli.py", line 690, in ksd_
    logging.error('No gene families or no sequences provided.')
  File "/home/kuokuo/anaconda3/envs/python36/lib/python3.6/site-packages/wgd-1.2-py3.6.egg/wgd/utils.py", line 75, in can_i_run_software
FileNotFoundError: [Errno 2] No such file or directory: 'codeml.ctl'

The same error occurred regardless of using this option or not. Thank you again for your kind help!

Sincerely,

arzwa commented 4 years ago

Hi, I tried again with a fresh installation, and I cannot reproduce this error on python3.7. What is your exact command?

igigkuo commented 4 years ago

I run tests on both python3.7 and python3.6 (in different conda env) with the commands:

wgd ksd -n 10 <gene families> <CDS fasta> --preserve
wgd ksd -n 10 <gene families> <CDS fasta>
wgd ksd <gene families> <CDS fasta>

All show the same report. When I re-install the old version, it works as previous did.

igigkuo commented 4 years ago

Hi,

I tried to uninstall and reinstall wgd v1.2, but still got the same error. The good news is, I find out that when the gene families contains only two sequences, the Ks calculation by codeml went well, and the whole process with wgd v1.1 completed prefectly! Although I still don't know why wgd v1.2 doesn't work in my system, but at least the original idea is feasible.

Sincerely,

arzwa commented 4 years ago

Hi, I don't understand. From the error message, the program seems to crash when trying to run codeml but that part of the code is exactly the same in the two versions of wgd. Is the codeml executable the same in both conda enviroments? It seems from the message above you are using an outdated PAML version BTW, I'm running PAML 4.9h.

igigkuo commented 4 years ago

Hi,

I think this is caused by my system environment, but I have no idea right now. My system is CentOS Linux release 7.8.2003. I used base environment and python36 environment for wgd analysis, and used python setup.py install to install wgd. The conda list for the two env are attached. I saw in the two env the paml version are both 4.9.

I noticed in wgd v1.2, it showed 2020-05-14 02:36:36: INFO AAML in paml version 4.9, March 2015 before 2020-05-14 02:36:36: INFO codeml found while in wgd v1.1 only the second line was shown. Is this normal in the new version?

I'll try some other test for the new version and will update here if I find out something new.

Sincerely,

timurrxxcd commented 2 months ago

I tried to run MCL clustring. I installed Paml package. but it still says "ERROR codeml executable not found!". can someone hel me please? image