arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

wgd ksd never finishes #31

Closed esud1 closed 4 years ago

esud1 commented 4 years ago

Hi,

I am having a bit of trouble with the wgd ksd step.

The program runs smoothly at first and produces a number of files in the tmp folder (including fasta, msa, and Ks files), but at some point it freezes and nothing happens after that. The command line interface gives no further information. I have let the current run go for 4 days without any change. Here is a screenshot of the last INFO message given by the program:

[image: screenshot of the last INFO message]

I tried running a smaller subset of the data (1000 random ones, similar to the supplemental info in the paper), and the program produced the output without any problem.
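For anyone wanting to reproduce this kind of subset test, here is a minimal sketch. It assumes the subset is drawn at the gene-family level and that the families file follows the usual MCL layout (one family per line, gene IDs separated by tabs). The function name `sample_families` and the fixed seed are illustrative only and are not part of wgd; whether the CDS file also needs filtering to match is not covered here.

```python
import random


def sample_families(mcl_path, out_path, n=1000, seed=42):
    """Write n randomly chosen gene families (lines) from an MCL file.

    Assumes one family per line. Returns the number of families written.
    """
    with open(mcl_path) as handle:
        # Drop blank lines and trailing newlines so every family is one clean string
        families = [line.rstrip("\n") for line in handle if line.strip()]

    rng = random.Random(seed)  # fixed seed so the same subset can be drawn again
    chosen = rng.sample(families, min(n, len(families)))

    with open(out_path, "w") as handle:
        for family in chosen:
            handle.write(family + "\n")
    return len(chosen)
```

Doubling `n` between runs (for example 1000, 2000, 4000) is one way to narrow down the size at which a run starts to stall.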

Do you know what went wrong? Could it be due to the size of the data or other problems? FYI, the CDS fasta file is ~30 Mb, and the mcl file is ~369 kb.

arzwa commented 4 years ago

Not sure, but I remember encountering such a stalling issue once because the codeml program (which is used under the hood to estimate Ks using ML) hung on some gene family without giving an error. I would suggest two things to figure this out: (1) run the program with the -v debug flag (this will give you a lot more information) and (2) try to find out whether a specific gene family is causing the freeze. If you could obtain a small data set for which you can reproduce the issue, that would be helpful for me (I strongly doubt it has anything to do with the size of your data set).
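When an external tool hangs without raising an error, one generic way to detect it is to run the suspect command by hand under a time limit. The sketch below is not how wgd calls codeml; `run_with_timeout` is a hypothetical helper, and the command in the usage part is a stand-in that simply sleeps, to be replaced by whatever command is being tested.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False when the process exceeded timeout_s and was killed.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left running
        return False, None
    return True, result.returncode


if __name__ == "__main__":
    # Stand-in for a hanging external program (assumption: replace with the real command)
    hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hanging, timeout_s=1))
```

Running each suspicious gene family through such a wrapper separates "slow but progressing" from "never returns" without waiting for days.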

esud1 commented 4 years ago

After a few more runs, I noticed that the program always freezes while analyzing the same gene families (the last processed file). But when I extract those gene families and run them on their own, wgd ksd seems to work fine.
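A small sketch for identifying "the last processed file": list the most recently modified files in the tmp folder. It makes no assumption about wgd's file naming; `latest_files` is an illustrative helper name. If families are processed in parallel, the newest file does not necessarily belong to the family that stalled, so it helps to look at the last few files and not only the newest one.

```python
from pathlib import Path


def latest_files(tmp_dir, n=5):
    """Return the n most recently modified files under tmp_dir, newest first."""
    files = [p for p in Path(tmp_dir).rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:n]
```

Comparing the result across several stalled runs shows whether the same families are always the last ones touched.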

Should I remove these gene families and try to re-run the program?

Thanks

arzwa commented 4 years ago

That sounds strange. Could you provide me with a test data set (CDS sequences and families) so that I can try to reproduce the problem? Preferably a relatively small subset for which you observe this issue.

esud1 commented 4 years ago

Hi, I tried running it on a smaller subset (~5k sequences) and encountered no problem. However, when I scaled it up to ~10k sequences, the program stalled. I had a look at the latest temp files, and there seem to be no problems with the codeml program; the .Ks files were generated. I think the problem occurs when the program tries to merge all of these files to create the plot and .tsv files.

I also tried to run wgd on Arabidopsis whole CDS data (downloaded from PLAZA), and encountered the same problem - the program finishes the codeml part, but could not move forward from there.

By the way, I am using phyml (v 3.3.20190321) instead of FastTree. Could that affect the run?

arzwa commented 4 years ago

Hi, I'll try to figure this out. I usually don't use phyml for the trees, so it could have something to do with that. Maybe some of the largest families take up an inordinate amount of time? (You could check the active processes using top or htop on Linux; perhaps you will see phyml still running. Or just try running phyml on the largest gene family.) (BTW: if you do not plan to use the trees afterwards, I'd recommend using fasttree or the clustering approach, as an occasional tree error will barely affect the distribution.)
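To find the largest gene family to test on its own, the families file can be ranked by size. This is a minimal sketch assuming the usual MCL layout (one family per line, whitespace-separated gene IDs); `largest_families` is an illustrative name and not a wgd function.

```python
def largest_families(mcl_path, top=10):
    """Return (line_number, size) pairs for the `top` largest families.

    Line numbers are 1-based so they can be looked up directly in the file.
    """
    sizes = []
    with open(mcl_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            genes = line.split()
            if genes:  # skip blank lines
                sizes.append((line_number, len(genes)))
    # Stable sort: families of equal size keep their order in the file
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

The families at the top of this list are the natural first candidates when one tree-building step appears to run forever.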

esud1 commented 4 years ago

Hi Arthur,

I ran wgd with FastTree and the problem is fixed! And yes, the problem was with phyml: I could see that phyml was still running in the background when I checked using top. Many thanks for your help! :)