arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0
81 stars 40 forks source link

Can I use wgd for de novo transcriptomes #47

Closed lychen83 closed 2 years ago

lychen83 commented 3 years ago

Dear Arthur,

I have several questions about the wgd. 1), can I use the wgd for my de novo transcriptome data? 2), I have tried wgd for my denovo transcriptome data. The maximum numbers for duplications of the samples are around 100 - 350. Are they too low? 3), can I use the wgd for estimating the between species Ks plot. I have tried the command 'wgd dmd spcies1.cds.fa species2. cds.fa'. I suppose the Ks peak recovered from analysis should correspond to the split between the species1 and species2. Then, comparing the location of Ks peak (a WGD event) of species1, Ks peak of species2 (a WGD event), and the Ks peak between the two species, I can know if WGD events happened before or after the split of the two species. Is my understanding right? However, after running the command 'wgd dmd spcies1.cds.fa species2. cds.fa', I did not get an svg file, which may show the Ks distribution. What additional steps need? 4), I want to plot the Ks distribution for multiple species in one figure (e.g. the Fig. 1a in your Zwaenepoel1 and de Peer (2019) paper). How can I achieve it with my Linux workstation?

I appreciate for help. Best,

Lingyun

arzwa commented 3 years ago

(1) Yes but you have to make sure to only provide proper coding sequence data, your sequences must be perfectly translatable (i.e. must not contain stop codons, and must be have a length that is a multiple of 3) (2) Make sure your data is properly translatable (see point 1), because sequences that are not will be discarded. You may also want to try out the --pairwise flag if the pipeline discards too many families because of families with many gaps in the alignments. (3) Yes you can use wgd for that. Note that the interpretation of Ks as time depends on the substitution rates. If the substitution rates differ across lineages examined, you cannot directly compare the Ks distributions, so beware. WIth regard to your issues, you have to run wgd ksd on the output of the wgd dmd command, the latter only identifies reciprocal best hits as putative orthologs. (4) You can use wgd viz to plot multiple distributions. Alternatively you can open an R, python, matlab, julia or whatever you might be comfortable with and try to make the plot you need.

lychen83 commented 3 years ago

Thanks Arthur,

I run the wdg ksd first, then run the wgd dmd. The command and output are following:

wgd ksd apon20128364.cds.fa_alis16021.cds.fa.rbh apon20128364.cds.fa alis16021.cds.fa -n 5
2020-12-01 17:44:29: INFO       Error: err: option file. add space around the equal sign?.
2020-12-01 17:44:29: INFO       codeml found
2020-12-01 17:44:30: INFO
2020-12-01 17:44:30: INFO
2020-12-01 17:44:30: WARNING    Output directory exists, will possibly overwrite
2020-12-01 17:44:31: INFO       Translating CDS file
100% (41432 of 41432) |########################################################################################################################################################
2020-12-01 17:44:53: WARNING    There were 0 warnings during translation
2020-12-01 17:44:53: INFO       Started whole paranome Ks analysis
2020-12-01 17:44:53: WARNING    Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2020-12-01 17:44:53: WARNING    If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter.
...
2020-12-01 18:37:20: INFO       Analysis done
2020-12-01 18:37:20: INFO       Making results data frame
2020-12-01 18:40:11: INFO       Removing tmp directory
2020-12-01 18:40:14: INFO       Computing weights, outlier cut-off at Ks > 5
2020-12-01 18:40:14: INFO       Generating plots
2020-12-01 18:40:20: INFO       Will plot **node-weighted** histograms
2020-12-01 18:40:27: INFO       Done_

However, I did not find an svg file, which show the ks plot. I have used cds sequences, which do not have stop codon in the middle of sequences and are multiple of 3. All of the cds sequences can be translated to pep without problems. However, some sequences lack the 5' end or 3' end. I got error reminds when I run the wgd dmd. Could you figure out the problem?

Yes, I used the wgd viz for ks plot. I got an error:

wgd viz -i -ks alisorie.cds.fa.blast.tsv,amorkonj.cds.fa.blast.tsv`
2020-12-01 22:21:19: INFO       Not a Ks distribution: alisorie.cds.fa.blast.tsv
2020-12-01 22:21:22: INFO       Not a Ks distribution: amorkonj.cds.fa.blast.tsv
Traceback (most recent call last):
  File "/home/lyc/.local/bin/wgd", line 11, in <module>
    load_entry_point('wgd==1.2', 'console_scripts', 'wgd')()
  File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1025, in __call__
    return self.main(*args, **kwargs)
  File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 955, in main
    rv = self.invoke(ctx)
  File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1517, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1279, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 710, in invoke
    return callback(*args, **kwargs)
  File "/home/lyc/.local/lib/python3.7/site-packages/wgd_cli.py", line 1286, in viz
    output_file, filters, ks_range, bins, interactive, weighted
  File "/home/lyc/.local/lib/python3.7/site-packages/wgd_cli.py", line 1350, in viz_
    histogram_bokeh(dists_files, labels)
  File "/home/lyc/.local/lib/python3.7/site-packages/wgd/viz.py", line 438, in histogram_bokeh
    c = get_colors()
  File "/home/lyc/.local/lib/python3.7/site-packages/wgd/viz.py", line 416, in get_colors
    c = [c[-1]]
IndexError: list index out of range_

In addition, if I want to plot the Ks distribution with R or Python. Can I use the Ks values in *.cds.fa.ks.tsv directly?

Thank you so much.

Best, Lingyun

arzwa commented 3 years ago

Are you sure you are providing the right files to wgd viz? You have to provide the .ks.tsv files. Yes you can use the Ks values in the tsv file directly, but these are not weighted by node in the gene family tree. See for instance #24.

lychen83 commented 3 years ago

Dear Arthur,

Thanks for your help. I have used software wgd to infer the ancient WGD for my species. The result is similar to results inferred from other methods. However, when I use the wgd to infer the location of the split between two species, e.g. A and B (I used command wgd dmd A.cds.fa B.cds.fa and wgd ksd A.cds.fa_B.cds.fa.rbh A.cds.fa B.cds.fa). The split (location, the peak) is much older than other methods. In other words, the WGD was suggested to have happened after the split between A and B when using your software wgd, while the WGD happened before the split of A and B. Most of my analyses supported that WGD happened before the split of A and B. I don't know why the age of split of two species recovered from wgd is older. Is it that the algorithms for within species Ks and between species Ks analyses different? Thank you so much. Best, Lingyun

Example
arzwa commented 3 years ago

I'm afraid I cannot help much with this information, why do you think the WGD is shared, what other methods did you use, etc. etc. Also, are you sure the peak in species B you point at with the arrow is associated with a genome duplication?

lychen83 commented 3 years ago

Dear Arthur,

Thanks for the reply. I used scripts from Stephen A. Smith at UMich's lab for Ks plot (https://bitbucket.org/ningwang83/portulacineae/src/master/). In their methods, BLASTP was used to search for the paralogous pairs, then Ks values were calculated. Based on the Ks plot, I think there might be a WGD at Ks = 0.5, which is similar to results from software wgd. When I use the python multispecies_ks.py in Stephen lab to plot the between species Ks, the peak is around 0.2. I think multispecies_ks.pysearched the orthologous pairs and calculated the Ks values. However, when I use software wgd wgd dmd A.cds.fa B.cds.fa to do the same work, the Ks peak is around 1.8.
In addition, I also mapped gene trees to species tree to infer the WGD. Overall, I think there might be a WGD at Ks = 0.5. However, I am not confident about it.

I am curious if I can use the wgd dmd A.cds.fa B.cds.fa to infer the split between A and B. Thank you for help.

Best, Lingyun

arzwa commented 3 years ago

Yes wgd dmd with two genomes infers one-to-one orthologs, and wgd ksd obtains the Ks estimates for these orthologs. The methods seem to be slightly different but they should not give so different results of course. I don't know why you are seeing those results, I recommend you take an in-depth look at the orthologs inferred in both methods and compare some of the Ks estimates.