Closed lychen83 closed 2 years ago
(1) Yes but you have to make sure to only provide proper coding sequence data, your sequences must be perfectly translatable (i.e. must not contain stop codons, and must be have a length that is a multiple of 3)
(2) Make sure your data is properly translatable (see point 1), because sequences that are not will be discarded. You may also want to try out the --pairwise
flag if the pipeline discards too many families because of families with many gaps in the alignments.
(3) Yes you can use wgd
for that. Note that the interpretation of Ks as time depends on the substitution rates. If the substitution rates differ across lineages examined, you cannot directly compare the Ks distributions, so beware. WIth regard to your issues, you have to run wgd ksd
on the output of the wgd dmd
command, the latter only identifies reciprocal best hits as putative orthologs.
(4) You can use wgd viz
to plot multiple distributions. Alternatively you can open an R
, python
, matlab
, julia
or whatever you might be comfortable with and try to make the plot you need.
Thanks Arthur,
I run the wdg ksd
first, then run the wgd dmd
.
The command and output are following:
wgd ksd apon20128364.cds.fa_alis16021.cds.fa.rbh apon20128364.cds.fa alis16021.cds.fa -n 5
2020-12-01 17:44:29: INFO Error: err: option file. add space around the equal sign?.
2020-12-01 17:44:29: INFO codeml found
2020-12-01 17:44:30: INFO
2020-12-01 17:44:30: INFO
2020-12-01 17:44:30: WARNING Output directory exists, will possibly overwrite
2020-12-01 17:44:31: INFO Translating CDS file
100% (41432 of 41432) |########################################################################################################################################################
2020-12-01 17:44:53: WARNING There were 0 warnings during translation
2020-12-01 17:44:53: INFO Started whole paranome Ks analysis
2020-12-01 17:44:53: WARNING Filtered out the 0 largest gene families because n*(n-1)/2 > `max_pairwise`
2020-12-01 17:44:53: WARNING If you want to analyse these large families anyhow, please raise the `max_pairwise` parameter.
...
2020-12-01 18:37:20: INFO Analysis done
2020-12-01 18:37:20: INFO Making results data frame
2020-12-01 18:40:11: INFO Removing tmp directory
2020-12-01 18:40:14: INFO Computing weights, outlier cut-off at Ks > 5
2020-12-01 18:40:14: INFO Generating plots
2020-12-01 18:40:20: INFO Will plot **node-weighted** histograms
2020-12-01 18:40:27: INFO Done_
However, I did not find an svg file, which show the ks plot. I have used cds sequences, which do not have stop codon in the middle of sequences and are multiple of 3. All of the cds sequences can be translated to pep without problems. However, some sequences lack the 5' end or 3' end. I got error reminds when I run the wgd dmd
.
Could you figure out the problem?
Yes, I used the wgd viz
for ks plot.
I got an error:
wgd viz -i -ks alisorie.cds.fa.blast.tsv,amorkonj.cds.fa.blast.tsv`
2020-12-01 22:21:19: INFO Not a Ks distribution: alisorie.cds.fa.blast.tsv
2020-12-01 22:21:22: INFO Not a Ks distribution: amorkonj.cds.fa.blast.tsv
Traceback (most recent call last):
File "/home/lyc/.local/bin/wgd", line 11, in <module>
load_entry_point('wgd==1.2', 'console_scripts', 'wgd')()
File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1025, in __call__
return self.main(*args, **kwargs)
File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 955, in main
rv = self.invoke(ctx)
File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1517, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 1279, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/lyc/software/python_3.7.6_installed/lib/python3.7/site-packages/click-8.0.0a1-py3.7.egg/click/core.py", line 710, in invoke
return callback(*args, **kwargs)
File "/home/lyc/.local/lib/python3.7/site-packages/wgd_cli.py", line 1286, in viz
output_file, filters, ks_range, bins, interactive, weighted
File "/home/lyc/.local/lib/python3.7/site-packages/wgd_cli.py", line 1350, in viz_
histogram_bokeh(dists_files, labels)
File "/home/lyc/.local/lib/python3.7/site-packages/wgd/viz.py", line 438, in histogram_bokeh
c = get_colors()
File "/home/lyc/.local/lib/python3.7/site-packages/wgd/viz.py", line 416, in get_colors
c = [c[-1]]
IndexError: list index out of range_
In addition, if I want to plot the Ks distribution with R or Python. Can I use the Ks values in *.cds.fa.ks.tsv directly?
Thank you so much.
Best, Lingyun
Are you sure you are providing the right files to wgd viz
? You have to provide the .ks.tsv
files. Yes you can use the Ks values in the tsv file directly, but these are not weighted by node in the gene family tree. See for instance #24.
Dear Arthur,
Thanks for your help.
I have used software wgd to infer the ancient WGD for my species. The result is similar to results inferred from other methods.
However, when I use the wgd to infer the location of the split between two species, e.g. A and B (I used command wgd dmd A.cds.fa B.cds.fa
and wgd ksd A.cds.fa_B.cds.fa.rbh A.cds.fa B.cds.fa
). The split (location, the peak) is much older than other methods. In other words, the WGD was suggested to have happened after the split between A and B when using your software wgd, while the WGD happened before the split of A and B. Most of my analyses supported that WGD happened before the split of A and B.
I don't know why the age of split of two species recovered from wgd is older. Is it that the algorithms for within species Ks and between species Ks analyses different?
Thank you so much.
Best,
Lingyun
I'm afraid I cannot help much with this information, why do you think the WGD is shared, what other methods did you use, etc. etc. Also, are you sure the peak in species B you point at with the arrow is associated with a genome duplication?
Dear Arthur,
Thanks for the reply. I used scripts from Stephen A. Smith at UMich's lab for Ks plot (https://bitbucket.org/ningwang83/portulacineae/src/master/). In their methods, BLASTP was used to search for the paralogous pairs, then Ks values were calculated. Based on the Ks plot, I think there might be a WGD at Ks = 0.5, which is similar to results from software wgd. When I use the python multispecies_ks.py
in Stephen lab to plot the between species Ks, the peak is around 0.2. I think multispecies_ks.py
searched the orthologous pairs and calculated the Ks values. However, when I use software wgd wgd dmd A.cds.fa B.cds.fa
to do the same work, the Ks peak is around 1.8.
In addition, I also mapped gene trees to species tree to infer the WGD. Overall, I think there might be a WGD at Ks = 0.5. However, I am not confident about it.
I am curious if I can use the wgd dmd A.cds.fa B.cds.fa
to infer the split between A and B.
Thank you for help.
Best, Lingyun
Yes wgd dmd
with two genomes infers one-to-one orthologs, and wgd ksd
obtains the Ks estimates for these orthologs. The methods seem to be slightly different but they should not give so different results of course. I don't know why you are seeing those results, I recommend you take an in-depth look at the orthologs inferred in both methods and compare some of the Ks estimates.
Dear Arthur,
I have several questions about the wgd. 1), can I use the wgd for my de novo transcriptome data? 2), I have tried wgd for my denovo transcriptome data. The maximum numbers for duplications of the samples are around 100 - 350. Are they too low? 3), can I use the wgd for estimating the between species Ks plot. I have tried the command 'wgd dmd spcies1.cds.fa species2. cds.fa'. I suppose the Ks peak recovered from analysis should correspond to the split between the species1 and species2. Then, comparing the location of Ks peak (a WGD event) of species1, Ks peak of species2 (a WGD event), and the Ks peak between the two species, I can know if WGD events happened before or after the split of the two species. Is my understanding right? However, after running the command 'wgd dmd spcies1.cds.fa species2. cds.fa', I did not get an svg file, which may show the Ks distribution. What additional steps need? 4), I want to plot the Ks distribution for multiple species in one figure (e.g. the Fig. 1a in your Zwaenepoel1 and de Peer (2019) paper). How can I achieve it with my Linux workstation?
I appreciate for help. Best,
Lingyun