Closed amit4mchiba closed 5 years ago
By the way, these are all the files produced while running the second step, but nothing has changed for the last 10 hours: no new files, no summary, nothing else. I am wondering whether I need to use the .Ks files and create a ks.tsv for the next stage. Please advise.
amit8chiba@amit8chiba-Precision-Tower-7910:/mnt/md0/Opu_r1.2_final/Comparitive_genomics/wgd/run_results/Opu_Ksd_analysis/ks_tmp.371e05d6b24ca6$ ls
GF_000010.fasta GF_000522.fasta GF_001046.fasta GF_001570.fasta GF_002091.Ks GF_002607.Ks GF_003127.fasta GF_003651.fasta
GF_000010.Ks GF_000522.Ks GF_001046.Ks GF_001570.Ks GF_002092.fasta GF_002608.fasta GF_003127.Ks GF_003651.Ks
GF_000011.fasta GF_000523.fasta GF_001047.fasta GF_001571.fasta GF_002092.Ks GF_002608.Ks GF_003128.fasta GF_003652.fasta
GF_000011.Ks GF_000523.Ks GF_001047.Ks GF_001571.Ks GF_002093.fasta GF_002609.fasta GF_003128.Ks GF_003652.Ks
GF_000012.fasta GF_000524.fasta GF_001048.fasta GF_001572.fasta GF_002093.Ks GF_002609.Ks GF_003129.fasta GF_003653.fasta
GF_000012.Ks GF_000524.Ks GF_001048.Ks GF_001572.Ks GF_002094.fasta GF_002610.fasta GF_003129.Ks GF_003653.Ks
GF_000013.fasta GF_000525.fasta GF_001049.fasta GF_001573.fasta GF_002094.Ks GF_002610.Ks GF_003130.fasta GF_003654.fasta
GF_000013.fasta.msa GF_000525.Ks GF_001049.Ks GF_001573.Ks GF_002095.fasta GF_002611.fasta GF_003130.Ks GF_003654.Ks
GF_000013.fasta.msa.phyml GF_000526.fasta GF_001050.fasta GF_001574.fasta GF_002095.Ks GF_002611.Ks GF_003131.fasta GF_003655.fasta
GF_000013.fasta.msa.phyml_phyml_stats.txt GF_000526.Ks GF_001050.Ks GF_001574.Ks GF_002096.fasta GF_002612.fasta GF_003131.Ks GF_003655.Ks
GF_000013.fasta.msa.phyml_phyml_tree.txt GF_000527.fasta GF_001051.fasta GF_001575.fasta GF_002096.Ks GF_002612.Ks GF_003132.fasta GF_003656.fasta
GF_000014.fasta GF_000527.Ks GF_001051.Ks GF_001575.Ks GF_002097.fasta GF_002613.fasta GF_003132.Ks GF_003656.Ks
GF_000014.fasta.msa GF_000528.fasta GF_001052.fasta GF_001576.fasta GF_002097.Ks GF_002613.Ks GF_003133.fasta GF_003657.fasta
GF_000014.fasta.msa.phyml GF_000528.Ks GF_001052.Ks GF_001576.Ks GF_002098.fasta GF_002614.fasta GF_003133.Ks GF_003657.Ks
GF_000014.fasta.msa.phyml_phyml_stats.txt GF_000529.fasta GF_001053.fasta GF_001577.fasta GF_002098.Ks GF_002614.Ks GF_003134.fasta GF_003658.fasta
GF_000014.fasta.msa.phyml_phyml_tree.txt GF_000529.Ks GF_001053.Ks GF_001577.Ks GF_002099.fasta GF_002615.fasta GF_003134.Ks GF_003658.Ks
GF_000015.fasta GF_000530.fasta GF_001054.fasta GF_001578.fasta GF_002099.Ks GF_002615.Ks GF_003135.fasta GF_003659.fasta
GF_000015.Ks GF_000530.Ks GF_001054.Ks GF_001578.Ks GF_002100.fasta GF_002616.fasta GF_003135.Ks GF_003659.Ks
GF_000016.fasta GF_000531.fasta GF_001055.fasta GF_001579.fasta GF_002100.Ks GF_002616.Ks GF_003136.fasta GF_003660.fasta
GF_000016.Ks GF_000531.Ks GF_001055.Ks GF_001579.Ks GF_002101.fasta GF_002617.fasta GF_003136.Ks GF_003660.Ks
GF_000017.fasta GF_000532.fasta GF_001056.fasta GF_001580.fasta GF_002101.Ks GF_002617.Ks GF_003137.fasta GF_003661.fasta
GF_000017.fasta.msa GF_000532.Ks GF_001056.Ks GF_001580.Ks GF_002102.fasta GF_002618.fasta GF_003137.Ks GF_003661.Ks
GF_000017.fasta.msa.phyml GF_000533.fasta GF_001057.fasta GF_001581.fasta GF_002102.Ks GF_002618.Ks GF_003138.fasta GF_003662.fasta
GF_000017.fasta.msa.phyml_phyml_stats.txt GF_000533.Ks GF_001057.Ks GF_001581.Ks GF_002103.fasta GF_002619.fasta GF_003138.Ks GF_003662.Ks
GF_000017.fasta.msa.phyml_phyml_tree.txt GF_000534.fasta GF_001058.fasta GF_001582.fasta GF_002103.Ks GF_002619.Ks GF_003139.fasta GF_003663.fasta
GF_000018.fasta GF_000534.Ks GF_001058.Ks GF_001582.Ks GF_002104.fasta GF_002620.fasta GF_003139.Ks GF_003663.Ks
GF_000018.fasta.msa GF_000535.fasta GF_001059.fasta GF_001583.fasta GF_002104.Ks GF_002620.Ks GF_003140.fasta GF_003664.fasta
GF_000018.fasta.msa.phyml GF_000535.Ks GF_001059.Ks GF_001583.Ks GF_002105.fasta GF_002621.fasta GF_003140.Ks GF_003664.Ks
GF_000018.fasta.msa.phyml_phyml_stats.txt GF_000536.fasta GF_001060.fasta GF_001584.fasta GF_002105.Ks GF_002621.Ks GF_003141.fasta GF_003665.fasta
GF_000018.fasta.msa.phyml_phyml_tree.txt GF_000536.Ks GF_001060.Ks GF_001584.Ks GF_002106.fasta GF_002622.fasta GF_003141.Ks GF_003665.Ks
GF_000019.fasta GF_000537.fasta GF_001061.fasta GF_001585.fasta GF_002106.fasta.msa GF_002622.Ks GF_003142.fasta GF_003666.fasta
GF_000019.fasta.msa GF_000537.Ks GF_001061.Ks GF_001585.Ks GF_002106.fasta.msa.nw GF_002623.fasta GF_003142.Ks GF_003666.Ks
GF_000019.fasta.msa.phyml GF_000538.fasta GF_001062.fasta GF_001586.fasta GF_002107.fasta GF_002623.fasta.msa GF_003143.fasta GF_003667.fasta
GF_000019.fasta.msa.phyml_phyml_stats.txt GF_000538.Ks GF_001062.Ks GF_001586.Ks GF_002107.Ks GF_002623.fasta.msa.nw GF_003143.Ks GF_003667.Ks
GF_000019.fasta.msa.phyml_phyml_tree.txt GF_000539.fasta GF_001063.fasta GF_001587.fasta GF_002108.fasta GF_002624.fasta GF_003144.fasta GF_003668.fasta
GF_000020.fasta GF_000539.Ks GF_001063.Ks GF_001587.Ks GF_002108.Ks GF_002624.Ks GF_003144.Ks GF_003668.Ks
GF_000020.fasta.msa GF_000540.fasta GF_001064.fasta GF_001588.fasta GF_002109.fasta GF_002625.fasta GF_003145.fasta GF_003669.fasta
GF_000020.fasta.msa.phyml GF_000540.Ks GF_001064.Ks GF_001588.Ks GF_002109.Ks GF_002625.Ks GF_003145.Ks GF_003669.Ks
GF_000020.fasta.msa.phyml_phyml_stats.txt GF_000541.fasta GF_001065.fasta GF_001589.fasta GF_002110.fasta GF_002626.fasta GF_003146.fasta GF_003670.fasta
GF_000020.fasta.msa.phyml_phyml_tree.txt GF_000541.Ks GF_001065.Ks GF_001589.Ks GF_002110.Ks GF_002626.Ks GF_003146.Ks GF_003670.Ks
GF_000021.fasta GF_000542.fasta GF_001066.fasta GF_001590.fasta GF_002111.fasta GF_002627.fasta GF_003147.fasta GF_003671.fasta
GF_000021.fasta.msa GF_000542.Ks GF_001066.Ks GF_001590.Ks GF_002111.Ks GF_002627.fasta.msa GF_003147.Ks GF_003671.Ks
GF_000021.fasta.msa.phyml GF_000543.fasta GF_001067.fasta GF_001591.fasta GF_002112.fasta GF_002627.fasta.msa.nw GF_003148.fasta GF_003672.fasta
GF_000021.fasta.msa.phyml_phyml_stats.txt GF_000543.Ks GF_001067.Ks GF_001591.Ks GF_002112.Ks GF_002628.fasta GF_003148.Ks GF_003672.Ks
GF_000021.fasta.msa.phyml_phyml_tree.txt GF_000544.fasta GF_001068.fasta GF_001592.fasta GF_002113.fasta GF_002628.Ks GF_003149.fasta GF_003673.fasta
GF_000022.fasta GF_000544.Ks GF_001068.Ks GF_001592.Ks GF_002113.Ks GF_002629.fasta GF_003149.Ks GF_003673.Ks
GF_000022.Ks GF_000545.fasta GF_001069.fasta GF_001593.fasta GF_002114.fasta GF_002629.Ks GF_003150.fasta GF_003674.fasta
GF_000023.fasta GF_000545.Ks GF_001069.Ks GF_001593.Ks GF_002114.Ks GF_002630.fasta GF_003150.Ks GF_003674.Ks
GF_000023.Ks GF_000546.fasta GF_001070.fasta GF_001594.fasta GF_002115.fasta GF_002630.Ks GF_003151.fasta GF_003675.fasta
GF_000024.fasta GF_000546.Ks GF_001070.Ks GF_001594.Ks GF_002115.Ks GF_002631.fasta GF_003151.Ks GF_003675.Ks
GF_000024.Ks GF_000547.fasta GF_001071.fasta GF_001595.fasta GF_002116.fasta GF_002631.Ks GF_003152.fasta GF_003676.fasta
GF_000025.fasta GF_000547.Ks GF_001071.Ks GF_001595.Ks GF_002116.Ks GF_002632.fasta GF_003152.Ks GF_003676.Ks
GF_000025.fasta.msa GF_000548.fasta GF_001072.fasta GF_001596.fasta GF_002117.fasta GF_002632.Ks GF_003153.fasta GF_003677.fasta
GF_000025.fasta.msa.phyml GF_000548.Ks GF_001072.Ks GF_001596.Ks GF_002117.Ks GF_002633.fasta GF_003153.Ks GF_003677.Ks
GF_000025.fasta.msa.phyml_phyml_stats.txt GF_000549.fasta GF_001073.fasta GF_001597.fasta GF_002118.fasta GF_002633.Ks GF_003154.fasta GF_003678.fasta
GF_000025.fasta.msa.phyml_phyml_tree.txt GF_000549.Ks GF_001073.Ks GF_001597.Ks GF_002118.Ks GF_002634.fasta GF_003154.Ks GF_003678.Ks
GF_000026.fasta GF_000550.fasta GF_001074.fasta GF_001598.fasta GF_002119.fasta GF_002634.Ks GF_003155.fasta GF_003679.fasta
GF_000026.Ks GF_000550.Ks GF_001074.Ks GF_001598.Ks GF_002119.Ks GF_002635.fasta GF_003155.Ks GF_003679.Ks
GF_000027.fasta GF_000551.fasta GF_001075.fasta GF_001599.fasta GF_002120.fasta GF_002635.Ks GF_003156.fasta GF_003680.fasta
GF_000027.Ks GF_000551.Ks GF_001075.Ks GF_001599.Ks GF_002120.Ks GF_002636.fasta GF_003156.Ks GF_003680.Ks
GF_000028.fasta GF_000552.fasta GF_001076.fasta GF_001600.fasta GF_002121.fasta GF_002636.Ks GF_003157.fasta GF_003681.fasta
GF_000028.Ks GF_000552.Ks GF_001076.Ks GF_001600.Ks GF_002121.Ks GF_002637.fasta GF_003157.Ks GF_003681.Ks
GF_000029.fasta GF_000553.fasta GF_001077.fasta GF_001601.fasta GF_002122.fasta GF_002637.Ks GF_003158.fasta GF_003682.fasta
GF_000029.Ks GF_000553.Ks GF_001077.Ks GF_001601.Ks GF_002122.Ks GF_002638.fasta GF_003158.Ks GF_003682.Ks
GF_000030.fasta GF_000554.fasta GF_001078.fasta GF_001602.fasta GF_002123.fasta GF_002638.Ks GF_003159.fasta GF_003683.fasta
GF_000030.Ks GF_000554.Ks GF_001078.Ks GF_001602.Ks GF_002123.Ks GF_002639.fasta GF_003159.Ks GF_003683.Ks
GF_000031.fasta GF_000555.fasta GF_001079.fasta GF_001603.fasta GF_002124.fasta GF_002639.Ks GF_003160.fasta GF_003684.fasta
GF_000031.Ks GF_000555.Ks GF_001079.Ks GF_001603.Ks GF_002124.Ks GF_002640.fasta GF_003160.Ks GF_003684.Ks
GF_000032.fasta GF_000556.fasta GF_001080.fasta GF_001604.fasta GF_002125.fasta GF_002640.Ks GF_003161.fasta GF_003685.fasta
GF_000032.Ks GF_000556.Ks GF_001080.Ks GF_001604.Ks GF_002125.Ks GF_002641.fasta GF_003161.Ks GF_003685.Ks
GF_000033.fasta GF_000557.fasta GF_001081.fasta GF_001605.fasta
Hi, as you might have realized, the gene families for which there is a .Ks file have been analyzed successfully. For the large families the analysis seems not to have finished; this might just be because of their size (tree inference and codeml can take a very long time).
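To see which families are still pending, one option is to compare the .fasta and .Ks files in the temporary directory. The following is only a minimal sketch: it assumes the naming seen in the listing above (GF_xxxxxx.fasta for the input, GF_xxxxxx.Ks for a finished family), and the function names are hypothetical helpers, not part of wgd.

```python
import os


def unfinished_families(filenames):
    """Return family IDs that have a .fasta file but no matching .Ks file.

    Assumes the naming seen in the listing: GF_xxxxxx.fasta / GF_xxxxxx.Ks.
    """
    fasta = {n[:-len(".fasta")] for n in filenames if n.endswith(".fasta")}
    done = {n[:-len(".Ks")] for n in filenames if n.endswith(".Ks")}
    return sorted(fasta - done)


def unfinished_in_dir(tmp_dir):
    """Convenience wrapper: apply the check to a directory on disk."""
    return unfinished_families(os.listdir(tmp_dir))
```

Calling `unfinished_in_dir()` on the ks_tmp directory should list families such as GF_000013, which has alignment and tree files in the listing but no .Ks file yet.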
I would advise you to run the analysis with fasttree (default) for tree inference. The trees are only used for weighting the Ks values, so a perfectly accurate tree is not really necessary; FastTree will do a good job for this purpose. Also, it is not really useful to use the --pairwise flag (it is slower and not better).
So my advice would be to just use the following command:
wgd ksd -o ./ -n32 ../genome_cds.mcl ../genome_cds.fa
If you want to re-use the results you already obtained, you can point wgd to the tmp directory using the -tmp option.
Hope that helps, please let me know if it does.
Thank you so much for your advice. It worked, but now I am stuck at the next stage. I wanted to run the wgd mix command on the output from wgd ksd.
Here is what I did:
(py3) amit8chiba@amit8chiba-Precision-Tower-7910:$ wgd mix genome_cds.fa.ks.tsv -n 1 5
2019-03-04 02:09:26: INFO Preparing data frame
Traceback (most recent call last):
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'AlignmentCoverage'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/amit8chiba/miniconda2/envs/py3/bin/wgd", line 11, in <module>
sys.exit(cli())
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/wgd_cli.py", line 1018, in mix
output_dir, gamma, n_init, max_iter
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/wgd_cli.py", line 1060, in mix_
ks_range[0], ks_range[1])
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/wgd/modeling.py", line 56, in filter_group_data
df = df[df["AlignmentCoverage"] >= aln_cov]
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/pandas/core/frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/amit8chiba/miniconda2/envs/py3/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'AlignmentCoverage'
I am not sure what happened here. I checked and I think I have all the dependencies. I also checked that pandas is installed (version 0.24.1).
Please advise. I think I am almost there but I am still stuck.
I have another problem. I also wanted to run the collinearity-based analysis, for which I ran the command as described in the manual. There were no errors and I got many files and results. But the dot plot is empty, and it seems it did not identify any collinearity. This is a little strange, since I can get results using MCScanX. As for the GFF file, I checked the Arabidopsis example GFF file and mine is in exactly the same format. So I am not sure how I should proceed there.
Thank you so much in advance.
Hi,
I am not sure why the mix command is failing; it should definitely not give an error like that unless your Ks distribution file is incorrectly formatted or empty... Could you send the first 20 lines or so from genome_cds.fa.ks.tsv?
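A quick way to check the file before sending it is to look at its header. This is a minimal sketch, assuming the .ks.tsv file is tab-separated with column names on the first line; the only column name taken from this thread is AlignmentCoverage (from the KeyError in the traceback), and `missing_columns` is a hypothetical helper, not part of wgd.

```python
import csv


def missing_columns(handle, required=("AlignmentCoverage",)):
    """Return the required column names that are absent from the header.

    `handle` is an open text file (or any iterable of lines) holding a
    tab-separated table whose first line is the header. An empty file
    reports every required column as missing.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return list(required)
    header = [name.strip() for name in header]
    return [name for name in required if name not in header]
```

For example, `with open("genome_cds.fa.ks.tsv", newline="") as fh: print(missing_columns(fh))` prints an empty list when the column is present and `['AlignmentCoverage']` when the file is empty or lacks it.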
For the co-linearity analysis, the issue could be that the --gene_attribute and --feature options are not correctly set. These specify where to look in the GFF for the genes and their names. So, for example, if your GFF looks like this:
scaffold_97 JGI v3.3 gene 385 546 . - . ID=Pp3s97_10;pacid=32918799;name=Pp3s97_10V3.1;tid=PAC:32918799
scaffold_97 JGI v3.3 mRNA 385 546 . - . ID=Pp3s97_10V3.1;Parent=Pp3s97_10
scaffold_97 JGI v3.3 exon 385 546 . - . ID=Pp3s97_10V3.1.exon.1;Parent=Pp3s97_10V3.1
scaffold_97 JGI v3.3 CDS 385 546 . - . ID=Pp3s97_10V3.1.CDS.1;Parent=Pp3s97_10V3.1
and the gene IDs in your Ks distribution and CDS fasta files look like Pp3s97_10, you would need to set --feature gene (the third-column value to use) and --gene_attribute ID (the attribute name in the last column that refers to the correct gene ID) in your command. Alternatively, for this example, you could use --feature mRNA and --gene_attribute Parent. Not sure if that is your problem though...
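To test a given feature/attribute pair before rerunning, the IDs it would select from the GFF can be extracted and compared with the IDs in the CDS fasta file. This is a minimal sketch of that lookup, not wgd's own parser: it assumes a standard nine-column, tab-separated GFF with `key=value` attributes separated by semicolons, and `gene_ids_from_gff` is a hypothetical helper.

```python
def gene_ids_from_gff(lines, feature, attribute):
    """Collect the values of `attribute` on rows whose type is `feature`.

    `lines` is any iterable of GFF lines (for example an open file).
    Comment lines, blank lines and rows without nine columns are skipped.
    """
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Column 9 holds semicolon-separated key=value pairs.
        for field in cols[8].split(";"):
            key, sep, value = field.strip().partition("=")
            if sep and key == attribute:
                ids.add(value)
    return ids


# The first two rows of the example above, with tabs assumed between columns.
SAMPLE = [
    "scaffold_97\tJGI v3.3\tgene\t385\t546\t.\t-\t.\t"
    "ID=Pp3s97_10;pacid=32918799;name=Pp3s97_10V3.1;tid=PAC:32918799",
    "scaffold_97\tJGI v3.3\tmRNA\t385\t546\t.\t-\t.\t"
    "ID=Pp3s97_10V3.1;Parent=Pp3s97_10",
]
```

With these rows, both `gene_ids_from_gff(SAMPLE, "gene", "ID")` and `gene_ids_from_gff(SAMPLE, "mRNA", "Parent")` select Pp3s97_10, whereas the pair mRNA/ID selects Pp3s97_10V3.1. If the selected set shares no IDs with the CDS fasta headers, the options most likely need changing.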
Thank you so much for your reply.
I really do not know why it did not work last time, but I decided to run the whole thing once again and it worked as expected, so I have no clue what was wrong before. I was able to get the plots described in the manual, although I am still trying to understand how to interpret them. Can you recommend a paper that would help me understand them? I can see two peaks in my Ks plot, but I do not know how to interpret them: whether a gene duplication happened and, if so, when. I am attaching the plots; do they look normal?
About wgd syn, you were right. The issue was the GFF and a difference in ID naming. I then used mRNA as -f and ID as -a, and it worked. I was able to get the expected plots. genome.mcl.ks_anchors.zip
Hi, I'm closing this issue and I would prefer to continue discussing results etc. via email, I'd like to keep the GitHub issues for software problems only.
Hi,
I am writing here to seek your advice on using wgd for my study.
Based on the paper and manual, I first got my genome_cds.fasta, and then used this command:
This resulted in a genome_cds.mcl file as output. I used this output file to run the next step as follows:
This is part of the output while it was running:
This step produced several files in the temp directory, but after almost 12 hours the output file has still not been generated. The program seems to be stuck, as no new files are being generated, but I cannot see any error. I am wondering if this is the expected run time. My genome is 400 Mb in size and has 35,000 genes.
Please let me know if you need any further information in order to help me out here.
Thank you so much in advance,
With best regards, Amit