Open xvazquezc opened 6 years ago
Hi,
It could be, it should run if there is at least one sample though. I did not write the Phylosift pipeline, Aaron Darling did. However, if you could upload the files outputsel_var.csv and outputtran_df.csv I can at least see if they are valid DESMAN input.
Thanks, Chris
Hi Chris, these are the files. outputtran_df.csv seems fine, but outputsel_var.csv only has the header. The files correspond to one of the clusters, but for what I could see, every cluster folder looks the same.
hi Xabi, Chris just pointed me to your message, sorry I didn't see it earlier. Do you know if your dataset has any variants in these marker genes? It seems like the variant file is empty, possibly because there were no high quality variant sites for it. How big is your dataset and do you have any prior idea of the species & strain complexity of the sample? It's also possible that something else has gone wrong, so if you're sure there are strains we should dig into the steps upstream of where it broke to understand why.
Hi @koadman,
the MAG I'm running has about 1.4 SNVs per kb based on anvi'o's output (0 would indicate a clonal population). I manually searched about 10 of the COGs used on the automated DESMAN workflow and I couldn't find a single SNV, probably because they are too conserved and the strains are too closely related. I guess that it will be the problem.
The metagenome assembly is based on 2 samples and this MAG has an overall coverage of ~200x. Most of the SNVs correspond to ~10% (5-20% range) of the mapped bases. The fact that they don't appear on highly conserved genes and that the proportion of constant bases is quite stable makes me think that the different bases aren't due to sequencing errors.
@chrisquince, is there a way to use the SNV table generated with the --quince-mode
in anvi'o as input for DESMAN? That would probably make things easier for me.
Cheers, Xabi
Hi Xavi,
You can use the standard SNV table from anvi'o with Desman but the output from quince mode is probably better since I believe it contains all base positions. You just need to convert the anvio table into Desman format. I do have a script for that but because the anvi'o fields can be variable it may require tweaking for your case. Can you send me say the first 1000 lines of the anvi'o output and I can test it?
Best, Chris
Thanks Chris, here it is.
Do you have the script available anywhere by any chance?
import sys, getopt import os import pandas as p import numpy as np import scipy.stats as ss import scipy as sp import scipy.misc as spm import math import argparse
from collections import defaultdict
def main(argv): parser = argparse.ArgumentParser()
parser.add_argument("anvio_file", help="ANVIO format frequencies")
args = parser.parse_args()
anvio_file = args.anvio_file
sample_pos = defaultdict(lambda: defaultdict(dict))
samples = set()
genes = set()
posns = defaultdict(set)
with open(args.anvio_file) as f:
next(f)
for line in f:
line = line.rstrip()
tokens = line.split('\t')
sample_id = tokens[3]
gene_id = tokens[29]
pos = int(tokens[4])
if sample_id not in samples:
samples.add(sample_id)
posns[gene_id].add(pos)
genes.add(gene_id)
freqs = (int(tokens[17]),int(tokens[19]), int(tokens[20]),int(tokens[18]))
sample_pos[gene_id][pos][sample_id] = freqs
samples = list(samples)
samples.sort()
genes = list(genes)
genes.sort()
#sString = ",".join(samples)
expanded = []
for name in samples:
expanded.append(name + "-A")
expanded.append(name + "-C")
expanded.append(name + "-G")
expanded.append(name + "-T")
eString = ",".join(expanded)
print "Gene,Posn," + eString
for gene in genes:
posns_gene = list(posns[gene])
posns_gene.sort()
for posn in posns_gene:
outString = gene + "," + repr(posn)
for sample in samples:
if sample in sample_pos[gene][posn]:
freqs = sample_pos[gene][posn][sample]
else:
freqs = (0,0,0,0)
freqs2 = [repr(f) for f in freqs]
fstring = ",".join(freqs2)
outString = outString + "," + fstring
print outString
if name == "main": main(sys.argv[1:])
Hi Chris, oh, I see what happened. The SNV file starts with the SNVs on non-coding regions (~1500 lines).
This one has the header and the last 1000 lines. Bin_1-SNVs_tail.txt
Thanks
Hi,
This script produces DESMAN compatible output but the frequencies look a bit odd. Also without gene calls the test for median coverage employed by DESMAN will not work. To run change extension to .py and then:
python ./ConvertANVIO.py Bin_1-SNVs_tail.txt
Thanks, Chris
Hi Chris,
Thanks for the script. The gene calls are in the corresponding_gene_call
column in the sample data.
Just changing L35 to gene_id = tokens[5]
does the trick.
By the way, should I remove the SNV calls from non-coding regions? Without them I still get 722 genes with a total of >5000 SNV positions.
Cheers, Xabi
I got to the step of running desman
itself but it gets stuck when tries to start the Gibbs burn-in
Terminal:
(desman) $ desman Bin_1-SNVs_outsel_var.csv -g 5 -i 50 -e Bin_1-SNVs_outr_df.csv -o Bin_1-SNVs_out_g5
Up and running. Check /home/xabi/Desktop/SBI_projects/miriam/Bin_1-SNVs_out_g5/log_file.txt for progress
Traceback (most recent call last):
File "/home/xabi/anaconda2/envs/desman/bin/desman", line 246, in <module>
main(sys.argv[1:])
File "/home/xabi/anaconda2/envs/desman/bin/desman", line 149, in main
haplo_SNP.update()
File "/home/xabi/.local/lib/python3.5/site-packages/desman-2.1.1-py3.5-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 336, in update
self.ll = self.logLikelihood(self.gamma,self.tau,self.eta)
File "/home/xabi/.local/lib/python3.5/site-packages/desman-2.1.1-py3.5-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 435, in logLikelihood
probVS = np.einsum('ijk,lj,km->ilm',cTau,cGamma,cEta)
File "/home/xabi/.local/lib/python3.5/site-packages/numpy/core/einsumfunc.py", line 1069, in einsum
return c_einsum(*operands, **kwargs)
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2822,5,4)->(2822,newaxis,newaxis,5,4) (2,5)->(2,newaxis,5,newaxis) (5263,2)->(2,newaxis,5263)
Log:
2018-09-28 10:35:18,063:INFO:root:Results created in /home/xabi/Desktop/SBI_projects/miriam/Bin_1-SNVs_out_g5
2018-09-28 10:35:18,063:INFO:root:Set fixed seed for random position selection = 238329
2018-09-28 10:35:18,074:INFO:root:Running Desman with 2 samples and 2822 variant positions finding 5 genomes.
2018-09-28 10:35:18,074:INFO:root:Set eta error transition matrix from = <_io.TextIOWrapper name='Bin_1-SNVs_outr_df.csv' mode='r' encoding='UTF-8'>
2018-09-28 10:35:18,078:INFO:root:Set second adjustable random seed = 23724839
2018-09-28 10:35:18,086:INFO:root:Perform NTF initialisation
2018-09-28 10:35:18,250:INFO:root:NTF Iter 0, div = 7033.098054
2018-09-28 10:35:23,079:INFO:root:NTF Iter 100, div = 27.149127
2018-09-28 10:35:28,126:INFO:root:NTF Iter 200, div = 26.413665
2018-09-28 10:35:33,046:INFO:root:NTF Iter 300, div = 25.664004
2018-09-28 10:35:37,922:INFO:root:NTF Iter 400, div = 24.269144
2018-09-28 10:35:42,835:INFO:root:NTF Iter 500, div = 22.555659
2018-09-28 10:35:47,722:INFO:root:NTF Iter 600, div = 20.267187
2018-09-28 10:35:52,605:INFO:root:NTF Iter 700, div = 16.463571
2018-09-28 10:35:57,485:INFO:root:NTF Iter 800, div = 12.501173
2018-09-28 10:36:02,373:INFO:root:NTF Iter 900, div = 7.588231
2018-09-28 10:36:07,234:INFO:root:NTF Iter 1000, div = 2.809085
2018-09-28 10:36:12,123:INFO:root:NTF Iter 1100, div = 0.659918
2018-09-28 10:36:17,012:INFO:root:NTF Iter 1200, div = 0.268430
2018-09-28 10:36:21,930:INFO:root:NTF Iter 1300, div = 0.147414
2018-09-28 10:36:26,919:INFO:root:NTF Iter 1400, div = 0.083881
2018-09-28 10:36:31,835:INFO:root:NTF Iter 1500, div = 0.054996
2018-09-28 10:36:36,732:INFO:root:NTF Iter 1600, div = 0.038581
2018-09-28 10:36:41,624:INFO:root:NTF Iter 1700, div = 0.028583
2018-09-28 10:36:46,453:INFO:root:NTF Iter 1800, div = 0.020659
2018-09-28 10:36:51,300:INFO:root:NTF Iter 1900, div = 0.015106
2018-09-28 10:36:56,135:INFO:root:NTF Iter 2000, div = 0.011695
2018-09-28 10:37:00,973:INFO:root:NTF Iter 2100, div = 0.009690
2018-09-28 10:37:05,833:INFO:root:NTF Iter 2200, div = 0.007904
2018-09-28 10:37:10,654:INFO:root:NTF Iter 2300, div = 0.006540
2018-09-28 10:37:15,171:INFO:root:Start Gibbs sampler burn-in phase
Any idea what can be going on? I removed the non-coding SNVs and run like this beforehand:
Variant_Filter.py Bin_1-SNVs_desman-nonc.csv -o Bin_1-SNVs_out -p
Hi Xabi, Can you upload your variants file Bin_1-SNVs_desman-nonc.csv and I will try running it? Thanks, Chris
Here you have. Thank you. Bin_1-SNVs_desman-nonc.txt
Hi @chrisquince I was wondering if you have had time to take a look. Thanks, Xabi
Dear Xabi,
Sorry for the delay responding. The problem is that you are using the wrong input file to DESMAN. Your command above should have been:
desman Bin_1-SNVs_outsel_var.csv -g 5 -i 50 -e Bin_1-SNVs_outtran_df.csv -o Bin_1-SNVs_out_g5
However, DESMAN will have problems resolving haplotypes from just two samples. I ran your data set. Running:
varFile='Bin_1-SNVs_outsel_var.csv'
eFile='Bin_1-SNVs_outtran_df.csv'
for g in 1 2 3 4
do
echo $g
for r in 0 1 2 3 4
do
echo $r
(desman $varFile -e $eFile -o Bin1-SNVs${g}_${r} -g $g -s $r)&
done
wait
done
This gave two haplotypes:
python $DESMAN//scripts/resolvenhap.py Bin_1-SNVs
2,2,0,0.0430988660524,Bin_1-SNVs_2_0/Filtered_Tau_star.csv
,0,1 H2_HCO3,0.18987866167777373,0.8101213383222263 Isoprene,0.18646651165477213,0.8135334883452279
But with no difference between your samples. So I would be cautious interpreting these results.
Thanks, Chris
Thanks Chris, that's more or less what I expected.
Even though I know the output might not be completely reliable, I'm still trying to continue so I can setup my Anvi'o to DESMAN workflow.
Following the tutorials from the Penryn workshop, I'm trying to use the GetVariantsCore.py
script but I'm a bit stuck as I don't use a completely standard input...
From the GetVariantsCore.py
I get that
contig_fasta_file
is a plain fasta with the contigs of my MAGgene_file
should be a csv with the coordinates of the gene callstau_file
is the Filtered_Tau_star.csv
from the best iteration of desman
(?)gene_list
is the list of genes with confirmed variants (the coregenes.txt
file?)What I'm not sure is the structure the gene_file
should have. I have looked into the code of the script and this is a sample of what I get:
,c_000000000039,2911,3135,11,1
,c_000000000039,3132,3263,12,1
,c_000000000041,2,547,13,0
,c_000000000041,684,950,14,0
,c_000000000041,1311,1451,15,0
,c_000000000041,1456,2379,16,0
,c_000000000041,2364,2996,17,0
,c_000000000041,3038,3904,18,0
,c_000000000041,4068,4274,19,0
,c_000000000042,2,223,20,0
,c_000000000042,272,541,21,0
With the columns in this order:
cog,contig,start,end,gene_id,strand
with strand 0 = forward and 1 = reverse (not sure about that).
As my input is not based on COGs I cannot provide one so I had to add the empty column (otherwise it complains).
I tried by including all genes or only the genes included in the coregenes.txt
in the gene_file
with the same output error:
Open contigs.fasta reading fasta
Open ISODEG_MAG_00002-SNVs_2_0/Filtered_Tau_star.csv reading tau
Traceback (most recent call last):
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1506, in _has_valid_type
error()
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1501, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1030] is not in the [index]'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 97, in main
gene_variants = tau_df.loc[gene]
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1626, in _getitem_axis
self._has_valid_type(key, axis)
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1514, in _has_valid_type
error()
File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1501, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1030] is not in the [index]'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 153, in <module>
main(sys.argv[1:])
File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 122, in main
for g in range(G):
TypeError: 'float' object cannot be interpreted as an integer
It says something about 1030 which is one of the genes, but it is present in all the input files (except the contigs.fasta).
Do you see anything unusual in what I'm doing that needs to be fixed?
Again, thanks a lot. Xabi
Has there been any continuation of this workflow? I would love to use anvi'o SNVs and an anvi'o bin as my starting data.
Hi Xabi,
I am doing that myself at the moment. When I am finished I could commit a workflow for that case? It won't be NextFlow though possibly just a bash script or ideally a snakemake. Have you exported the Anvio SNV profile correctly?
Best, Chris
Has anyone tried to import Desman results into Anvio for further analyses and visualization purposes instead? (don't know if it's worth to open a new thread for that)
Hi there,
after dealing with the Phylosift issue (#28), I was able to run the test data to completion without problem. Now I'm getting an error when DESMAN tries to estimate the strain count, not sure if could be due to the limited number of samples. Any idea?
Thank you in advance, Xabi