chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes
Other
69 stars 22 forks source link

ERROR estimateStrainCountDesman (3) with desmanflow #30

Open xvazquezc opened 6 years ago

xvazquezc commented 6 years ago

Hi there,

after dealing with the Phylosift issue (#28), I was able to run the test data to completion without problem. Now I'm getting an error when DESMAN tries to estimate the strain count, not sure if could be due to the limited number of samples. Any idea?

Thank you in advance, Xabi

$ ~/DESMAN/scripts/desmanflow.nf --speciescontigs=contig_list.txt --assembly=../../final.contigs.1000.fa --inputreads=../../../../fastq/
N E X T F L O W  ~  version 0.31.1
Launching `/home/xabi/DESMAN/scripts/desmanflow.nf` [cranky_varahamihira] - revision: 46958c6af8
[h2hco3, [/home/xabi/Desktop/SBI_projects/miriam/clive/fastq/h2hco3.r1.fq.gz, /home/xabi/Desktop/SBI_projects/miriam/clive/fastq/h2hco3.r2.fq.gz]]
[isoprene, [/home/xabi/Desktop/SBI_projects/miriam/clive/fastq/isoprene.r1.fq.gz, /home/xabi/Desktop/SBI_projects/miriam/clive/fastq/isoprene.r2.fq.gz]]
[warm up] executor > local
[6b/70844a] Submitted process > mapReads (1)
[02/5f5bb8] Submitted process > mapReads (2)
[86/6f4b77] Submitted process > findEliteGenes
[01/5a43f8] Submitted process > elitePileups (1)
[4e/32277b] Submitted process > elitePileups (2)
[66/107860] Submitted process > callEliteVariants
[b0/7ae068] Submitted process > estimateStrainCountDesman (1)
[53/028d56] Submitted process > estimateStrainCountDesman (2)
[5e/a2e284] Submitted process > estimateStrainCountDesman (4)
[06/a14411] Submitted process > estimateStrainCountDesman (3)
[a4/00e91f] Submitted process > estimateStrainCountDesman (5)
[14/6a2e2a] Submitted process > estimateStrainCountDesman (7)
[24/60858b] Submitted process > estimateStrainCountDesman (8)
[18/09f61f] Submitted process > estimateStrainCountDesman (6)
[a7/5977de] Submitted process > estimateStrainCountDesman (13)
ERROR ~ Error executing process > 'estimateStrainCountDesman (3)'

Caused by:
  Process `estimateStrainCountDesman (3)` terminated with an error exit status (1)

Command executed:

  /home/xabi/DESMAN/bin/desman outputsel_var.csv -e outputtran_df.csv -o cluster_4_1 -r 1000 -i 100 -g 4 -s 1 > cluster_4_1.out
  cp */fit.txt fit_4_1.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  Up and running. Check cluster_4_1/log_file.txt for progress
  /home/xabi/.local/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
    out=out, **kwargs)
  /home/xabi/.local/lib/python3.5/site-packages/numpy/core/_methods.py:73: RuntimeWarning: invalid value encountered in true_divide
    ret, rcount, out=ret, casting='unsafe', subok=False)
  cp: cannot stat '*/fit.txt': No such file or directory

Work dir:
  /home/xabi/Desktop/SBI_projects/miriam/clive/megahit_coassembly/anvio/MAGs/Bin_1/work/06/a14411cfc1fc6ce73eaf28658394b9

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (4)
chrisquince commented 6 years ago

Hi,

It could be, it should run if there is at least one sample though. I did not write the Phylosift pipeline, Aaron Darling did. However, if you could upload the files outputsel_var.csv and outputtran_df.csv I can at least see if they are valid DESMAN input.

Thanks, Chris

xvazquezc commented 6 years ago

Hi Chris, these are the files. outputtran_df.csv seems fine, but outputsel_var.csv only has the header. The files correspond to one of the clusters, but for what I could see, every cluster folder looks the same.

outputsel_var.csv.gz outputtran_df.csv.gz

koadman commented 6 years ago

hi Xabi, Chris just pointed me to your message, sorry I didn't see it earlier. Do you know if your dataset has any variants in these marker genes? It seems like the variant file is empty, possibly because there were no high quality variant sites for it. How big is your dataset and do you have any prior idea of the species & strain complexity of the sample? It's also possible that something else has gone wrong, so if you're sure there are strains we should dig into the steps upstream of where it broke to understand why.

xvazquezc commented 6 years ago

Hi @koadman,

the MAG I'm running has about 1.4 SNVs per kb based on anvi'o's output (0 would indicate a clonal population). I manually searched about 10 of the COGs used on the automated DESMAN workflow and I couldn't find a single SNV, probably because they are too conserved and the strains are too closely related. I guess that it will be the problem.

The metagenome assembly is based on 2 samples and this MAG has an overall coverage of ~200x. Most of the SNVs correspond to ~10% (5-20% range) of the mapped bases. The fact that they don't appear on highly conserved genes and that the proportion of constant bases is quite stable makes me think that the different bases aren't due to sequencing errors.

@chrisquince, is there a way to use the SNV table generated with the --quince-mode in anvi'o as input for DESMAN? That would probably make things easier for me.

Cheers, Xabi

chrisquince commented 6 years ago

Hi Xavi,

You can use the standard SNV table from anvi'o with Desman but the output from quince mode is probably better since I believe it contains all base positions. You just need to convert the anvio table into Desman format. I do have a script for that but because the anvi'o fields can be variable it may require tweaking for your case. Can you send me say the first 1000 lines of the anvi'o output and I can test it?

Best, Chris

xvazquezc commented 6 years ago

Thanks Chris, here it is.

Do you have the script available anywhere by any chance?

Bin_1-SNVs_head.txt

chrisquince commented 6 years ago

Hi Xabi, This is the script but it will likely need adapting. I should write a generalised version. Let me look at this. There are no gene calls on this genome then? Thanks, Chris

entry_id unique_pos_identifier gene_length sample_id pos pos_in_contig corresponding_gene_call in_partial_gene_call in_complete_gene_call base_pos_in_codon codon_order_in_gene coverage cov_outlier_in_split cov_outlier_in_contig departure_from_reference competing_nts reference A T C G N consensus departure_from_consensus n2n1ratio

import sys, getopt import os import pandas as p import numpy as np import scipy.stats as ss import scipy as sp import scipy.misc as spm import math import argparse

from collections import defaultdict

def main(argv): parser = argparse.ArgumentParser()

parser.add_argument("anvio_file", help="ANVIO format frequencies")

args = parser.parse_args()

anvio_file = args.anvio_file

import ipdb; ipdb.set_trace()

sample_pos = defaultdict(lambda: defaultdict(dict))
samples = set()
genes = set()
posns = defaultdict(set)
with open(args.anvio_file) as f:
    next(f)        
    for line in f:
        line = line.rstrip()
        tokens = line.split('\t')
        sample_id = tokens[3]
        gene_id = tokens[29]
        pos = int(tokens[4])
        if sample_id not in samples:
            samples.add(sample_id)

        posns[gene_id].add(pos)
        genes.add(gene_id)
        freqs = (int(tokens[17]),int(tokens[19]), int(tokens[20]),int(tokens[18]))
        sample_pos[gene_id][pos][sample_id] = freqs

samples = list(samples)
samples.sort()
genes = list(genes)
genes.sort()

#sString = ",".join(samples)

expanded = []
for name in samples:
    expanded.append(name + "-A")
    expanded.append(name + "-C")
    expanded.append(name + "-G")
    expanded.append(name + "-T")

eString = ",".join(expanded)
print "Gene,Posn," + eString
for gene in genes:
    posns_gene = list(posns[gene])    
    posns_gene.sort()

    for posn in posns_gene:
        outString = gene + "," + repr(posn) 
        for sample in samples: 
            if sample in sample_pos[gene][posn]:
                freqs = sample_pos[gene][posn][sample]
            else:
                freqs = (0,0,0,0)
            freqs2 = [repr(f) for f in freqs]
            fstring = ",".join(freqs2)
            outString = outString + "," + fstring
        print outString

if name == "main": main(sys.argv[1:])

xvazquezc commented 6 years ago

Hi Chris, oh, I see what happened. The SNV file starts with the SNVs on non-coding regions (~1500 lines).

This one has the header and the last 1000 lines. Bin_1-SNVs_tail.txt

Thanks

chrisquince commented 6 years ago

Hi,

This script produces DESMAN compatible output but the frequencies look a bit odd. Also without gene calls the test for median coverage employed by DESMAN will not work. To run change extension to .py and then:

python ./ConvertANVIO.py Bin_1-SNVs_tail.txt

Thanks, Chris

ConvertANVIO.txt

xvazquezc commented 6 years ago

Hi Chris, Thanks for the script. The gene calls are in the corresponding_gene_call column in the sample data.

Just changing L35 to gene_id = tokens[5] does the trick.

By the way, should I remove the SNV calls from non-coding regions? Without them I still get 722 genes with a total of >5000 SNV positions.

Cheers, Xabi

xvazquezc commented 6 years ago

I got to the step of running desman itself but it gets stuck when tries to start the Gibbs burn-in Terminal:

(desman) $ desman Bin_1-SNVs_outsel_var.csv -g 5 -i 50 -e Bin_1-SNVs_outr_df.csv -o Bin_1-SNVs_out_g5 
Up and running. Check /home/xabi/Desktop/SBI_projects/miriam/Bin_1-SNVs_out_g5/log_file.txt for progress
Traceback (most recent call last):
  File "/home/xabi/anaconda2/envs/desman/bin/desman", line 246, in <module>
    main(sys.argv[1:])
  File "/home/xabi/anaconda2/envs/desman/bin/desman", line 149, in main
    haplo_SNP.update()
  File "/home/xabi/.local/lib/python3.5/site-packages/desman-2.1.1-py3.5-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 336, in update
    self.ll = self.logLikelihood(self.gamma,self.tau,self.eta)
  File "/home/xabi/.local/lib/python3.5/site-packages/desman-2.1.1-py3.5-linux-x86_64.egg/desman/HaploSNP_Sampler.py", line 435, in logLikelihood
    probVS = np.einsum('ijk,lj,km->ilm',cTau,cGamma,cEta)
  File "/home/xabi/.local/lib/python3.5/site-packages/numpy/core/einsumfunc.py", line 1069, in einsum
    return c_einsum(*operands, **kwargs)
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2822,5,4)->(2822,newaxis,newaxis,5,4) (2,5)->(2,newaxis,5,newaxis) (5263,2)->(2,newaxis,5263) 

Log:

2018-09-28 10:35:18,063:INFO:root:Results created in /home/xabi/Desktop/SBI_projects/miriam/Bin_1-SNVs_out_g5
2018-09-28 10:35:18,063:INFO:root:Set fixed seed for random position selection = 238329
2018-09-28 10:35:18,074:INFO:root:Running Desman with 2 samples and 2822 variant positions finding 5 genomes.
2018-09-28 10:35:18,074:INFO:root:Set eta error transition matrix from = <_io.TextIOWrapper name='Bin_1-SNVs_outr_df.csv' mode='r' encoding='UTF-8'>
2018-09-28 10:35:18,078:INFO:root:Set second adjustable random seed = 23724839
2018-09-28 10:35:18,086:INFO:root:Perform NTF initialisation
2018-09-28 10:35:18,250:INFO:root:NTF Iter 0, div = 7033.098054
2018-09-28 10:35:23,079:INFO:root:NTF Iter 100, div = 27.149127
2018-09-28 10:35:28,126:INFO:root:NTF Iter 200, div = 26.413665
2018-09-28 10:35:33,046:INFO:root:NTF Iter 300, div = 25.664004
2018-09-28 10:35:37,922:INFO:root:NTF Iter 400, div = 24.269144
2018-09-28 10:35:42,835:INFO:root:NTF Iter 500, div = 22.555659
2018-09-28 10:35:47,722:INFO:root:NTF Iter 600, div = 20.267187
2018-09-28 10:35:52,605:INFO:root:NTF Iter 700, div = 16.463571
2018-09-28 10:35:57,485:INFO:root:NTF Iter 800, div = 12.501173
2018-09-28 10:36:02,373:INFO:root:NTF Iter 900, div = 7.588231
2018-09-28 10:36:07,234:INFO:root:NTF Iter 1000, div = 2.809085
2018-09-28 10:36:12,123:INFO:root:NTF Iter 1100, div = 0.659918
2018-09-28 10:36:17,012:INFO:root:NTF Iter 1200, div = 0.268430
2018-09-28 10:36:21,930:INFO:root:NTF Iter 1300, div = 0.147414
2018-09-28 10:36:26,919:INFO:root:NTF Iter 1400, div = 0.083881
2018-09-28 10:36:31,835:INFO:root:NTF Iter 1500, div = 0.054996
2018-09-28 10:36:36,732:INFO:root:NTF Iter 1600, div = 0.038581
2018-09-28 10:36:41,624:INFO:root:NTF Iter 1700, div = 0.028583
2018-09-28 10:36:46,453:INFO:root:NTF Iter 1800, div = 0.020659
2018-09-28 10:36:51,300:INFO:root:NTF Iter 1900, div = 0.015106
2018-09-28 10:36:56,135:INFO:root:NTF Iter 2000, div = 0.011695
2018-09-28 10:37:00,973:INFO:root:NTF Iter 2100, div = 0.009690
2018-09-28 10:37:05,833:INFO:root:NTF Iter 2200, div = 0.007904
2018-09-28 10:37:10,654:INFO:root:NTF Iter 2300, div = 0.006540
2018-09-28 10:37:15,171:INFO:root:Start Gibbs sampler burn-in phase

Any idea what can be going on? I removed the non-coding SNVs and run like this beforehand:

Variant_Filter.py Bin_1-SNVs_desman-nonc.csv -o Bin_1-SNVs_out -p
chrisquince commented 6 years ago

Hi Xabi, Can you upload your variants file Bin_1-SNVs_desman-nonc.csv and I will try running it? Thanks, Chris

xvazquezc commented 6 years ago

Here you have. Thank you. Bin_1-SNVs_desman-nonc.txt

xvazquezc commented 6 years ago

Hi @chrisquince I was wondering if you have had time to take a look. Thanks, Xabi

chrisquince commented 5 years ago

Dear Xabi,

Sorry for the delay responding. The problem is that you are using the wrong input file to DESMAN. Your command above should have been:

desman Bin_1-SNVs_outsel_var.csv -g 5 -i 50 -e Bin_1-SNVs_outtran_df.csv -o Bin_1-SNVs_out_g5

However, DESMAN will have problems resolving haplotypes from just two samples. I ran your data set. Running:

varFile='Bin_1-SNVs_outsel_var.csv'

eFile='Bin_1-SNVs_outtran_df.csv'

for g in 1 2 3 4
do echo $g for r in 0 1 2 3 4 do echo $r (desman $varFile -e $eFile -o Bin1-SNVs${g}_${r} -g $g -s $r)& done wait done

This gave two haplotypes:

python $DESMAN//scripts/resolvenhap.py Bin_1-SNVs

2,2,0,0.0430988660524,Bin_1-SNVs_2_0/Filtered_Tau_star.csv

,0,1 H2_HCO3,0.18987866167777373,0.8101213383222263 Isoprene,0.18646651165477213,0.8135334883452279

But with no difference between your samples. So I would be cautious interpreting these results.

Thanks, Chris

xvazquezc commented 5 years ago

Thanks Chris, that's more or less what I expected.

Even though I know the output might not be completely reliable, I'm still trying to continue so I can setup my Anvi'o to DESMAN workflow. Following the tutorials from the Penryn workshop, I'm trying to use the GetVariantsCore.py script but I'm a bit stuck as I don't use a completely standard input...

From the GetVariantsCore.py I get that

What I'm not sure is the structure the gene_file should have. I have looked into the code of the script and this is a sample of what I get:

,c_000000000039,2911,3135,11,1
,c_000000000039,3132,3263,12,1
,c_000000000041,2,547,13,0
,c_000000000041,684,950,14,0
,c_000000000041,1311,1451,15,0
,c_000000000041,1456,2379,16,0
,c_000000000041,2364,2996,17,0
,c_000000000041,3038,3904,18,0
,c_000000000041,4068,4274,19,0
,c_000000000042,2,223,20,0
,c_000000000042,272,541,21,0

With the columns in this order:

cog,contig,start,end,gene_id,strand

with strand 0 = forward and 1 = reverse (not sure about that).

As my input is not based on COGs I cannot provide one so I had to add the empty column (otherwise it complains). I tried by including all genes or only the genes included in the coregenes.txt in the gene_file with the same output error:

Open contigs.fasta reading fasta
Open ISODEG_MAG_00002-SNVs_2_0/Filtered_Tau_star.csv reading tau
Traceback (most recent call last):
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1506, in _has_valid_type
    error()
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1501, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1030] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 97, in main
    gene_variants = tau_df.loc[gene]
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1626, in _getitem_axis
    self._has_valid_type(key, axis)
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1514, in _has_valid_type
    error()
  File "/home/xabi/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1501, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1030] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 153, in <module>
    main(sys.argv[1:])
  File "/home/xabi/anaconda2/envs/desman/bin/GetVariantsCore.py", line 122, in main
    for g in range(G):
TypeError: 'float' object cannot be interpreted as an integer

It says something about 1030 which is one of the genes, but it is present in all the input files (except the contigs.fasta).

Do you see anything unusual in what I'm doing that needs to be fixed?

Again, thanks a lot. Xabi

osvatic commented 5 years ago

Has there been any continuation of this workflow? I would love to use anvi'o SNVs and an anvi'o bin as my starting data.

chrisquince commented 5 years ago

Hi Xabi,

I am doing that myself at the moment. When I am finished I could commit a workflow for that case? It won't be NextFlow though possibly just a bash script or ideally a snakemake. Have you exported the Anvio SNV profile correctly?

Best, Chris

marcomeola commented 3 years ago

Has anyone tried to import Desman results into Anvio for further analyses and visualization purposes instead? (don't know if it's worth to open a new thread for that)