Open nick-youngblut opened 3 years ago
@nick-youngblut
Hi, Nick,
When comparing different sample files, each jgi file has a different totalAvgDepth, but when paste is used to merge them, only the value from the first sample is kept. Does this mean that the totalAvgDepth column has no effect?
When I run:
rm -rf results/06.multisplit_binning/bins/all.megahit.combined.out/vamb
mkdir -p results/06.multisplit_binning/bins/all.megahit.combined.out/
vamb \
--outdir results/06.multisplit_binning/bins/all.megahit.combined.out/vamb \
--fasta results/04.assembly/scaftigs/all.megahit.combined.out/all.megahit.combined.scaftigs.fa.gz \
--jgi results/06.multisplit_binning/coverage/all.megahit.combined.out/jgi.abundance.matrix.tsv \
-o C \
-m 2000 \
--minfasta 500000 \
2> results/06.multisplit_binning/logs/binning/all.megahit.vamb.binning.log
I always get this error:
Traceback (most recent call last):
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv/bin/vamb", line 11, in <module>
sys.exit(main())
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 528, in m
logfile=logfile)
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 247, in r
len(tnfs), minalignscore, minid, subprocesses, logfile)
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 121, in c
raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs
Info:
vamb 3.0.2
I wrote this script to convert the raw output from jgi_summarize_bam_contig_depths to a format that seems to work for vamb. I provide all bam files to jgi_summarize_bam_contig_depths at the same time, so the totalAvgDepth is calculated across all samples. For instance:
jgi_summarize_bam_contig_depths --outputDepth jgi_output.txt sample1.sort.bam sample2.sort.bam sample3.sort.bam
My script for formatting:
#!/usr/bin/env python
from __future__ import print_function
import argparse

class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
                      argparse.RawDescriptionHelpFormatter):
    pass

desc = 'Converting jgi_summarize_bam_contig_depths output to format used by VAMB'
epi = """DESCRIPTION:
Output format: contigName<tab>contigLen<tab>totalAvgDepth<tab>SAMPLE1.sort.bam<tab>Sample2.sort.bam<tab>...
Output written to STDOUT
"""
parser = argparse.ArgumentParser(description=desc, epilog=epi,
                                 formatter_class=CustomFormatter)
parser.add_argument('jgi_file', metavar='jgi_file', type=str,
                    help='jgi_summarize_bam_contig_depths output table')
parser.add_argument('--version', action='version', version='0.0.1')

def main(args):
    # parse the input table, keeping only the non-variance columns
    header = {}
    col2keep = ['contigName', 'contigLen', 'totalAvgDepth']
    with open(args.jgi_file) as inF:
        for i, line in enumerate(inF):
            line = line.rstrip().split('\t')
            if i == 0:
                header = {x: ii for ii, x in enumerate(line)}
                col2keep += [x for x in line if x.endswith('.bam')]
                print('\t'.join(col2keep))
                continue
            elif line[0] == '':
                continue
            # collect per-sample info for this contig
            out = [line[header[col]] for col in col2keep]
            print('\t'.join(out))

if __name__ == '__main__':
    args = parser.parse_args()
    main(args)
It seems to work with vamb, but maybe I'm missing some details in how the jgi-table should be formatted.
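As a quick sanity check of what the script keeps and drops, its core column-filtering logic can be exercised standalone on a toy jgi table (my own sketch; the `-var` variance columns are what jgi emits when run without --noIntraDepthVariance):

```python
# Standalone sketch of the script's column filtering: keep contigName,
# contigLen, totalAvgDepth, and every column whose name ends in ".bam",
# dropping the per-sample "-var" variance columns.
raw = ("contigName\tcontigLen\ttotalAvgDepth\t"
       "s1.sort.bam\ts1.sort.bam-var\ts2.sort.bam\ts2.sort.bam-var\n"
       "C1\t2500\t8.1\t3.1\t0.2\t5.0\t0.4\n")

lines = raw.rstrip('\n').split('\n')
fields = lines[0].split('\t')
header = {x: i for i, x in enumerate(fields)}
col2keep = ['contigName', 'contigLen', 'totalAvgDepth']
col2keep += [x for x in fields if x.endswith('.bam')]

out_header = '\t'.join(col2keep)
out_row = '\t'.join(lines[1].split('\t')[header[c]] for c in col2keep)
print(out_header)
print(out_row)  # → C1  2500  8.1  3.1  5.0
```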
@nick-youngblut Hi, Nick, good work!
@nick-youngblut To be sure: are you using normal (position-)sorted BAM files, not read-name-sorted ones, as described for direct input into vamb?
Pinging @simonrasmu, he surely knows.
@nick-youngblut To be sure: are you using normal (position-)sorted BAM files, not read-name-sorted ones, as described for direct input into vamb?
I'm using the same workflow as shown in https://github.com/RasmussenLab/vamb/blob/master/workflow/vamb.snake.conda.py, but I converted the rules cut_column1to3, cut_column4to5, and paste_abundances into the python script shown above. I also provide the bam files for all metagenome samples to jgi_summarize_bam_contig_depths.
Hi @nick-youngblut and @alienzj
Thanks for trying out the workflow! Vamb uses the standard jgi depth format as input; however, it only uses the individual sample depth columns, and, as you identified above, the "totalAvgDepth" column will be wrong when running via the Snakemake workflow. The reason for implementing it like this was to make the workflow as parallel as possible: it runs jgi-depth on each file in parallel and then combines the results using the cut_column1to3, cut_column4to5, and paste_abundances rules. This is particularly important when running on very large datasets with many samples, where jgi-depth on the entire contig and sample dataset can be very slow but runs fine on individual samples.
Hope it helps!
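Since those rules are referenced but not shown in the thread, here is a rough reconstruction of the combination step (my own sketch, not the actual Snakemake rules; the file names and the 4-column per-sample layout are assumptions):

```shell
# Hypothetical reconstruction of the cut/paste combination step.
# Assumes each per-sample jgi depth file has 4 columns:
# contigName, contigLen, totalAvgDepth, <sample>.bam
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\nC1\t2500\t3.1\t3.1\n' > s1.depth.txt
printf 'contigName\tcontigLen\ttotalAvgDepth\ts2.bam\nC1\t2500\t5.0\t5.0\n' > s2.depth.txt
cut -f1-3 s1.depth.txt > cols1to3.txt   # contig info + first sample's totalAvgDepth
cut -f4 s1.depth.txt > s1.col.txt       # per-sample depth column
cut -f4 s2.depth.txt > s2.col.txt
paste cols1to3.txt s1.col.txt s2.col.txt > jgi.abundance.matrix.tsv
cat jgi.abundance.matrix.tsv
```

Note that the totalAvgDepth column in the merged table comes from the first sample only, which is exactly why it is "wrong" in the combined output and why vamb ignores it.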
With regard to using read sorted or position sorted bams as input to Vamb:
Trying to figure that out as well. Looks like the loading is here: https://github.com/RasmussenLab/vamb/blob/734b741b85296377937de54166b7db274bc7ba9c/vamb/vambtools.py#L583
def _load_jgi(filehandle, minlength, refhash):
    "This function can be merged with load_jgi below in the next breaking release (post 3.0)"
    header = next(filehandle)
    fields = header.strip().split('\t')
    if not fields[:3] == ["contigName", "contigLen", "totalAvgDepth"]:
        raise ValueError('Input file format error: First columns should be "contigName,"'
                         '"contigLen" and "totalAvgDepth"')
    columns = tuple([i for i in range(3, len(fields)) if not fields[i].endswith("-var")])
    array = PushArray(_np.float32)
    identifiers = list()
    for row in filehandle:
        fields = row.split('\t')
        # We use float because very large numbers will be printed in scientific notation
        if float(fields[1]) < minlength:
            continue
        for col in columns:
            array.append(float(fields[col]))
        identifiers.append(fields[0])
    if refhash is not None:
        hash = _hash_refnames(identifiers)
        if hash != refhash:
            errormsg = ('JGI file has reference hash {}, expected {}. '
                        'Verify that all BAM headers and FASTA headers are '
                        'identical and in the same order.')
            raise ValueError(errormsg.format(hash.hex(), refhash.hex()))
    result = array.take()
    result.shape = (len(result) // len(columns), len(columns))
    return validate_input_array(result)

def load_jgi(filehandle):
    """Load depths from the --outputDepth of jgi_summarize_bam_contig_depths.
    See https://bitbucket.org/berkeleylab/metabat for more info on that program.

    Usage:
        with open('/path/to/jgi_depths.tsv') as file:
            depths = load_jgi(file)

    Input:
        File handle of open output depth file

    Output:
        N_contigs x N_samples Numpy matrix of dtype float32
    """
    return _load_jgi(filehandle, 0, None)
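To make the parsing above concrete, here is a simplified, self-contained re-implementation run on a toy table (my own sketch, not the vamb function: PushArray and the reference-hash check are omitted). It also shows where the "Length of TNFs and length of RPKM does not match" error presumably comes from: if the number of rows surviving the minlength filter here differs from the number of contigs kept from the FASTA, the two matrices have different lengths.

```python
# Simplified re-implementation of vamb's _load_jgi (a sketch, not the
# real function). Contigs shorter than minlength are skipped, and any
# "-var" variance columns are dropped before building the depth matrix.
import io
import numpy as np

def load_jgi_demo(filehandle, minlength=0):
    fields = next(filehandle).strip().split('\t')
    if fields[:3] != ["contigName", "contigLen", "totalAvgDepth"]:
        raise ValueError('First columns should be contigName, contigLen, totalAvgDepth')
    columns = [i for i in range(3, len(fields)) if not fields[i].endswith("-var")]
    rows = []
    for row in filehandle:
        parts = row.rstrip('\n').split('\t')
        if float(parts[1]) < minlength:   # contigLen filter
            continue
        rows.append([float(parts[c]) for c in columns])
    return np.array(rows, dtype=np.float32)

table = ("contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\n"
         "C1\t2500\t3.1\t3.1\t0.2\n"
         "C2\t1500\t1.0\t1.0\t0.1\n")
depths = load_jgi_demo(io.StringIO(table), minlength=2000)
print(depths.shape)  # → (1, 1): one contig passes, one depth column kept
```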
Does it matter what the order of samples is in the JGI coverage file?
I've concatenated a bunch of JGI coverage files together. Is it possible the order is off?
The order of the samples in the JGI coverage file "should not" matter: the data are read and analysed without using per-sample information. Whether one will actually get exactly the same results after shuffling the sample order, I am not sure.
While the workflow is very helpful, it doesn't actually state the required format for --jgi. It appears that a custom format is required, given that --noIntraDepthVariance is used for jgi_summarize_bam_contig_depths, and the output is then processed with 3 rules: cut_column1to3, cut_column4to5, and paste_abundances. It would be helpful to have an explicit description of the required input for --jgi, especially if the raw jgi_summarize_bam_contig_depths output table includes coverage info for multiple metagenome samples.
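For reference, piecing together the comments in this thread, the table passed to --jgi appears to need this shape: a header of contigName, contigLen, totalAvgDepth, followed by one depth column per sorted BAM, with no -var columns (the sample names and values below are made up for illustration):

```
contigName	contigLen	totalAvgDepth	sample1.sort.bam	sample2.sort.bam
k141_0	2500	8.1	3.1	5.0
k141_1	3200	2.4	1.0	1.4
```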