RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

format of jgi input #48

Open nick-youngblut opened 3 years ago

nick-youngblut commented 3 years ago

While the workflow is very helpful, it doesn't actually state the required format for --jgi. It appears that a custom format is required, given that jgi_summarize_bam_contig_depths is run with --noIntraDepthVariance, and the output is then processed with three rules: cut_column1to3, cut_column4to5, and paste_abundances.

It would be helpful to have an explicit description of what the required input is for --jgi, especially if the raw jgi_summarize_bam_contig_depths output table includes coverage info for multiple metagenome samples.

alienzj commented 3 years ago

@nick-youngblut Hi Nick, when comparing files from different samples, each per-sample jgi file has a different totalAvgDepth, but when they are merged with paste, only the totalAvgDepth of the first sample is kept. Does this mean the totalAvgDepth column has no effect?

alienzj commented 3 years ago

When I run:

rm -rf results/06.multisplit_binning/bins/all.megahit.combined.out/vamb
mkdir -p results/06.multisplit_binning/bins/all.megahit.combined.out/

vamb \
--outdir results/06.multisplit_binning/bins/all.megahit.combined.out/vamb \
--fasta results/04.assembly/scaftigs/all.megahit.combined.out/all.megahit.combined.scaftigs.fa.gz \
--jgi results/06.multisplit_binning/coverage/all.megahit.combined.out/jgi.abundance.matrix.tsv \
-o C \
-m 2000 \
--minfasta 500000 \
2> results/06.multisplit_binning/logs/binning/all.megahit.vamb.binning.log

I always get this error:

Traceback (most recent call last):
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv/bin/vamb", line 11, in <module>
sys.exit(main())
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 528, in m
logfile=logfile)
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 247, in r
len(tnfs), minalignscore, minid, subprocesses, logfile)
File "/ldfssz1/ST_META/share/User/zhujie/.conda/envs/bioenv3.7/lib/python3.7/site-packages/vamb/__main__.py", line 121, in c
raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Info:

vamb 3.0.2
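
That ValueError usually means the contig set parsed from --fasta (after the -m length filter) and the rows in the --jgi table disagree. One way to diagnose it is to compare the two counts directly; this is a standalone sketch, not part of vamb, and all names and data below are made up:

```python
# Hedged diagnostic sketch (not vamb code): count contigs passing the length
# filter in the FASTA and compare with rows passing the same filter in the
# jgi depth table. The two counts must match for vamb to proceed.

def fasta_records(lines):
    """Yield (name, length) for each record in FASTA-formatted lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, length
            name, length = line[1:].split()[0], 0
        else:
            length += len(line)
    if name is not None:
        yield name, length

minlen = 2000
fasta = ">C1\n" + "ACGT" * 625 + "\n>C2\nACGT\n"   # C1 is 2500 bp, C2 is 4 bp
jgi_rows = [("C1", 2500), ("C2", 4)]               # (contigName, contigLen)

n_fasta = sum(1 for _, ln in fasta_records(fasta.splitlines()) if ln >= minlen)
n_jgi = sum(1 for _, ln in jgi_rows if ln >= minlen)
print(n_fasta, n_jgi, n_fasta == n_jgi)  # 1 1 True -- counts must match
```

If the two counts differ, check that the jgi table was built from the same assembly as --fasta and that no contigs were dropped or duplicated when pasting the per-sample tables together.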
nick-youngblut commented 3 years ago

I wrote this script to convert the raw output from jgi_summarize_bam_contig_depths to a format that seems to work for vamb. I provide all bam files to jgi_summarize_bam_contig_depths at the same time, so the totalAvgDepth is calculated for all samples. For instance:

jgi_summarize_bam_contig_depths --outputDepth jgi_output.txt sample1.sort.bam sample2.sort.bam sample3.sort.bam
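
For reference, the raw table produced that way (without --noIntraDepthVariance) looks roughly like this; it is tab-separated, and the values below are purely illustrative:

```
contigName  contigLen  totalAvgDepth  sample1.sort.bam  sample1.sort.bam-var  sample2.sort.bam  sample2.sort.bam-var
C1          2500       4.05           3.1               0.21                  5.0               0.35
```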

My script for formatting:

#!/usr/bin/env python
from __future__ import print_function
import argparse
import logging

logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.DEBUG)
class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
                      argparse.RawDescriptionHelpFormatter):
    pass

desc = 'Converting jgi_summarize_bam_contig_depths output to format used by VAMB'
epi = """DESCRIPTION:
Output format: contigName<tab>contigLen<tab>totalAvgDepth<tab>SAMPLE1.sort.bam<tab>Sample2.sort.bam<tab>...
Output written to STDOUT
"""
parser = argparse.ArgumentParser(description=desc, epilog=epi,
                                 formatter_class=CustomFormatter)
parser.add_argument('jgi_file', metavar='jgi_file', type=str,
                    help='jgi_summarize_bam_contig_depths output table')
parser.add_argument('--version', action='version', version='0.0.1')

def main(args):
    # parsing input
    header = {}
    col2keep = ['contigName', 'contigLen', 'totalAvgDepth']
    with open(args.jgi_file) as inF:
        for i,line in enumerate(inF):
            line = line.rstrip().split('\t')
            if i == 0:
                header = {x:ii for ii,x in enumerate(line)}
                col2keep += [x for x in line if x.endswith('.bam')]
                print('\t'.join(col2keep))
                continue
            elif line[0] == '':
                continue
            # collect per-sample info
            out = []
            for col in col2keep:
                out.append(line[header[col]])
            print('\t'.join(out))

if __name__ == '__main__':
    args = parser.parse_args()
    main(args)

It seems to work with vamb, but maybe I'm missing some details in how the jgi-table should be formatted.

alienzj commented 3 years ago

@nick-youngblut Hi Nick, good work!

SilasK commented 3 years ago

@nick-youngblut To be sure: are you using normal (position-)sorted BAMs, not name-sorted ones, as described for direct input into vamb?

jakobnissen commented 3 years ago

Pinging @simonrasmu, he surely knows.

nick-youngblut commented 3 years ago

@nick-youngblut To be sure: are you using normal (position-)sorted BAMs, not name-sorted ones, as described for direct input into vamb?

I'm using the same workflow as shown in https://github.com/RasmussenLab/vamb/blob/master/workflow/vamb.snake.conda.py, but I converted the rules cut_column1to3, cut_column4to5, and paste_abundances into that python script shown above. I also provide bam files for all metagenome samples to jgi_summarize_bam_contig_depths.

simonrasmu commented 3 years ago

Hi @nick-youngblut and @alienzj

Thanks for trying out the workflow! Vamb uses the standard jgi depth format as input; however, it only uses the individual per-sample depth columns, and as you identified above, the totalAvgDepth column will be wrong when running it via the Snakemake workflow. The reason for implementing it like this was to make the workflow as parallel as possible: it runs jgi-depth on each BAM file in parallel and then combines the results using cut_column1to3, cut_column4to5, and paste_abundances. This is particularly important for very large datasets with many samples, where running jgi-depth on the entire contig and sample set at once can be very slow, even though the per-sample runs finish fine.
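
The combination step can be sketched roughly like this; the file names are hypothetical and the real rules live in the Snakemake workflow, but the cut/paste logic is the same:

```shell
# Two toy per-sample depth tables, as produced by running
# jgi_summarize_bam_contig_depths separately per sample (names made up):
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\nC1\t2500\t3.1\t3.1\n' > s1.jgi
printf 'contigName\tcontigLen\ttotalAvgDepth\ts2.bam\nC1\t2500\t5.0\t5.0\n' > s2.jgi

# cut_column1to3: keep contigName/contigLen/totalAvgDepth from the first file only
cut -f1-3 s1.jgi > columns1to3.tsv
# keep each sample's own depth column
cut -f4 s1.jgi > s1.depth
cut -f4 s2.jgi > s2.depth
# paste_abundances: glue everything into the matrix vamb reads; the
# totalAvgDepth column therefore comes from sample 1 alone, but vamb
# ignores that column anyway
paste columns1to3.tsv s1.depth s2.depth > jgi.abundance.matrix.tsv
cat jgi.abundance.matrix.tsv
```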

Hope it helps!

simonrasmu commented 3 years ago

With regard to using read-sorted or position-sorted BAMs as input to Vamb:

jolespin commented 3 years ago

Trying to figure that out as well. Looks like the loading is here: https://github.com/RasmussenLab/vamb/blob/734b741b85296377937de54166b7db274bc7ba9c/vamb/vambtools.py#L583

def _load_jgi(filehandle, minlength, refhash):
    "This function can be merged with load_jgi below in the next breaking release (post 3.0)"
    header = next(filehandle)
    fields = header.strip().split('\t')
    if not fields[:3] == ["contigName", "contigLen", "totalAvgDepth"]:
        raise ValueError('Input file format error: First columns should be "contigName,"'
        '"contigLen" and "totalAvgDepth"')

    columns = tuple([i for i in range(3, len(fields)) if not fields[i].endswith("-var")])
    array = PushArray(_np.float32)
    identifiers = list()

    for row in filehandle:
        fields = row.split('\t')
        # We use float because very large numbers will be printed in scientific notation
        if float(fields[1]) < minlength:
            continue

        for col in columns:
            array.append(float(fields[col]))

        identifiers.append(fields[0])

    if refhash is not None:
        hash = _hash_refnames(identifiers)
        if hash != refhash:
            errormsg = ('JGI file has reference hash {}, expected {}. '
                        'Verify that all BAM headers and FASTA headers are '
                        'identical and in the same order.')
            raise ValueError(errormsg.format(hash.hex(), refhash.hex()))

    result = array.take()
    result.shape = (len(result) // len(columns), len(columns))
    return validate_input_array(result)

def load_jgi(filehandle):
    """Load depths from the --outputDepth of jgi_summarize_bam_contig_depths.
    See https://bitbucket.org/berkeleylab/metabat for more info on that program.

    Usage:
        with open('/path/to/jgi_depths.tsv') as file:
            depths = load_jgi(file)
    Input:
        File handle of open output depth file
    Output:
        N_contigs x N_samples Numpy matrix of dtype float32
    """
    return _load_jgi(filehandle, 0, None)
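
So the loader requires the first three headers to be contigName/contigLen/totalAvgDepth and reads every later column whose header does not end in "-var" as a sample depth. A standalone sketch mimicking that behavior (not vamb's own API; the table below is a toy example):

```python
# Minimal re-implementation of the header/column logic above, for illustration.
import io
import numpy as np

def load_depths(handle):
    fields = next(handle).rstrip('\n').split('\t')
    assert fields[:3] == ["contigName", "contigLen", "totalAvgDepth"]
    # keep only per-sample depth columns, skipping the "-var" variance columns
    cols = [i for i in range(3, len(fields)) if not fields[i].endswith("-var")]
    rows = [[float(parts[i]) for i in cols]
            for parts in (line.rstrip('\n').split('\t') for line in handle)]
    return np.array(rows, dtype=np.float32)

tsv = ("contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\ts2.bam\ts2.bam-var\n"
       "C1\t2500\t4.05\t3.1\t0.21\t5.0\t0.35\n")
depths = load_depths(io.StringIO(tsv))
print(depths.shape)  # (1, 2): one contig, two samples; -var columns dropped
```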
jolespin commented 3 years ago

Does it matter what the order of samples is in the JGI coverage file?

I've concatenated a bunch of JGI coverage files together. Is it possible the order is off?

simonrasmu commented 3 years ago

The order of the samples in the JGI coverage file "should not" matter: Vamb reads the data and runs the analysis without using any per-sample information. Whether you would get exactly the same results after shuffling the sample order, I am actually not sure.