bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 353 forks source link

New VCF export of cnvkit & VCF-only intregration of SV caller results #953

Closed schelhorn closed 7 years ago

schelhorn commented 9 years ago

This is a feature request. It appears as if cnvkit now supports VCF export (one-sample-VCFs only, but that's fine). Since delly, lumpy, wham, and manta already support VCF export (not sure about cn.mops, but it still is maintained so that could be added), it may be time to start thinking about a full VCF-based workflow for these callers similar to what bcbio offers for micro variants based on GEMINI (which should support SV VCFs, AFAIK).

While the existing BED-based ensemble SV calling implemention already generates an integrated and filtered result of all callers that includes most of the relevant output, it currently does not offer ploidy/genotype information for insertions and deletions. Since this information seems relevant for interpreting LOF of clinical phenotypes, I am hoping that dumping everything into GEMINI (either prior to or after ensemble-filtering) would allow for a more complete and standard-compliant view on SV results.

chapmanb commented 9 years ago

Sven-Eric; Thanks for this. We'll definitely incorporate the new CNVkit VCF output to have a consistent set of outputs, then work on providing a single VCF that includes structural variants/CNVs. We can use this issue to track the work.

Regarding SV ensemble calling, we're going to explore moving away from our current implementation towards MetaSV, which integrates callers and provides re-assembly around potential SVs:

https://github.com/bioinform/metasv

It tested well in the Genome in a Bottle benchmarks and is a much smarter approach than bcbio's current simple approach. The main work will be around adding support for the callers we use and validating.

Regarding genotyping, this would be nice to have as well. While some callers report them, we're missing a validation set to evaluate this against. Practically we need to add in genotyping to lumpy (https://github.com/arq5x/lumpy-sv#post-processing) and then evaluate the output from MetaSV.

Thanks again for the ideas and discussion.

schelhorn commented 9 years ago

Wow, great! Having a single VCF would already be a tremendous help. Does it make sense to load it into GEMINI at all or is that less appropriate since the population normals and impact annotation databases cannot deal with SVs anyway?

Concerning metasv: that's definitely the way to go. Having micro-assembly around ensemble SVs is really fantastic and the code base and maintenance of metasv seems to be all good. The assembler dependency spades already is in homebrew-science, but AGE isn't and they are using a fork of it as of now. Anyway, I guess it will be some work to get it going, but it would upgrade bcbio a lot (again).

I'll even set up a bounty for it: I hereby promise delivery of one crate of beer (or any legal beverage of choice of about equal value) for the submitter of the bcbio commit that enables metasv in the variant2 workflow, including manta input and resulting in a single, filtered VCF file (or GEMINI db) in the upload dir.

schelhorn commented 9 years ago

Anyway, since I expect that the BED ensemble format will continue to be useful, I'd propose the following change to bcbio.structural.ensemble._vcf_to_bed() in order to also store genotype call and copy number in the BED file, as exported from VCF (in that manner, we store at least structure and copy/ploidy of the variation in BED, even we we disregard variation content):

def _vcf_to_bed(in_file, caller, out_file):
    if in_file and in_file.endswith((".vcf", "vcf.gz")):
        with utils.open_gzipsafe(in_file) as in_handle:
            with open(out_file, "w") as out_handle:
                for rec in vcf.Reader(in_handle, in_file):
                    if not rec.FILTER:
                        if (rec.samples[0].gt_type != 0 and
                              not (hasattr(rec.samples[0].data, "FT") and rec.samples[0].data.FT)):
                            is_hom_alt = rec.samples[0].gt_type == 2
                            is_het     = rec.samples[0].gt_type == 1
                            gt_call    = is_hom_alt and 'gt:hom' or (is_het and 'gt:het' or 'gt:NA')
                            gt_ploidy  = hasattr(rec.samples[0].data, 'CN') and ('cn:' + str(rec.samples[0].data.CN)) or 'cn:NA'
                            start      = rec.start - 1
                            end        = int(rec.INFO.get("END", rec.start))
                            if end - start < MAX_SVSIZE:
                                out_handle.write("\t".join([rec.CHROM, str(start), str(end),
                                                            "%s_%s_%s_%s" % (_get_svtype(rec), caller, gt_call, gt_ploidy)])
                                                 + "\n")

The BED-based cnvkit format could be adapted until VCF use for cnvkit is enabled (by cnvkit.py call Sample.cns -o Sample.call.cns; cnvkit.py export vcf Sample.call.cns -o Sample.vcf):

def _cnvkit_to_bed(in_file, caller, out_file):
    if not utils.file_uptodate(out_file, in_file):
        with file_transaction({}, out_file) as tx_out_file:
            with open(in_file) as in_handle:
                with open(tx_out_file, "w") as out_handle:
                    for line in in_handle:
                        chrom, start, end, sample, copyn = line.strip().split("\t")[:5]
                        out_handle.write("\t".join([chrom, start, end,
                                                    "CNV_%s_gt:NA_cn:%s" % (caller, copyn.replace(",", ";"))])
                                         + "\n")
    return out_file

Of course that would require some change of the fields looked at in the ensemble functions, since right now everything after the first _ is used to determine the identity of the variation caller.

schelhorn commented 9 years ago

I have done a quick check on the VCF genotype output capabilities of some of the SV callers supported by bcbio. It appears as if cnvkit is the only caller able to support the CN field for storing copy numbers. Although it also supports traditional genotype (zygosity) calls, cnvkit seems to presuppose that all genotype calls are heterozygous.

Regarding the breakpoint callers, delly and manta seem fine with regard to genotype calls (supporting both homozygous ALTs and heterozygous ALTs), while lumpy does not support genotype calls but requires an external tool to do so:

SVTyper can call genotypes on LUMPY output VCF files using a Bayesian maximum likelihood algorithm.

delly, manta and lumpy do not seem to support CN in their VCF output.

chapmanb commented 9 years ago

Sven-Eric; Thanks for all these thoughts, I wanted to wait until finalizing MetaSV validations before following up in order to come up with a plan moving forward. Here is a validation of MetaSV from the current development version of bcbio:

http://imgur.com/a/Gajsg

It's looking great, and I think is the right way forward since we can contribute to a community project instead of building custom extensions on bcbio's overlap based method.

So instead of extending our BED format, we'll plan to move to using VCF for all of the callers and use MetaSV as the option for producing an ensemble callset. This will let us avoid inventing our own syntax in the BED name format, although we'll need to deal with VCF inconsistencies.

After figuring out the VCF genotype issues you noted for CNVkit, we'll have support for VCF for all callers and can continue to push forward with this. Thanks again for all of the interest and work on this.

schelhorn commented 9 years ago

That seems to make a lot of sense and I especially appreciate the quick implementation, Brad. I'll be first to test anything you care to push to dev.

schelhorn commented 9 years ago

After figuring out the VCF genotype issues you noted for CNVkit, we'll have support for VCF for all callers and can continue to push forward with this.

@etal kindly adapted the cnvkit sources to deal with the genotype issue. But no pressure.

etal commented 9 years ago

Well, for diploid chromosomes. I'll handle haploid and polyploid chromosomes today.

MetaSV currently doesn't have a wrapper for CNVkit, though this should be easy to add, but also I don't see support in general for targeted captures. I'll ask the MetaSV folks if they want to go there.

CNVkit's performance in the WGS benchmarks looks ghastly. I could look into tuning it -- is there any code I'd need to run these benchmarks outside bcbio-nextgen? I'm also wondering how CNVnator and cn.MOPS perform on the same data. It might be best to just not run CNVkit on WGS data if the options already in MetaSV are much better.

chapmanb commented 9 years ago

Eric; Thanks so much for the VCF updates. I'll work on pulling these into bcbio and making the VCF output the default so we can have a consistent set of output formats.

I have a pull request in that adds CNVkit support to MetaSV as part of this:

https://github.com/bioinform/metasv/pull/55

I don't think we need to do anything special for targeted support, as it's combining existing calls and then re-checking ends. So if we do the calling on targeted regions outside of it and feed those in we should be fine.

Regarding validation, I'm not sure how we expect CNV callers to look on this type of validation data. The ensemble MetaSV calls include CNVkit as input so my main goal so far is to have that be producing sensitive and precise output. I haven't investigate cn.mops on this as we've been more focused on CNVkit integration. If it helps I'm happy to dig into the inconsistent calls and see if I can turn up any patterns where we can filter/process better.

lpantano commented 7 years ago

Hi

I am closing this because it seems an old issue and it seems this is already part of the pipeline.

Come back if you find other issues or want to continue with this one.

cheers