Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
453 stars 151 forks source link

Feature request: convert VCF to TAB #346

Closed fgvieira closed 1 week ago

fgvieira commented 5 years ago

It would be nice if there was a way to convert VCFs annotated by VEP to tab format. It could be just adding the option -tab to filter_vep (and the user could just run it without filters if just wanted to convert).

I know VEP can output in TAB (--tab), but sometimes you just don't want to keep annotating files just to convert the format.

ens-lgil commented 5 years ago

Dear @fgvieira,

Thank you for the feature suggestion! We keep track of the users suggestions and the frequently requested features are often implemented. Bear in mind that it might take some time to implement it, mainly because of other features already planned and our release cycle (~3 months).

Best regards, Laurent

fgvieira commented 5 years ago

I've been looking into this, and it seems it should be possible to make using Ensemble VEP perl modules. However, I've been having some issues getting it to properly read the VCF. Here is what I have so far:

use Bio::EnsEMBL::VEP::Config;
use Bio::EnsEMBL::VEP::Parser::VCF;
use Bio::EnsEMBL::VEP::InputBuffer;
use Bio::EnsEMBL::VEP::OutputFactory::JSON;

my $cfg = Bio::EnsEMBL::VEP::Config->new();

my $vcf_file = 'test.vcf';
my $p = Bio::EnsEMBL::VEP::Parser::VCF->new({config => $cfg, file => $vcf_file});
my $ib = Bio::EnsEMBL::VEP::InputBuffer->new({config => $cfg, parser => $p});

my $of = Bio::EnsEMBL::VEP::OutputFactory::Tab->new({config => $cfg,});
# Print headers                                                                                                                                                                                                
print "$_\n" for @{$of->headers};
# Print output                                                                                                                                                                                                     
print "$_\n" for @{$of->get_all_lines_by_InputBuffer($ib)};

Am I missing something? Is the $cfg variable needed? If so, what should it contain?

thanks,

at7 commented 5 years ago

Have you tried vcf-to-tab which is part of VCFtools?

willmclaren commented 5 years ago

Neither your code nor vcf-to-tab will convert VCF-encoded VEP output to the default tab-delimited VEP output, which is what I assume @fgvieira requires.

Your best bet for starting some code would be to look at filter_vep, which contains code for parsing the VEP results encoded in VCF.

I had a quick hack at filter_vep recently attempting a similar thing (trying to convert to JSON), here it is in a Gist: https://gist.github.com/willmclaren/95ffc7a6479a9dc9eb937fc80754b6cf . At the moment all it does is parse the VCF line into a nested hash structure, but you should be able to work from the Data::Dumper lines to get to what you want.

fgvieira commented 5 years ago

@willmclaren thanks for the script!

I have been tweaking it a bit and managed to get it to output a TSV file.

As for the JSON, I would like it to be as compatible as possible with the original VEP JSON format (here), but the labels used on the VEP JSON are different (e.g. Allele in the VCF, turns into variant_allele in the JSON). Is this "translation" available anywhere?

Also, there seems to be more info on the JSON file (e.g. on the VCF there is a single SIFT field, where on the JSON there is sift_prediction and sift_score). Where can I get all the extra information (I am already running VEP with the --everything flag)?

thanks,

willmclaren commented 5 years ago

The keys are renamed according to this hash; see the rest of that module for a clue as to how the standard output data structure is re-structured into JSON. Apart from the labels there are a few other differences between the JSON and other formats.

The standard SIFT output is just a combination of the score and prediction written as prediction(score) (assuming you're using --sift b, see the flag docs for more info).

I'm currently working towards getting VEP to output both VCF and JSON in the same run rather than post-converting one to the other (it's easier to avoid data loss or horrible refactoring this way); this will likely take the form of a plugin to generate the JSON alongside the main VCF output. I can post the results here if they're worth doing so :-)

fgvieira commented 5 years ago

Ok, I'll have a look. thanks!

Right now I am running VEP (output in VCF), filtering the output, and only then do I need to convert to JSON (and TSV). So, unless you can also apply a filter to the JSON file, I think it would actually be better to have another tool (or filter_vep script) to convert the output. But please post the results here! :)

thedam commented 5 years ago

try vcfR https://knausb.github.io/vcfR_documentation/

susannasiebert commented 5 years ago

You might be interested in the vep-annotation-reporter that is part of VAtools (vatools.org). Given a VEP-annotated VCF and a list of VEP fields it will convert the VCF entries to TSV and append the selected VEP fields as individual columns.

aiqc commented 1 year ago

Dynamic Tabular Parser

Wrote a parser because I couldn't find a way to get tabular VEP data that preserves the original VCF positional, INFO, and FORMAT data. It adjusts on the fly to handle different fields and file headers. The only package it uses is pandas. I also provided the code to explode the nested VCF columns.



For context, I'm using the VEP Docker image with these options (offline, vcf, latest gnomad).

vep -i in.vcf -o out.vcf --cache --offline --no_stats --quiet --vcf --symbol --pick --af_gnomadg --polyphen b --sift b



Tabular VEP Fields

There is a line in the vcf header that will look something like this depending on the fields/ order you specify:

INFO=


Let's parse whatever VEP fields we have out of VCF INFO and turn them into columns

import pandas as pd

# Load our vep-annotated vcf file
case_id = 'YOUR_CASE_ID'
file_path = f"{case_id}.anno.vcf"
with open(file_path) as f:
    lines = f.readlines()

# Extract VEP the schema above from file header
prefix_schema = "Ensembl VEP. Format: "
for line in lines:
    if (prefix_schema in line):              
        # We want everything after format and before closing syntax – broken into list
        vep_cols = line.split(prefix_schema)[1].split('">')[0].split('|')
        vep_cols = [f"VEP_{c}" for c in vep_cols]
        break

# Where file header ends and real data begins
prefix_cols = "#CHROM"
for i, line in enumerate(lines):
    if (line.startswith(prefix_cols)): 
        # Pandas wants line number - 1 so don't worry about zero-based index
        header_lineNum = i
        break

# Read vcf into a dataframe
df_vcf = pd.read_csv(file_path, sep='\t', header=header_lineNum)
df_vcf.rename(columns={'#CHROM':'CHROM'}, inplace=True)

# Everything before `;CSQ=` becomes INFO col / after becomes VEP col
df_vcf[['INFO', 'VEP']] = df_vcf['INFO'].str.split(';CSQ=', 1, expand=True)

# Explode the new VEP col
df_vcf[vep_cols] = df_vcf['VEP'].str.split('|',expand=True)
df_vcf = df_vcf.drop(columns=['VEP'])



[Optional] Tabular VCF FORMAT Fields

All of my variants were using the same format fields.

# Check that format is the same for all variants
if (len(set(df_vcf['FORMAT'])) == 1):

    # Break the format keys into a list
    format_cols = df_vcf['FORMAT'][0].split(':')
    format_cols = [f"FORMAT_{c}" for c in format_cols]

    # Explode the formatted sample-specific data 
    df_vcf[format_cols] = df_vcf[case_id].str.split(':',expand=True)
    df_vcf = df_vcf.drop(columns=[case_id, 'FORMAT'])
else:
    print("Warning – FORMAT not be exploded as format not the same for all variants")



[Optional] Tabular VCF INFO Fields

Remember, at this point, the VEP fields are no longer in the VEP INFO column.

This handles sparsity. Some of my variants weren't using the same INFO fields

# Sparsity means parsing rows individually
info_entries = []
for row in df_vcf['INFO']: 
    # Break the info string into key-value pairs
    row_pairs = dict(item.split("=") for item in row.split(";"))
    info_entries.append(row_pairs)

# Easy to make df from dict, but doing so makes separate df
df_info = pd.DataFrame.from_dict(info_entries)

# Explode the info column
info_cols = df_info.columns.tolist()
info_colsRename = [f"INFO_{c}" for c in info_cols]
renames = dict(zip(info_cols, info_colsRename))
df_info = df_info.rename(columns=renames)

# Merge back with the main df
df_vcf = pd.concat([df_vcf, df_info], axis=1)
df_vcf = df_vcf.drop(columns=['INFO'])
del df_info



Save It

df_vcf.to_csv(f"{case_id}.tab.anno.vcf", sep='\t', index=False)



Explore Columnar Data with Ease

image

image

Break free from the shackles of overly-specific command line tools for working with nested data

olaaustine commented 1 week ago

Hi @Hoeze, Hope you are well? Following discussions with the team, we have decided this is beyond the scope of Ensembl Variant Effect Predictor as a tool. Please let us know if you have any other issues. Thank you. Ola.