Closed fgvieira closed 1 month ago
Dear @fgvieira,
Thank you for the feature suggestion! We keep track of the users suggestions and the frequently requested features are often implemented. Bear in mind that it might take some time to implement it, mainly because of other features already planned and our release cycle (~3 months).
Best regards, Laurent
I've been looking into this, and it seems it should be possible to make using Ensemble VEP perl modules. However, I've been having some issues getting it to properly read the VCF. Here is what I have so far:
use Bio::EnsEMBL::VEP::Config;
use Bio::EnsEMBL::VEP::Parser::VCF;
use Bio::EnsEMBL::VEP::InputBuffer;
use Bio::EnsEMBL::VEP::OutputFactory::JSON;
my $cfg = Bio::EnsEMBL::VEP::Config->new();
my $vcf_file = 'test.vcf';
my $p = Bio::EnsEMBL::VEP::Parser::VCF->new({config => $cfg, file => $vcf_file});
my $ib = Bio::EnsEMBL::VEP::InputBuffer->new({config => $cfg, parser => $p});
my $of = Bio::EnsEMBL::VEP::OutputFactory::Tab->new({config => $cfg,});
# Print headers
print "$_\n" for @{$of->headers};
# Print output
print "$_\n" for @{$of->get_all_lines_by_InputBuffer($ib)};
Am I missing something?
Is the $cfg
variable needed? If so, what should it contain?
thanks,
Have you tried vcf-to-tab which is part of VCFtools?
Neither your code nor vcf-to-tab will convert VCF-encoded VEP output to the default tab-delimited VEP output, which is what I assume @fgvieira requires.
Your best bet for starting some code would be to look at filter_vep
, which contains code for parsing the VEP results encoded in VCF.
I had a quick hack at filter_vep
recently attempting a similar thing (trying to convert to JSON), here it is in a Gist: https://gist.github.com/willmclaren/95ffc7a6479a9dc9eb937fc80754b6cf . At the moment all it does is parse the VCF line into a nested hash structure, but you should be able to work from the Data::Dumper
lines to get to what you want.
@willmclaren thanks for the script!
I have been tweaking it a bit and managed to get it to output a TSV file.
As for the JSON, I would like it to be as compatible as possible with the original VEP JSON format (here), but the labels used on the VEP JSON are different (e.g. Allele
in the VCF, turns into variant_allele
in the JSON). Is this "translation" available anywhere?
Also, there seems to be more info on the JSON file (e.g. on the VCF there is a single SIFT field, where on the JSON there is sift_prediction and sift_score). Where can I get all the extra information (I am already running VEP with the --everything
flag)?
thanks,
The keys are renamed according to this hash; see the rest of that module for a clue as to how the standard output data structure is re-structured into JSON. Apart from the labels there are a few other differences between the JSON and other formats.
The standard SIFT output is just a combination of the score and prediction written as prediction(score)
(assuming you're using --sift b
, see the flag docs for more info).
I'm currently working towards getting VEP to output both VCF and JSON in the same run rather than post-converting one to the other (it's easier to avoid data loss or horrible refactoring this way); this will likely take the form of a plugin to generate the JSON alongside the main VCF output. I can post the results here if they're worth doing so :-)
Ok, I'll have a look. thanks!
Right now I am running VEP (output in VCF), filtering the output, and only then do I need to convert to JSON (and TSV). So, unless you can also apply a filter to the JSON file, I think it would actually be better to have another tool (or filter_vep script) to convert the output. But please post the results here! :)
You might be interested in the vep-annotation-reporter that is part of VAtools (vatools.org). Given a VEP-annotated VCF and a list of VEP fields it will convert the VCF entries to TSV and append the selected VEP fields as individual columns.
Wrote a parser because I couldn't find a way to get tabular VEP data that preserves the original VCF positional, INFO, and FORMAT data. It adjusts on the fly to handle different fields and file headers. The only package it uses is pandas. I also provided the code to explode the nested VCF columns.
For context, I'm using the VEP Docker image with these options (offline, vcf, latest gnomad).
vep -i in.vcf -o out.vcf --cache --offline --no_stats --quiet --vcf --symbol --pick --af_gnomadg --polyphen b --sift b
There is a line in the vcf header that will look something like this depending on the fields/ order you specify:
INFO=
Let's parse whatever VEP fields we have out of VCF INFO and turn them into columns
import pandas as pd
# Load our vep-annotated vcf file
case_id = 'YOUR_CASE_ID'
file_path = f"{case_id}.anno.vcf"
with open(file_path) as f:
lines = f.readlines()
# Extract VEP the schema above from file header
prefix_schema = "Ensembl VEP. Format: "
for line in lines:
if (prefix_schema in line):
# We want everything after format and before closing syntax – broken into list
vep_cols = line.split(prefix_schema)[1].split('">')[0].split('|')
vep_cols = [f"VEP_{c}" for c in vep_cols]
break
# Where file header ends and real data begins
prefix_cols = "#CHROM"
for i, line in enumerate(lines):
if (line.startswith(prefix_cols)):
# Pandas wants line number - 1 so don't worry about zero-based index
header_lineNum = i
break
# Read vcf into a dataframe
df_vcf = pd.read_csv(file_path, sep='\t', header=header_lineNum)
df_vcf.rename(columns={'#CHROM':'CHROM'}, inplace=True)
# Everything before `;CSQ=` becomes INFO col / after becomes VEP col
df_vcf[['INFO', 'VEP']] = df_vcf['INFO'].str.split(';CSQ=', 1, expand=True)
# Explode the new VEP col
df_vcf[vep_cols] = df_vcf['VEP'].str.split('|',expand=True)
df_vcf = df_vcf.drop(columns=['VEP'])
All of my variants were using the same format fields.
# Check that format is the same for all variants
if (len(set(df_vcf['FORMAT'])) == 1):
# Break the format keys into a list
format_cols = df_vcf['FORMAT'][0].split(':')
format_cols = [f"FORMAT_{c}" for c in format_cols]
# Explode the formatted sample-specific data
df_vcf[format_cols] = df_vcf[case_id].str.split(':',expand=True)
df_vcf = df_vcf.drop(columns=[case_id, 'FORMAT'])
else:
print("Warning – FORMAT not be exploded as format not the same for all variants")
Remember, at this point, the VEP fields are no longer in the VEP INFO column.
This handles sparsity. Some of my variants weren't using the same INFO fields
# Sparsity means parsing rows individually
info_entries = []
for row in df_vcf['INFO']:
# Break the info string into key-value pairs
row_pairs = dict(item.split("=") for item in row.split(";"))
info_entries.append(row_pairs)
# Easy to make df from dict, but doing so makes separate df
df_info = pd.DataFrame.from_dict(info_entries)
# Explode the info column
info_cols = df_info.columns.tolist()
info_colsRename = [f"INFO_{c}" for c in info_cols]
renames = dict(zip(info_cols, info_colsRename))
df_info = df_info.rename(columns=renames)
# Merge back with the main df
df_vcf = pd.concat([df_vcf, df_info], axis=1)
df_vcf = df_vcf.drop(columns=['INFO'])
del df_info
df_vcf.to_csv(f"{case_id}.tab.anno.vcf", sep='\t', index=False)
Break free from the shackles of overly-specific command line tools for working with nested data
Hi @Hoeze, Hope you are well? Following discussions with the team, we have decided this is beyond the scope of Ensembl Variant Effect Predictor as a tool. Please let us know if you have any other issues. Thank you. Ola.
It would be nice if there was a way to convert VCFs annotated by VEP to tab format. It could be just adding the option
-tab
tofilter_vep
(and the user could just run it without filters if just wanted to convert).I know VEP can output in TAB (
--tab
), but sometimes you just don't want to keep annotating files just to convert the format.