The VEP package requires:
The remaining dependencies can be installed using the included INSTALL.pl script. Basic instructions:
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl
The installer may also be used to check for updates to this and co-dependent packages; simply re-run INSTALL.pl.
See documentation for full installation instructions.
The following modules are optional but most users will benefit from installing them. We recommend using cpanminus to install.
(--database or --cache without --offline)
A Docker image for VEP is available from DockerHub.
See documentation for the Docker installation instructions.
./vep -i input.vcf -o out.txt -offline
See documentation for full command line instructions.
Please report any bugs or issues by contacting Ensembl or creating a GitHub issue.
haplo is a local tool implementation of the same functionality that powers the Ensembl transcript haplotypes view. It takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.
This approach offers an advantage over VEP's analysis, which treats each input variant independently. By considering the combined change contributed by all the variant alleles across a transcript, the compound effects the variants may have are correctly accounted for.
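The compound effect described above can be illustrated with a small sketch. It is not haplo's implementation: the reference codon, the two substitutions and the four-entry codon table are hypothetical values chosen only to show how two variants in the same codon can give a different result together than when each is assessed alone.

```python
# Minimal codon table covering only the codons used in this illustration.
CODON_TABLE = {"AAG": "Lys", "TAG": "Ter", "AAC": "Asn", "TAC": "Tyr"}


def apply_variants(codon, variants):
    """Return the codon after applying (0-based offset, alt base) substitutions."""
    bases = list(codon)
    for offset, alt in variants:
        bases[offset] = alt
    return "".join(bases)


ref_codon = "AAG"   # hypothetical reference codon (Lys)
snv_a = (0, "T")    # hypothetical A>T at the first codon base
snv_b = (2, "C")    # hypothetical G>C at the third codon base

# Each variant assessed on its own, as an independent per-variant analysis would.
independent = [CODON_TABLE[apply_variants(ref_codon, [v])] for v in (snv_a, snv_b)]

# Both variants applied to the same haplotype.
combined = CODON_TABLE[apply_variants(ref_codon, [snv_a, snv_b])]

print("independent:", independent)
print("combined:", combined)
```

Assessed independently, the first substitution looks like a stop codon (Ter) and the second like Asn; on a shared haplotype the codon becomes TAC (Tyr), a missense change with no premature stop.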
haplo shares much of the same command line functionality with vep, and can use VEP caches, Ensembl databases, GFF and GTF files as sources of transcript data; all vep command line flags relating to this functionality work the same with haplo.
Input data must be a VCF containing phased genotype data for at least one individual, and the file must be sorted by chromosome and genomic position; no other formats are currently supported.
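A quick pre-flight check of these input requirements might look like the sketch below. It is not part of the VEP package. It assumes "sorted by chromosome" means that records for each chromosome are contiguous with non-decreasing positions, that GT is the first FORMAT key, and that a genotype containing "/" is unphased; the function name check_phased_sorted is hypothetical.

```python
def check_phased_sorted(lines):
    """Return a list of problems found in an iterable of VCF text lines."""
    problems = []
    seen_chroms = []   # chromosomes in order of appearance
    last_pos = None
    for n, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue   # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if not seen_chroms or chrom != seen_chroms[-1]:
            if chrom in seen_chroms:
                problems.append(f"line {n}: chromosome {chrom} is not contiguous")
            seen_chroms.append(chrom)
            last_pos = None
        if last_pos is not None and pos < last_pos:
            problems.append(f"line {n}: position {pos} precedes {last_pos}")
        last_pos = pos
        if len(fields) < 10:
            problems.append(f"line {n}: no genotype columns")
            continue
        for sample in fields[9:]:
            genotype = sample.split(":")[0]   # assumes GT is listed first
            if "/" in genotype:
                problems.append(f"line {n}: unphased genotype {genotype}")
    return problems


# Two hypothetical records: unsorted, and the first one is unphased.
example = [
    "1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\n",
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0|1\n",
]
for problem in check_phased_sorted(example):
    print(problem)
```

For a real file, pass an open file handle, for example check_phased_sorted(open("input.vcf")); compressed input would need to be decompressed first.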
When using a VEP cache as the source of transcript annotation, the first time you run haplo with a particular cache it will spend some time scanning transcript locations in the cache.
./haplo -i input.vcf -o out.txt -cache
The default output format is a simple tab-delimited file reporting all observed non-reference haplotypes. It has the following fields:
The altered haplotype sequences can be obtained by switching to JSON output using --json, which will display them by default.
Each transcript analysed is summarised as a JSON object written to one line of the output file.
The JSON output structure matches the format of the transcript haplotype REST endpoint.
You may exclude fields in the JSON from being exported with --dont_export field1,field2. This may be used, for example, to exclude the full haplotype sequence and aligned sequences from the output with --dont_export seq,aligned_sequences.
Note that JSON output does not currently include side-loaded frequency data.
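Because each transcript is written as one JSON object per line, the output can be read line by line. The sketch below is only an illustration: the sample record is hypothetical and heavily abbreviated, the "name" key is invented for the example, and the exact location of the seq and aligned_sequences fields within a record is an assumption, which is why the helper strips them at any depth. The cds_haplotypes and protein_haplotypes keys are those this README names for the REST endpoint.

```python
import json


def strip_fields(value, unwanted):
    """Recursively drop dictionary keys named in `unwanted`."""
    if isinstance(value, dict):
        return {k: strip_fields(v, unwanted) for k, v in value.items() if k not in unwanted}
    if isinstance(value, list):
        return [strip_fields(v, unwanted) for v in value]
    return value


def read_haplo_json(lines, unwanted=("seq", "aligned_sequences")):
    """Yield one dict per non-empty line of line-delimited JSON."""
    unwanted = set(unwanted)
    for line in lines:
        if line.strip():
            yield strip_fields(json.loads(line), unwanted)


# Hypothetical, abbreviated record; real records carry many more fields.
sample = [
    '{"cds_haplotypes": [{"name": "x", "seq": "ATG"}], '
    '"protein_haplotypes": [{"name": "y", "seq": "M"}, {"name": "z", "seq": "M"}]}\n'
]

for record in read_haplo_json(sample):
    n_cds = len(record.get("cds_haplotypes", []))
    n_protein = len(record.get("protein_haplotypes", []))
    print(n_cds, n_protein)
```

Stripping fields after the fact mirrors what --dont_export does at export time; using the flag is preferable when the sequences are never needed, since it keeps the output file smaller.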
The transcript haplotype REST endpoint returns arrays of protein_haplotypes and cds_haplotypes for a given transcript. The default haplotype record includes:
By default, the REST service does not return raw sequences, sample-haplotype assignments or the aligned sequences used to generate differences.
Haplotypes may be flagged with one or more of the following:
haplo can make use of a fast compiled alignment algorithm from the bioperl-ext package; this can speed up analysis, particularly in longer transcripts where insertions and/or deletions are introduced. The bioperl-ext package is no longer maintained and requires some tweaking to install. The following instructions install the package in $HOME/perl5; edit PREFIX=[path] to change this. You may also need to edit the export command to point to the path created for the architecture on your machine.
git clone https://github.com/bioperl/bioperl-ext.git
cd bioperl-ext/Bio/Ext/Align/
perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
perl Makefile.PL PREFIX=~/perl5
make
make install
cd -
export PERL5LIB=${PERL5LIB}:${HOME}/perl5/lib/x86_64-linux-gnu/perl/5.22.1/
If successful, the following should print OK:
perl -MBio::Tools::dpAlign -e"print qq{OK\n}"
variant_recoder is a tool for translating between different variant encodings. It accepts as input any format supported by VEP (VCF, variant ID, HGVS), with extensions to allow for parsing of potentially ambiguous HGVS notations. For each input variant, variant_recoder reports all possible encodings including variant IDs from all sources imported into the Ensembl database and HGVS (genomic, transcript and protein), reported on Ensembl, RefSeq and LRG sequences.
variant_recoder depends on database access for identifier lookup and, unlike VEP, cannot be used in offline mode. The output format is JSON and the JSON Perl module is required.
./variant_recoder --id [input_data_string]
./variant_recoder -i [input_file] --species [species]
Output is a JSON array of objects, one per input variant, with the following keys:
Use --pretty to pre-format and indent JSON output.
Example output:
./variant_recoder --id "AGT:p.Met259Thr" --pretty
[
  {
    "warnings" : [
      "Possible invalid use of gene or protein identifier 'AGT' as HGVS reference; AGT:p.Met259Thr may resolve to multiple genomic locations"
    ],
    "C" : {
      "input" : "AGT:p.Met259Thr",
      "id" : [
        "rs699",
        "CM920010",
        "COSV64184214"
      ],
      "hgvsg" : [
        "NC_000001.11:g.230710048A>G"
      ],
      "hgvsc" : [
        "ENST00000366667.6:c.776T>C",
        "ENST00000679684.1:c.776T>C",
        "ENST00000679738.1:c.776T>C",
        "ENST00000679802.1:c.776T>C",
        "ENST00000679854.1:n.1287T>C",
        "ENST00000679957.1:c.776T>C",
        "ENST00000680041.1:c.776T>C",
        "ENST00000680783.1:c.776T>C",
        "ENST00000681269.1:c.776T>C",
        "ENST00000681347.1:n.1287T>C",
        "ENST00000681514.1:c.776T>C",
        "ENST00000681772.1:c.776T>C",
        "NM_001382817.3:c.776T>C",
        "NM_001384479.1:c.776T>C"
      ],
      "hgvsp" : [
        "ENSP00000355627.5:p.Met259Thr",
        "ENSP00000505981.1:p.Met259Thr",
        "ENSP00000505063.1:p.Met259Thr",
        "ENSP00000505184.1:p.Met259Thr",
        "ENSP00000506646.1:p.Met259Thr",
        "ENSP00000504866.1:p.Met259Thr",
        "ENSP00000506329.1:p.Met259Thr",
        "ENSP00000505985.1:p.Met259Thr",
        "ENSP00000505963.1:p.Met259Thr",
        "ENSP00000505829.1:p.Met259Thr",
        "NP_001369746.2:p.Met259Thr",
        "NP_001371408.1:p.Met259Thr"
      ],
      "spdi" : [
        "NC_000001.11:230710047:A:G"
      ]
    }
  }
]
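Output of this shape can be consumed with any JSON parser. The sketch below walks an abbreviated copy of the example above, skipping the "warnings" key and treating every other top-level key as an allele. It also converts the SPDI string to a 1-based position: SPDI positions are 0-based, which is why 230710047 in the spdi field corresponds to g.230710048 in the hgvsg field. The helper names are hypothetical, and the SPDI parsing assumes the simple substitution form shown here.

```python
import json


def spdi_to_one_based(spdi):
    """Split an SPDI string into (sequence, 1-based position, deleted, inserted)."""
    sequence, position, deleted, inserted = spdi.split(":")
    return sequence, int(position) + 1, deleted, inserted


def iter_alleles(recoder_output):
    """Yield (allele, encodings) pairs, skipping the 'warnings' key."""
    for entry in recoder_output:
        for key, value in entry.items():
            if key != "warnings":
                yield key, value


# Abbreviated copy of the example output; the warning text is omitted.
raw = """
[
  {
    "warnings": ["(omitted)"],
    "C": {
      "input": "AGT:p.Met259Thr",
      "hgvsg": ["NC_000001.11:g.230710048A>G"],
      "spdi": ["NC_000001.11:230710047:A:G"]
    }
  }
]
"""

for allele, encodings in iter_alleles(json.loads(raw)):
    for spdi in encodings.get("spdi", []):
        print(allele, *spdi_to_one_based(spdi))
```

In practice the raw string would come from the tool's standard output or from a file it was redirected to.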
variant_recoder shares many of the same command line flags as VEP. Others are unique to variant_recoder.
-id|--input_data [input_string]: a single variant as a string.
-i|--input_file [input_file]: input file containing one or more variants, one per line. Mixed formats disallowed.
--species: species to use (default: homo_sapiens).
--grch37: use GRCh37 assembly instead of GRCh38.
--genomes: set database parameters for Ensembl Genomes species.
--pretty: write pre-formatted indented JSON.
--fields [field1,field2]: limit output fields. Comma-separated list, one or more of: id, hgvsg, hgvsc, hgvsp, spdi.
--vcf_string: report VCF
--var_synonyms: report variation synonyms
--mane_select: report MANE Select transcripts in HGVS format
--host [db_host]: change database host from default ensembldb.ensembl.org (UK); geographic mirrors are useastdb.ensembl.org (US East Coast) and asiadb.ensembl.org (Asia). --user, --port and --pass may also be set.
--pick, --per_gene, --pick_allele, --pick_allele_gene, --pick_order: set and customise transcript selection process, see VEP documentation