Need to add sample specific genotype information to the data frame generated using VEP annotated vcf file.

Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants

https://www.ensembl.org/vep

Apache License 2.0


Need to add sample specific genotype information to the data frame generated using VEP annotated vcf file. #1650

Open AppWick-hub opened 3 months ago

AppWick-hub commented 3 months ago

I am trying to generate a data frame (in .table format) from a VEP annotated multi sample vcf file. I am using the following code:

./vep -i ./input.vcf --offline --format vcf --assembly GRCh37 --everything --per_gene --species homo_sapiens \
--dir_cache /*/*/ --dir_plugins /*/*/ --plugin *, file=./*.txt \
--force_overwrite --tab --fields "Location,Allele,Gene,SYMBOL,Consequence, MAX_AF,Position in cDNA,Amino acid change,Codon change" -o */*.table

Using this, I am able to generate a .table file that contains the desired output, with two exceptions:

Is it possible to add sample specific GT, AD, DP, GQ information to this file?
Some columns, such as "Position in cDNA,Amino acid change,Codon change", are returned with missing values, even though they have annotations in the VEP annotated vcf file (in the CSQ field).

dglemos commented 3 months ago

Hi @AppWick-hub,

Is it possible to add sample specific GT, AD, DP, GQ information to this file?

To keep the sample info in the output you should use the VCF output format (--vcf). This type of data is not included in the tab output format.

Some columns, such as "Position in cDNA,Amino acid change,Codon change", are returned with missing values, even though they have annotations in the VEP annotated vcf file (in the CSQ field).

For the tab format, the selected fields have to be present in the default output columns. "Position in cDNA,Amino acid change,Codon change" are not part of the output; the correct column names are "cDNA_position,Amino_acids,Codons". Here you can read more about the --fields option: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_fields

AppWick-hub commented 2 months ago

I am wondering how we can add sample specific GT, AD, DP, GQ information to the tab format data frame that is generated. The fields "cDNA_position,Amino_acids,Codons" are returning values now. Thanks for clarifying.

dglemos commented 2 months ago

You could write a script to attach the sample info from the input VCF to the tab output. As an alternative, you could output VEP in VCF format and parse the file with bcftools +split-vep. This plugin converts VCF to tab-delimited text, with the option to print selected values on the same line.
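As an illustration of the first suggestion, here is a minimal sketch of such a script. The function name attach_sample_info is hypothetical, and the sketch rests on several assumptions that are not confirmed in this thread: the input VCF is uncompressed, the tab file was written with Location as its first field, its column header line starts with a single "#", and Location has the plain chrom:pos form (true for SNVs; indels are reported with shifted or ranged coordinates and would need extra normalization).

```shell
#!/bin/sh
# Hypothetical helper: append the FORMAT column and the per-sample columns
# of a VCF to each row of a VEP --tab file, joining on chrom:pos.
# Assumptions: uncompressed VCF; first tab column is Location (chrom:pos).
attach_sample_info() {
    vcf=$1   # input VCF (multi-sample, uncompressed)
    tab=$2   # VEP tab output whose first column is Location
    awk -F'\t' -v OFS='\t' '
        FNR == NR {                          # first file: the VCF
            if ($0 ~ /^##/) next             # skip meta-information lines
            if ($0 ~ /^#CHROM/) {            # header: remember sample names
                names = OFS "FORMAT"
                for (i = 10; i <= NF; i++) names = names OFS $i
                ncol = NF - 8                # FORMAT plus one column per sample
                next
            }
            vals = OFS $9                    # e.g. GT:AD:DP:GQ
            for (i = 10; i <= NF; i++) vals = vals OFS $i
            info[$1 ":" $2] = vals           # key: chrom:pos
            next
        }
        /^##/ { print; next }                # second file: keep VEP meta lines
        /^#/  { print $0 names; next }       # extend the column header
        {
            if ($1 in info) print $0 info[$1]
            else {                           # no matching VCF record: pad with "."
                pad = ""
                for (i = 1; i <= ncol; i++) pad = pad OFS "."
                print $0 pad
            }
        }
    ' "$vcf" "$tab"
}
```

A call such as attach_sample_info input.vcf output.table > output_with_samples.table (file names are placeholders) would write the extended table. Each sample column keeps the raw colon-separated value, to be read against the FORMAT column next to it.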

AppWick-hub commented 2 months ago

Thanks for the tip dglemos. The plugin https://samtools.github.io/bcftools/howtos/plugin.split-vep.html works well.

AppWick-hub commented 2 months ago

Hi, Another issue came up recently similar to the one mentioned above.

When I use --plugin dbNSFP,./dbNSFP4.gz,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue to extract a few columns from the dbNSFP database, every row of the generated columns (GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue) in the tab output file contains the text "invalid_value" instead of an actual numerical value, gene name, or tissue expression text.

dglemos commented 2 months ago

Hi @AppWick-hub, The dbNSFP file name dbNSFP4.gz does not include the version. Which dbNSFP version are you using? Did you check whether your dbNSFP file contains the columns GDI, GO_biological_process, GTEx_V8_eQTL_gene and GTEx_V8_eQTL_tissue?
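One way to run that check is sketched below. The function name missing_columns is hypothetical, and the sketch assumes the prepared file is gzip- or bgzip-compressed, tab-delimited, and has its column header on the first line (optionally prefixed with "#").

```shell
#!/bin/sh
# Hypothetical helper: print every requested column name that is absent
# from the first (header) line of a gzipped, tab-delimited file.
missing_columns() {
    file=$1; shift
    # One header name per line, with a leading "#" stripped (e.g. "#chr").
    header=$(gzip -dc "$file" | head -n 1 | tr '\t' '\n' | sed 's/^#//')
    for col in "$@"; do
        # -x: whole-line match, -F: literal string (names may contain parentheses)
        printf '%s\n' "$header" | grep -qxF -- "$col" || printf '%s\n' "$col"
    done
}
```

For example, missing_columns ./dbNSFP4.7a_grch37.gz GDI GO_biological_process GTEx_V8_eQTL_gene GTEx_V8_eQTL_tissue prints only the names that are not in the header, and prints nothing if all four are present.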

AppWick-hub commented 2 months ago

Hi, The dbNSFP file name is dbNSFP4.7a_grch37.gz. In dbNSFP4.7a.readme.txt, the column names GDI, GO_biological_process, GTEx_V8_eQTL_gene, GTEx_V8_eQTL_tissue are listed. The generated .table file also contains these columns. Surprisingly, there is a warning while generating the .table file: "the following columns were not found in file header: GDI, GO_biological_process".

dglemos commented 2 months ago

Can you send all the commands you run to generate the files?

AppWick-hub commented 2 months ago

Sure. Following are the commands I used:

./vep \
--offline --format vcf --assembly GRCh37 \
--dir_cache /*/*/ \
--force_overwrite \
--everything --per_gene --species homo_sapiens \
--fork 40 \
--dir_plugins /*/*/ \
--plugin pLI,file=./pLI_values.txt \
--plugin dbNSFP,./dbNSFP4.7a_grch37.gz,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue \
-i ./*.vcf \
--tab --fields "Location,Allele,SYMBOL,Consequence,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue" -o stdout | \
filter_vep --filter "MAX_AF < 0.01 or not MAX_AF" \
-o ./*.table

dglemos commented 2 months ago

Thanks. Which commands did you run to prepare the file dbNSFP4.7a_grch37.gz?

AppWick-hub commented 2 months ago

As highlighted here: https://github.com/Ensembl/VEP_plugins/blob/release/111/dbNSFP.pm, when I tried to download the dbNSFP4.7a.zip file using the following command:

wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.7a.zip

it failed due to some connection issue. So I used:

wget https://dbnsfp.s3.amazonaws.com/dbNSFP4.7a.zip

as mentioned here: https://sites.google.com/site/jpopgen/dbNSFP. And then I just followed the remaining steps as mentioned here: https://github.com/Ensembl/VEP_plugins/blob/release/111/dbNSFP.pm.

dglemos commented 2 months ago

Can you send the first lines of your dbNSFP file?

Surprisingly, there is a warning while generating the .table file which is : the following columns were not found in file header: GDI, GO_biological_process.

Which command did you run to generate that file? Could you please send the command and the full warning message?

dglemos commented 2 months ago

I cannot see GDI and GO_biological_process in your header, but it's difficult to find the header names in the screenshot. Can you paste the header line here? Can you please double-check that the file you downloaded has these two columns? For GTEx_V8_eQTL_gene and GTEx_V8_eQTL_tissue, if there is no value then the column contains ".", which means VEP will attach "." to your variant. Please paste the VEP output here.