Open AppWick-hub opened 3 months ago
Hi @AppWick-hub,
Is it possible to add sample specific GT, AD, DP, GQ information to this file?
To keep the sample info in the output you should use the VCF output format (--vcf). This type of data is not included in the tab output format.
Some columns, such as Position in "cDNA,Amino acid change,Codon change", are returning with missing values, even though they have annotations in the VEP-annotated VCF file (in the CSQ field).
For the tab format, the selected fields have to be present in the default output columns.
"cDNA,Amino acid change,Codon change" are not part of the output; the correct column names are "cDNA_position,Amino_acids,Codons".
Here you can read more about the --fields option: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_fields
I am wondering how we can add sample-specific GT, AD, DP, GQ information to the generated tab-format data frame. The fields "cDNA_position,Amino_acids,Codons" are returning values now. Thanks for clarifying.
You could have a script to attach the sample info from the input VCF to the tab output. As an alternative, you could output VEP in VCF format and parse the file with bcftools split-vep. This plugin converts from vcf to tab with the option to print selected values in the same line.
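For the first option, a script could look roughly like the sketch below. It is only an illustration under some assumptions that are not stated in this thread: the input VCF is uncompressed, its ID column holds a unique identifier per record, and the tab output still has the default first column (#Uploaded_variation), which carries that same ID. The function name attach_samples is made up for this example.

```shell
# attach_samples VCF TAB
# Hypothetical helper: appends the FORMAT and per-sample columns of an
# uncompressed VCF to the matching rows of a VEP tab output.
# Assumption: VCF column 3 (ID) is unique and equals column 1 of the tab file.
attach_samples() {
  awk -F'\t' -v OFS='\t' '
    FNR == NR {                          # first file: the input VCF
      if ($0 ~ /^##/) next               # skip VCF meta lines
      extra = $9                         # FORMAT column
      for (i = 10; i <= NF; i++) extra = extra OFS $i
      if ($0 ~ /^#CHROM/) header = extra # "FORMAT" plus the sample names
      else samples[$3] = extra           # keyed by the VCF ID
      next
    }
    /^##/ { print; next }                # VEP meta lines pass through
    /^#/  { print $0, header; next }     # extend the column header line
    { print $0, ($1 in samples ? samples[$1] : "NA") }
  ' "$1" "$2"
}
```

It would be called as attach_samples input.vcf vep_output.tab > with_samples.tab. Rows whose identifier is not found in the VCF get a single NA. Each sample column still holds the colon-separated values in FORMAT order (for example GT:AD:DP:GQ), so splitting those into separate columns would be a further step.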
Thanks for the tip dglemos. The plugin https://samtools.github.io/bcftools/howtos/plugin.split-vep.html works well.
Hi, Another issue came up recently similar to the one mentioned above.
When I use
--plugin dbNSFP,./dbNSFP4.gz,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue
to extract a few columns from the dbNSFP database, every row of the generated columns (GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue) in the tab output file contains the text "invalid_value" instead of an actual numerical value, gene name, or tissue expression text.
Hi @AppWick-hub,
The dbNSFP file name dbNSFP4.gz does not include the version. Which dbNSFP version are you using?
Did you check if your dbNSFP file contains the columns: GDI, GO_biological_process, GTEx_V8_eQTL_gene, GTEx_V8_eQTL_tissue?
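One way to run that check is to look at the first line of the compressed file directly. The snippet below is a minimal sketch, assuming the file is gzip/bgzip compressed and that its first line is the tab-separated header starting with "#"; check_columns is a hypothetical helper name, not part of VEP or the plugin.

```shell
# check_columns FILE COLUMN...
# Hypothetical helper: reports, for each requested column name, whether it
# appears exactly in the tab-separated header (first line) of a gzipped file.
check_columns() {
  file=$1
  shift
  # Read only the header line; bgzip output is readable by gzip.
  header=$(gzip -dc "$file" | head -n 1)
  for col in "$@"; do
    # One name per line, drop the leading "#", then match the whole line.
    if printf '%s\n' "$header" | tr '\t' '\n' | sed 's/^#//' | grep -Fxq -- "$col"; then
      echo "found: $col"
    else
      echo "missing: $col"
    fi
  done
}
```

For example: check_columns ./dbNSFP4.7a_grch37.gz GDI GO_biological_process GTEx_V8_eQTL_gene GTEx_V8_eQTL_tissue. The match is exact and case-sensitive, so a name that differs by a single character is reported as missing.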
Hi, The dbNSFP file name is dbNSFP4.7a_grch37.gz. In dbNSFP4.7a.readme.txt, the column names GDI, GO_biological_process, GTEx_V8_eQTL_gene, GTEx_V8_eQTL_tissue are there. Also the generated .table file contains these columns. Surprisingly, there is a warning while generating the .table file, which is: the following columns were not found in file header: GDI, GO_biological_process.
Can you send all the commands you run to generate the files?
Sure. Following are the commands I used:
./vep \
--offline --format vcf --assembly GRCh37 \
--dir_cache /*/*/ \
--force_overwrite \
--everything --per_gene --species homo_sapiens \
--fork 40 \
--dir_plugins /*/*/ \
--plugin pLI,file=./pLI_values.txt \
--plugin dbNSFP,./dbNSFP4.7a_grch37.gz,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue \
-i ./*.vcf \
--tab --fields "Location,Allele,SYMBOL,Consequence,GDI,GO_biological_process,GTEx_V8_eQTL_gene,GTEx_V8_eQTL_tissue" -o stdout | \
filter_vep --filter "MAX_AF < 0.01 or not MAX_AF" \
-o ./*.table
Thanks.
Which commands did you run to prepare the file dbNSFP4.7a_grch37.gz?
As highlighted here: https://github.com/Ensembl/VEP_plugins/blob/release/111/dbNSFP.pm, when I tried to download the dbNSFP4.7a.zip file using the following command:
wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.7a.zip
it failed due to some connection issue. So I used:
wget https://dbnsfp.s3.amazonaws.com/dbNSFP4.7a.zip
as mentioned here: https://sites.google.com/site/jpopgen/dbNSFP. And then I just followed the remaining steps as mentioned here: https://github.com/Ensembl/VEP_plugins/blob/release/111/dbNSFP.pm.
Can you send the first lines of your dbNSFP file?
Surprisingly, there is a warning while generating the .table file which is : the following columns were not found in file header: GDI, GO_biological_process.
Which command did you run to generate that file? Could you please send the command and the full warning message?
I cannot see GDI and GO_biological_process in your header, but it's difficult to find the header names in the screenshot. Can you paste the header line here?
Can you please double-check the file you downloaded has these two columns?
For GTEx_V8_eQTL_gene and GTEx_V8_eQTL_tissue, if there is no value then the column has "." in the dbNSFP file; this means VEP will attach "." to your variant.
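To see quickly whether a column in the tab output holds ".", "invalid_value", or real values, the distinct values can be tallied. This is only a rough sketch, assuming a tab output whose column header line starts with a single "#" and whose meta lines start with "##"; tally_column is a hypothetical helper name.

```shell
# tally_column TAB COLUMN
# Hypothetical helper: counts each distinct value in one named column of a
# VEP tab output and prints "count<TAB>value", most frequent first.
tally_column() {
  awk -F'\t' -v col="$2" '
    /^##/ { next }                       # skip VEP meta lines
    /^#/ {                               # column header line
      sub(/^#/, "", $1)
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      next
    }
    idx { count[$idx]++ }                # tally only once the column is known
    END {
      if (!idx) { print "column not found: " col; exit 1 }
      for (v in count) print count[v] "\t" v
    }
  ' "$1" | sort -k1,1nr -k2,2
}
```

For example, tally_column ./output.table GTEx_V8_eQTL_gene (the file name here is a placeholder) would show how many rows carry "." versus any other value.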
Please paste the VEP output here.
I am trying to generate a data frame (in .table format) from a VEP-annotated multi-sample VCF file. I am using the following code:
Using this, I am able to generate a .table file that contains the desired output, with some exceptions. These are as follows:
Is it possible to add sample specific GT, AD, DP, GQ information to this file?
Some columns, such as Position in "cDNA,Amino acid change,Codon change", are returning with missing values, even though they have annotations in the VEP-annotated VCF file (in the CSQ field).