Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Seeking clarity on --fields vs --custom usage #1721

Closed errcricket closed 2 months ago

errcricket commented 2 months ago

Greetings.

I am trying to understand VEP a bit better. I am running VEP (version 112, Perl 5.26.3) on a Linux server (CentOS) and have manually downloaded the indexed GRCh38 cache from here. In the info.txt file under the untarred homo_sapiens/112_GRCh38/ directory I see the following:

species  homo_sapiens
assembly GRCh38
sift  b
polyphen b
source_polyphen   2.2.3
source_sift 6.2.1
source_genebuild  2014-07
source_gencode GENCODE 46
source_assembly   GRCh38.p14
variation_cols chr,variation_name,failed,somatic,start,end,allele_string,strand,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,var_synonyms,AF,AFR,AMR,EAS,EUR,SAS,gnomADe,gnomADe_AFR,gnomADe_AMR,gnomADe_ASJ,gnomADe_EAS,gnomADe_FIN,gnomADe_NFE,gnomADe_OTH,gnomADe_SAS,gnomADg,gnomADg_AFR,gnomADg_AMI,gnomADg_AMR,gnomADg_ASJ,gnomADg_EAS,gnomADg_FIN,gnomADg_MID,gnomADg_NFE,gnomADg_OTH,gnomADg_SAS
source_COSMIC  98  
source_HGMD-PUBLIC   20204
source_ClinVar 202310
source_dbSNP   156 
source_1000genomes   phase3
source_gnomADe r2.1.1
source_gnomADg v3.1.2
var_type tabix
regulatory  1
cell_types  A549,A673,B,B_(PB),CD14+_monocyte_(PB),CD14+_monocyte_1,CD4+_CD25+_ab_Treg_(PB),CD4+_ab_T,CD4+_ab_T_(PB)_1,CD4+_ab_T_(PB)_2,CD4+_ab_T_(Th),CD4+_ab_T_(VB),CD8+_ab_T_(CB),CD8+_ab_T_(PB),CMP_CD4+_1,CMP_CD4+_2,CMP_CD4+_3,CM_CD4+_ab_T_(VB),DND-41,EB_(CB),EM_CD4+_ab_T_(PB),EM_CD8+_ab_T_(VB),EPC_(VB),GM12878,H1-hESC_2,H1-hESC_3,H9_1,HCT116,HSMM,HUES48,HUES6,HUES64,HUVEC,HUVEC-prol_(CB),HeLa-S3,HepG2,K562,M0_(CB),M0_(VB),M1_(CB),M1_(VB),M2_(CB),M2_(VB),MCF-7,MM.1S,MSC,MSC_(VB),NHLF,NK_(PB),NPC_1,NPC_2,NPC_3,PC-3,PC-9,SK-N.,T_(PB),Th17,UCSF-4,adrenal_gland,aorta,astrocyte,bipolar_neuron,brain_1,cardiac_muscle,dermal_fibroblast,endodermal,eosinophil_(VB),esophagus,foreskin_fibroblast_2,foreskin_keratinocyte_1,foreskin_keratinocyte_2,foreskin_melanocyte_1,foreskin_melanocyte_2,germinal_matrix,heart,hepatocyte,iPS-15b,iPS-20b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,large_intestine,left_ventricle,leg_muscle,lung_1,lung_2,mammary_epithelial_1,mammary_epithelial_2,mammary_myoepithelial,monocyte_(CB),monocyte_(VB),mononuclear_(PB),myotube,naive_B_(VB),neuron,neurosphere_(C),neurosphere_(GE),neutro_myelocyte,neutrophil_(CB),neutrophil_(VB),osteoblast,ovary,pancreas,placenta,psoas_muscle,right_atrium,right_ventricle,sigmoid_colon,small_intestine_1,small_intestine_2,spleen,stomach_1,stomach_2,thymus_1,thymus_2,trophoblast,trunk_muscle
source_regbuild   1.0

When running VEP with --fields set to Consequence,IMPACT,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,AF,gnomAD_AF,CLIN_SIG,PUBMED, does VEP annotate the provided VCF file with, for example, ClinVar clinical significance because (as the info.txt file implies) the cache files already include ClinVar data? Or do I (also) need to download a ClinVar file and pass it with the --custom flag?

--custom CLINVAR.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN

The latter seems to be meant for a ClinVar release different from the one that ships in the cache, but I am not seeing clinical significance in the results, so I don't know whether I made a mistake or whether ClinVar simply has no pathogenicity assertions for the provided variants. I have run VEP with and without the --custom flag (both times with the --fields list) and ran diff on the two outputs. There are differences, but they are likely confined to the # (header) lines; visual inspection was not an effective way to find any others.
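One way to replace the visual inspection is to count how many annotation entries actually carry a value for each CSQ field. The sketch below is a minimal illustration, not part of VEP: it assumes the usual VEP VCF layout, where the `##INFO=<ID=CSQ` header line lists the field order after `Format: ` and each record's `CSQ` value holds comma-separated entries with `|`-separated fields. The function names and the demo records (including `GENE1` and `GENE2`) are made up for illustration.

```python
import gzip
import re
from collections import Counter


def open_text(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def csq_field_counts(lines):
    """Count, per CSQ field, how many annotation entries have a non-empty value."""
    names = []
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            # The header description lists the field order after "Format: ".
            match = re.search(r'Format: ([^"]+)"', line)
            names = match.group(1).split("|") if match else []
            continue
        if not line or line.startswith("#"):
            continue
        info = line.split("\t")[7]  # INFO is the 8th VCF column
        for item in info.split(";"):
            if not item.startswith("CSQ="):
                continue
            for entry in item[4:].split(","):
                for name, value in zip(names, entry.split("|")):
                    if value:
                        counts[name] += 1
    return names, counts


# Hypothetical in-memory records standing in for a real VEP output file.
demo = [
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Consequence|SYMBOL|CLIN_SIG">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\tCSQ=missense_variant|GENE1|,intron_variant|GENE1|",
    "1\t200\t.\tC\tT\t.\tPASS\tCSQ=synonymous_variant|GENE2|benign",
]
names, counts = csq_field_counts(demo)
for name in names:
    print(name, counts[name])
```

For a real file, `csq_field_counts(open_text(path))` can be run on both outputs and the two counters compared. A count of zero for `CLIN_SIG` in every run would suggest the field is never populated, rather than that these particular variants lack ClinVar assertions.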

For reference, this is the code I executed.

rule vep:
   input:
      vcf = wes_output + 'genomics_db/snp_indel_final.recalibrated.de.nm.vcf.gz'
   params:
      ref = REF_FASTQ,
      cache_dir = '/x/xx/xxx/xxxx/installed_packages/vep_cache/', #/homo_sapiens
      clinvar = 'CLINVAR.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN',
   output:
      vcf = wes_output + 'genomics_db/snp_indel_final.recalibrated.de.nm.vep.vcf.gz',
      html = wes_output + 'genomics_db/snp_indel_final.recalibrated.de.nm.vep.html'
   threads: 16
   log:
      log_output = vep_log + 'vep.log'
   shell:
         "vep -i {input.vcf} --vcf -o {output.vcf} --format vcf --stats_file {output.html} \
            --offline --cache --dir_cache {params.cache_dir} --fork {threads} --compress_output bgzip \
            --sift b --fields   Consequence,IMPACT,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,AF,gnomAD_AF,CLIN_SIG,PUBMED \ 
            #--custom {params.clinvar} 
            2> {log.log_output}"

Thank you in advance!

likhitha-surapaneni commented 2 months ago

Hi @errcricket ,

Thank you for providing the details of the issue. Can you please try using --check_existing alongside --fields? This flag identifies known variants co-located with each input variant. VEP's variant cache contains variants from dbSNP and other sources.
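As a rough illustration of this suggestion, the sketch below assembles the command as an argument list with `--check_existing` added. It is a hedged example rather than a recommended configuration: `build_vep_command` and the paths are hypothetical, and only flags already used in this thread are included. Other requested fields may need their own flags (for example `--symbol`, `--biotype`, `--polyphen`, `--af`, `--pubmed`), which should be checked against the VEP options documentation.

```python
import shlex


def build_vep_command(input_vcf, output_vcf, cache_dir, fields, forks=4, custom=None):
    """Assemble a VEP argument list; optional flags are appended, never commented out."""
    args = [
        "vep",
        "-i", input_vcf,
        "-o", output_vcf,
        "--format", "vcf",
        "--vcf",
        "--compress_output", "bgzip",
        "--offline",
        "--cache",
        "--dir_cache", cache_dir,
        "--fork", str(forks),
        "--sift", "b",
        "--check_existing",  # look up co-located known variants in the cache
        "--fields", ",".join(fields),
    ]
    if custom:
        args += ["--custom", custom]
    return args


# Placeholder paths, for illustration only.
cmd = build_vep_command(
    "input.vcf.gz",
    "output.vep.vcf.gz",
    "/path/to/vep_cache",
    ["Consequence", "IMPACT", "SYMBOL", "SIFT", "CLIN_SIG"],
)
print(shlex.join(cmd))
```

Building the list in Python also avoids leaving a `#` inside a shell string: the shell treats a word starting with `#` as the beginning of a comment, so anything joined onto the same line after it, such as a `2>` redirect, would be ignored.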

errcricket commented 2 months ago

Thank you for the response. Can you first address this question please?

When running VEP, if the --fields of interest are Consequence,IMPACT,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,AF,gnomAD_AF,CLIN_SIG,PUBMED, does that mean for example, for ClinVar, VEP annotates a provided vcf file with clinical significance information because (as implied in the info.txt file) the cache files already include ClinVar info?

Or do I (also) need to provide a ClinVar file that I have downloaded, and use the --custom flag?

likhitha-surapaneni commented 2 months ago

Hi @errcricket, yes, VEP annotates the provided VCF file with clinical significance information from the cache, using the ClinVar version listed in the info.txt file.

While we try to include the most recent variant data in each Ensembl release, some projects release data more frequently than we do. If you wish to use the latest annotations from such projects, their data files can be supplied with the --custom flag.
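When --custom is combined with --fields, the custom annotations probably need to be named in the --fields list as well, since that list restricts which fields are written. The sketch below assumes the positional --custom format used earlier in this thread and assumes that VEP names the output fields after the short name (`ClinVar` for the matching record, then `ClinVar_CLNSIG` and so on). Both points are worth confirming against the header of an actual output file. The helper names are hypothetical.

```python
def parse_custom_spec(spec):
    """Split a positional --custom string: file,short_name,format,type,coords,fields..."""
    parts = spec.split(",")
    if len(parts) < 5:
        raise ValueError("expected at least file,short_name,format,type,coords")
    parsed = dict(zip(("file", "short_name", "format", "type", "coords"), parts[:5]))
    parsed["fields"] = parts[5:]
    return parsed


def custom_output_fields(parsed):
    """Output field names, assuming the <short_name>_<INFO field> convention."""
    name = parsed["short_name"]
    return [name] + [f"{name}_{field}" for field in parsed["fields"]]


def extend_fields(fields, extra):
    """Append extra field names to a --fields list, skipping duplicates."""
    return list(fields) + [f for f in extra if f not in fields]


spec = "CLINVAR.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN"
custom_fields = custom_output_fields(parse_custom_spec(spec))
all_fields = extend_fields(["Consequence", "SYMBOL", "CLIN_SIG"], custom_fields)
print(",".join(all_fields))
```

If this assumption holds, it would also explain why two runs with the same --fields list, one with --custom and one without, differ mainly in their header lines.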

Please let us know if you have any more questions.

Kind regards, Likhitha

errcricket commented 2 months ago

Thank you @likhitha-surapaneni, I think I have what I need.