Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Possible bug with flag_pick_allele #1684

Open GACGAMA opened 1 month ago

GACGAMA commented 1 month ago

System

VEP 111 Docker/Singularity

Full VEP command line

singularity exec -H /scratch4/nsobrei2/references/vep_cache_singularity /scratch4/nsobrei2/singularities/vep.sif vep --fork 2 --cache -i vcf --species homo_sapiens --assembly GRCh38  --buffer_size 1000  --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --format vcf --input_file \$1 --vcf --output_file \$vep_path/annotated/\$filenames2.unfiltered.vcf --force_overwrite --pubmed  --regulatory --polyphen b --af --max_af --af_1kg --af_gnomade --af_gnomadg --gene_phenotype --plugin dbNSFP,"\$dbNSFP",SIFT_pred,SIFT4G_pred,Polyphen2_HDIV_pred,CADD_phred,MetaRNN_score,MetaRNN_pred,REVEL_score,BayesDel_noAF_score,BayesDel_noAF_pred,ClinPred_score,ClinPred_pred,clinvar_id,clinvar_clnsig,clinvar_trait,clinvar_review,clinvar_var_source,clinvar_OMIM_id --custom file=/scratch4/nsobrei2/references/gnomad_40/gnomad4.0_GRCh38_combined_af.vcf.bgz,short_name=gnomad_4_0_genomes_and_exomes,fields=AF%AC%AN%nhomalt,format=vcf,type=exact,coords=0 

Problem

As you can see from my command, I'm annotating my file with many different flags. With this command, I get a multisample VCF file with 118028 variants. But, if I filter this file with PICK = 1, I get only 110713 variants.

Is this the expected behaviour when flagging variants? I thought every variant would always have at least one transcript flagged.

olaaustine commented 1 month ago

Hi @GACGAMA, hope you are well. Could you please share your filter_vep command? According to the documentation for --flag_pick_allele, the PICK flag is added to the chosen block of consequence data. Let us know if you expect something different. Thank you, Ola.

GACGAMA commented 3 weeks ago

Hi @olaaustine

I'm currently using: singularity exec -H /scratch4/nsobrei2/references/vep_cache_singularity /scratch4/nsobrei2/singularities/vep.sif filter_vep --force_overwrite --input_file {1} --output_file /scratch4/nsobrei2/ggama1/OMIM_GENES/vep/filtering/PICK1_STEP2.vcf --only_matched --filter "PICK = 1"

Before that, I tried in R by expanding the CSQ column. With both methods I get 118028 variants without filtering.

To count variants I used both bcftools and R. In R, I expanded the CSQ column and removed duplicates based on chrom, pos, alt, ref on a normalized VCF. With bcftools I ran: bcftools query -f '%POS\n' myvcf | wc -l
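For reference, the two counting methods described above do not measure exactly the same thing: the bcftools command counts VCF records, while the R approach counts unique (chrom, pos, ref, alt) combinations. Below is a minimal Python sketch that reports both numbers so they can be compared on the same file. The function name `count_variants` and the toy records are made up for illustration and are not part of VEP or bcftools.

```python
def count_variants(vcf_lines):
    """Return (record_count, unique_allele_count) for an iterable of VCF lines.

    record_count mirrors counting one line per record;
    unique_allele_count mirrors de-duplicating on (chrom, pos, ref, alt).
    """
    records = 0
    unique_alleles = set()
    for line in vcf_lines:
        # Skip blank lines and all header lines (## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
        records += 1
        # A multi-allelic record lists several ALT alleles separated by commas.
        for alt in alts.split(","):
            unique_alleles.add((chrom, pos, ref, alt))
    return records, len(unique_alleles)


# Synthetic records, not taken from the reporter's data.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t100\t.\tA\tG\t.\t.\t.",    # exact duplicate record
    "chr1\t200\t.\tC\tT,G\t.\t.\t.",  # multi-allelic record
    "chr1\t300\t.\tT\tC,G\t.\t.\t.",  # multi-allelic record
]

print(count_variants(EXAMPLE_VCF))
```

On a real file the lines could be read with `open(path)` (or `gzip.open(path, "rt")` for a compressed VCF) and passed straight to the function. If the two numbers differ, the file contains duplicate or multi-allelic records, which is worth ruling out before comparing against the PICK-filtered counts.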

But when I filter on the PICK column in R, I get only 110713 variants. If I use filter_vep, I get only 110722 variants.

I expected that Ensembl VEP would always PICK at least one transcript per variant with --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length
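One way to narrow this down is to list the records that have no consequence block with PICK set to 1, since those are the ones the filter drops. The sketch below is a rough Python version of that check. It assumes the usual VEP VCF layout, where the CSQ header line describes the field order after "Format:" and each record's CSQ value is a comma-separated list of pipe-delimited blocks. The function names and the toy records are hypothetical, and the check is per record, not per allele.

```python
import re


def csq_field_names(vcf_lines):
    """Read the CSQ sub-field order from the ##INFO=<ID=CSQ,...> header line."""
    for line in vcf_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).strip().split("|")
    raise ValueError("no CSQ header line with a Format description found")


def records_without_pick(vcf_lines):
    """Return (chrom, pos, ref, alt) for records with no CSQ block where PICK == '1'."""
    lines = list(vcf_lines)
    pick_idx = csq_field_names(lines).index("PICK")
    missing = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Pull the CSQ value out of the INFO column, if it is present at all.
        csq = None
        for entry in fields[7].split(";"):
            if entry.startswith("CSQ="):
                csq = entry[len("CSQ="):]
        blocks = [block.split("|") for block in csq.split(",")] if csq else []
        picked = any(len(b) > pick_idx and b[pick_idx] == "1" for b in blocks)
        if not picked:
            missing.append((fields[0], fields[1], fields[3], fields[4]))
    return missing


# Synthetic example with a shortened CSQ format; not real VEP output.
EXAMPLE_ANNOTATED = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|PICK">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\tCSQ=G|missense_variant|GENE1|1,G|intron_variant|GENE1|",
    "chr1\t200\t.\tC\tT\t.\t.\tCSQ=T|intron_variant|GENE2|",  # CSQ present, PICK empty
    "chr1\t300\t.\tT\tC\t.\t.\tDP=10",                        # no CSQ at all
]

for variant in records_without_pick(EXAMPLE_ANNOTATED):
    print(variant)
```

Running this on the annotated VCF would separate records that have no CSQ annotation at all from records that have consequence blocks but no PICK flag, and the resulting list could serve as a starting point for a minimal example file.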

olaaustine commented 2 weeks ago

Hi @GACGAMA, hope you are well. If possible, could you share your input file or a subset of the variants in your input so I can try to recreate the issue? Thank you very much, Ola.

GACGAMA commented 1 week ago

Hi @olaaustine

I can send a minimal working example VCF. Is there an email address I can send this data to? I've been able to reproduce this issue on multiple files.

olaaustine commented 1 week ago

Hi @GACGAMA, if possible, can you share the minimal working example VCF here? If that's not possible, please do not hesitate to send it using this link: https://www.ensembl.org/Help/Contact with the title "flag_pick_allele VCF file". Hoping to hear from you soon. Thank you very much, Ola.