Ensembl / VEP_plugins

Plugins for the Ensembl Variant Effect Predictor (VEP)
Apache License 2.0

Missing ZYG field from VEP output #724

Open IanCodes opened 1 week ago

IanCodes commented 1 week ago

Hello,

I have been using VEP v111 to annotate Freebayes VCF files. We have noticed that the ZYG field is missing from the output. Is this expected?

The command line for VEP was: qsub -pe smp.pe 4 -V -cwd -N vep_fb_005SN_S25 -b y 'vep --offline --cache --dir_cache REDACTED_PATH/.conda/envs/VEP111/ --species homo_sapiens --dir_plugins REDACTED_PATH/.vep/Plugins/ --everything --tab --assembly GRCh38 -i 005SN_S25_hg38_freebayes136_MAPQ20_QUAL20_COV10_controls_subtracted.vcf.gz -o 005SN_S25_hg38_freebayes136_MAPQ20_QUAL20_COV10_controls_subtracted_v111.vep --force_overwrite --fork 4 --plugin AlphaMissense,file=REDACTED_PATH/.conda/envs/VEP111/AlphaMissense_data/AlphaMissense_hg38.tsv.gz --plugin CADD,snv=REDACTED_PATH/.conda/envs/VEP111/CADD_data/whole_genome_SNVs.tsv.gz,indels=REDACTED_PATH/.conda/envs/VEP111/CADD_data/gnomad.genomes.r4.0.indel.tsv.gz,force_annotate=1 --plugin gnomADc,REDACTED_PATH/.conda/envs/VEP111/gnomad_data/gnomad.ch.genomesv3.tabbed.tsv.gz --plugin REVEL,file=REDACTED_PATH/.conda/envs/VEP111/REVEL_data/new_tabbed_revel_grch38.tsv.gz --plugin SpliceAI,snv=REDACTED_PATH/.conda/envs/VEP111/spliceai_data/spliceai_scores.raw.snv.hg38.vcf.gz,indel=REDACTED_PATH/.conda/envs/VEP111/spliceai_data/spliceai_scores.raw.indel.hg38.vcf.gz'

An example of the VCF input follows.

Thank you, Ian

##fileformat=VCFv4.2
##fileDate=20240413
##source=freeBayes v1.3.6
##reference=REDACTED_PATH/hg38.analysisSet.fa
##contig=<ID=chr10,length=133797422>
etc...
##phasing=none
##commandline="freebayes -f RREDACTED_PATH/hg38.analysisSet.fa --min-mapping-quality 20 --min-base-quality 20 --ploidy 2 --min-coverage 10 --bam 259SB_S48_trimmoPE_bwa-mem_hg38_PP-P_fixmate_sort_flagDups.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
##INFO=<ID=PQA,Number=A,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RUN,Number=A,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
##INFO=<ID=RPP,Number=A,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPL,Number=A,Type=Float,Description="Reads Placed Left: number of reads supporting the alternate balanced to the left (5') of the alternate allele">
##INFO=<ID=RPR,Number=A,Type=Float,Description="Reads Placed Right: number of reads supporting the alternate balanced to the right (3') of the alternate allele">
##INFO=<ID=EPP,Number=A,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio.  Ratio between depth in samples with each called alternate allele and those without.">
##INFO=<ID=ODDS,Number=1,Type=Float,Description="The log odds ratio of the best genotype combination to the second-best.">
##INFO=<ID=GTI,Number=1,Type=Integer,Description="Number of genotyping iterations required to reach convergence or bailout.">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
##INFO=<ID=NUMALT,Number=1,Type=Integer,Description="Number of unique non-reference alleles in called genotypes at this position.">
##INFO=<ID=MEANALT,Number=A,Type=Float,Description="Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.">
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
##INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
##INFO=<ID=END,Number=1,Type=Integer,Description="Last position (inclusive) in gVCF output record.">
##INFO=<ID=technology.ILLUMINA,Number=A,Type=Float,Description="Fraction of observations supporting the alternate observed in reads from ILLUMINA">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  259SB_S48
chr10   47428   .       G       A       7.87328e-13     .       AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=32;DPB=32;DPRA=0;EPP=3.0103;EPPR=3.29983;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=29.3549;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=1110;RO=30;RPL=0;RPP=7.35324;RPPR=3.29983;RPR=2;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=10;SRP=10.2485;SRR=20;TYPE=snp;technology.ILLUMINA=1   GT:DP:AD:RO:QR:AO:QA:GL 0/0:32:30,2:30:1110:2:74:0,-2.60708,-93.1853
chr10   47565   .       C       T       2.78594e-11     .       AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=29;DPB=29;DPRA=0;EPP=7.35324;EPPR=5.02092;GTI=0;LEN=1;MEANALT=1;MQM=43.5;MQMR=44.2593;NS=1;NUMALT=1;ODDS=25.7724;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=987;RO=27;RPL=1;RPP=3.0103;RPPR=3.09072;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=13;SRP=3.09072;SRR=14;TYPE=snp;technology.ILLUMINA=1     GT:DP:AD:RO:QR:AO:QA:GL 0/0:29:27,2:27:987:2:74:0,-1.86736,-74.7447
chr10   47615   .       C       T       3.41769e-06     .       AB=0.1;ABP=30.8051;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=20;DPB=20;DPRA=0;EPP=3.0103;EPPR=4.9405;GTI=0;LEN=1;MEANALT=1;MQM=43.5;MQMR=38.3889;NS=1;NUMALT=1;ODDS=14.0551;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=666;RO=18;RPL=2;RPP=7.35324;RPPR=15.074;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=7;SRP=4.9405;SRR=11;TYPE=snp;technology.ILLUMINA=1       GT:DP:AD:RO:QR:AO:QA:GL 0/1:20:18,2:18:666:2:74:-0.841915,0,-47.5773
chr10   47663   .       C       T       322.572 .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=12;CIGAR=1X;DP=12;DPB=12;DPRA=0;EPP=21.1059;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=41.5833;MQMR=0;NS=1;NUMALT=1;ODDS=21.2407;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=444;QR=0;RO=0;RPL=7;RPP=3.73412;RPPR=0;RPR=5;RUN=1;SAF=6;SAP=3.0103;SAR=6;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1    GT:DP:AD:RO:QR:AO:QA:GL 1/1:12:0,12:0:0:12:444:-36.1507,-3.61236,0
chr10   47771   .       CAGA    TAGC    5.02082e-12     .       AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X2M1X;DP=27;DPB=27;DPRA=0;EPP=7.35324;EPPR=3.79203;GTI=0;LEN=4;MEANALT=1;MQM=31;MQMR=59.48;NS=1;NUMALT=1;ODDS=27.4873;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=904;RO=25;RPL=0;RPP=7.35324;RPPR=7.26639;RPR=2;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=14;SRP=3.79203;SRR=11;TYPE=complex;technology.ILLUMINA=1       GT:DP:AD:RO:QR:AO:QA:GL 0/0:27:25,2:25:904:2:74:0,-2.54891,-76.0633
etc.
dglemos commented 1 week ago

Hi @IanCodes, The column ZYG is populated by option --individual or --individual_zyg. I don't see any of those options in your command.

The option --everything switches on the options listed here: https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_everything. This list does not include the individual options.

Let me know if you have more questions.

Best wishes, Diana
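
For readers unfamiliar with the ZYG column, the sketch below shows how a zygosity label can be derived from a VCF GT string such as the `0/0`, `0/1` and `1/1` genotypes in the example records above. It is only an illustration of the concept and not VEP's own code; the function name `zygosity` and the `MISSING` label for no-calls are hypothetical.

```python
import re


def zygosity(gt: str) -> str:
    """Classify a VCF GT string such as '0/1' or '1|1'.

    The HOMREF/HET/HOM labels mirror the values seen in the ZYG column;
    'MISSING' is a hypothetical label used here for no-calls.
    """
    alleles = re.split(r"[/|]", gt)  # accept unphased and phased separators
    if "." in alleles:
        return "MISSING"
    if len(set(alleles)) > 1:
        return "HET"
    return "HOMREF" if alleles[0] == "0" else "HOM"


for gt in ("0/0", "0/1", "1/1"):
    print(gt, zygosity(gt))
```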

IanCodes commented 1 week ago

@dglemos Thank you very much for your reply, Diana. It appears to be just what I need.

dglemos commented 1 week ago

I'm glad it worked!

Best wishes, Diana

IanCodes commented 1 week ago

Apologies for the follow up, but I am observing that when I use '--individual_zyg all' I receive fewer lines of VEP output when using the same VCF file. Are there types of variant that are not processed using this flag?

dglemos commented 1 week ago

Can you show me an example please?

IanCodes commented 1 week ago

Thank you for your fast response. I have with and without ZYG files, but they are big. What would be the best method of sharing them?

dglemos commented 1 week ago

You can send your files to helpdesk@ebi.ac.uk or if they are too big to send by email, you can send a sample of the files.

IanCodes commented 1 week ago

I have extracted the chr10 results from the VEP output with and without ZYG. Thank you. VEP_with_and_without_ZYG.zip

dglemos commented 1 week ago

Thank you! Can you also send the input files?

IanCodes commented 1 week ago

Sorry for the delay; here is the chr10 part of the VCF file (and headers). There were a number of plugins with huge files, so I'm not sure how well you'll be able to repeat my analysis. Let me know if you need anything else. Thank you. chr10.zip

dglemos commented 6 days ago

The variants missing from the output with_ZYG_chr10.vep have genotype HOMREF (homozygous reference). An example:

chr10   47461   .       G       A       1.08712e-11     .       AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=30;DPB=30;DPRA=0;EPP=3.0103;EPPR=10.7656;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=58.6786;NS=1;NUMALT=1;ODDS=26.7136;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=1036;RO=28;RPL=2;RPP=7.35324;RPPR=5.80219;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=12;SRP=4.25114;SRR=16;TYPE=snp;technology.ILLUMINA=1      GT:DP:AD:RO:QR:AO:QA:GL 0/0:30:28,2:28:1036:2:74:0,-2.00502,-86.2858

Using the option --individual_zyg these should still be in the output.
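
One way to check this claim on any pair of files is sketched below: collect the homozygous-reference records from the VCF and report those whose position never appears in the Location column of the VEP `--tab` output. This is a rough sketch under stated assumptions: the function names are hypothetical, only the first sample column is read, and only single-base REF alleles are considered so that Location has the simple `chrom:pos` form.

```python
def homref_snv_locations(vcf_lines):
    """Return 'chrom:pos' for single-base records whose first sample is 0/0 or 0|0."""
    locations = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        # Whitespace split also copes with pasted records where tabs became spaces.
        fields = line.split()
        if len(fields) < 10:
            continue  # no sample column, nothing to classify
        chrom, pos, ref = fields[0], fields[1], fields[3]
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        gt = sample_values[format_keys.index("GT")]
        alleles = gt.replace("|", "/").split("/")
        if len(ref) == 1 and all(a == "0" for a in alleles):
            locations.append(f"{chrom}:{pos}")
    return locations


def vep_locations(vep_lines):
    """Collect the Location column (second field) from VEP --tab output."""
    return {
        line.split()[1]
        for line in vep_lines
        if line.strip() and not line.startswith("#")
    }


def missing_homref(vcf_lines, vep_lines):
    """HOMREF single-base records with no row at that Location in the VEP output."""
    seen = vep_locations(vep_lines)
    return [loc for loc in homref_snv_locations(vcf_lines) if loc not in seen]


# Tiny made-up inputs, shortened from the records shown in this thread.
demo_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr10\t47461\t.\tG\tA\t1.0\t.\tDP=30\tGT:DP\t0/0:30",
    "chr10\t47615\t.\tC\tT\t3.4\t.\tDP=20\tGT:DP\t0/1:20",
]
demo_vep = [
    "#Uploaded_variation\tLocation\tAllele",
    "chr10_47615_C/T\tchr10:47615\tT",
]
print(missing_homref(demo_vcf, demo_vep))
```

With real files, the line lists can be read with `open(path)` (or `gzip.open(path, "rt")` for a compressed VCF).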

Can you try running vep again without extra options, something like this:

vep --offline --cache --dir_cache REDACTED_PATH/.conda/envs/VEP111/ --species homo_sapiens --tab --assembly GRCh38 -i <input_file> -o <output_file> --individual_zyg all
IanCodes commented 6 days ago

@dglemos I ran the command using the chr10.vep. HOMREF variants are present in the output. e.g. chr10_47461_G/G chr10:47461 G ENSG00000237297 ENST00000416477 Transcript downstream_gene_variant - - - - - - 401LF_S113:HOMREF MODIFIER 577 1 -

Does this mean there is a conflict with one of the plugins?

dglemos commented 6 days ago

Thanks for checking! I don't see how any of these plugins would interfere with the number of lines in the output.

Can you please try the following commands:

vep --offline --cache --dir_cache REDACTED_PATH/.conda/envs/VEP111/ --species homo_sapiens --tab --assembly GRCh38 -i <input_file> -o <output_file> --individual_zyg all --everything
vep \
--offline \
--cache \
--dir_cache REDACTED_PATH/.conda/envs/VEP111/ \
--species homo_sapiens \
--tab \
--assembly GRCh38 \
-i <input_file> \
-o <output_file> \
--individual_zyg all \
--plugin AlphaMissense,file=REDACTED_PATH/.conda/envs/VEP111/AlphaMissense_data/AlphaMissense_hg38.tsv.gz \
--plugin CADD,snv=REDACTED_PATH/.conda/envs/VEP111/CADD_data/whole_genome_SNVs.tsv.gz,indels=REDACTED_PATH/.conda/envs/VEP111/CADD_data/gnomad.genomes.r4.0.indel.tsv.gz,force_annotate=1 \
--plugin gnomADc,REDACTED_PATH/.conda/envs/VEP111/gnomad_data/gnomad.ch.genomesv3.tabbed.tsv.gz \
--plugin REVEL,file=REDACTED_PATH/.conda/envs/VEP111/REVEL_data/new_tabbed_revel_grch38.tsv.gz \
--plugin SpliceAI,snv=REDACTED_PATH/.conda/envs/VEP111/spliceai_data/spliceai_scores.raw.snv.hg38.vcf.gz,indel=REDACTED_PATH/.conda/envs/VEP111/spliceai_data/spliceai_scores.raw.indel.hg38.vcf.gz
IanCodes commented 6 days ago

Hello.

With only --individual_zyg I get 49509 lines in the VEP file
With --individual_zyg all --everything I get 51653 lines
With --individual_zyg all + plugins I get 49527 lines

dglemos commented 6 days ago

Can you please send the output for chr10_47461_G/G in all of those output files?

IanCodes commented 5 days ago

Thank you for your continuing effort!

--individual_zyg

chr10_47461_G/G chr10:47461     G       ENSG00000237297 ENST00000416477 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        577     1       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000561967 Transcript      3_prime_UTR_variant     1009    -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        -       -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000562809 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        33      -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000563456 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        244     -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000564130 Transcript      synonymous_variant      1869    829     277     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000567466 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        117     -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568584 Transcript      synonymous_variant      989     931     311     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568866 Transcript      synonymous_variant      859     820     274     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -

--individual_zyg all --everything

chr10_47461_G/G chr10:47461     G       ENSG00000237297 ENST00000416477 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        577     1       -       SNV     -       -       -       unprocessed_pseudogene  YES     -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000561967 Transcript      3_prime_UTR_variant     1009    -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        -       -1      -       SNV     TUBB8   HGNC    HGNC:20773      protein_coding  -       -       -       5       -       -       ENSP00000454878 -       A0A075B724.50   UPI0001B790EC   -       1       -       -       4/4     -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000562809 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        33      -1      -       SNV     TUBB8   HGNC    HGNC:20773      protein_coding  -       -       -       5       -       -       ENSP00000456899 -       A0A075B735.43   UPI0001B790ED   -       1       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000563456 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        244     -1      -       SNV     TUBB8   HGNC    HGNC:20773      retained_intron -       -       -       5       -       -       -       -       -       -       -       1       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000564130 Transcript      synonymous_variant      1869    829     277     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       SNV     TUBB8   HGNC    HGNC:20773      protein_coding  -       -       -       5       -       -       ENSP00000457610 -       Q5SQY0.149      UPI0000197C79   -       1       -       -       4/4     -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000567466 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        117     -1      -       SNV     TUBB8   HGNC    HGNC:20773      nonsense_mediated_decay -       -       -       5       -       -       ENSP00000454914 -       A0A075B725.31   UPI0001B790EE   -       1       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568584 Transcript      synonymous_variant      989     931     311     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       SNV     TUBB8   HGNC    HGNC:20773      protein_coding  YES     NM_177987.3     -       1       P1      CCDS7051.1      ENSP00000456206 Q3ZCM7.150      -       UPI000007238E   -       1       -       -       4/4     -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568866 Transcript      synonymous_variant      859     820     274     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       SNV     TUBB8   HGNC    HGNC:20773      protein_coding  -       -       -       5       -       -       ENSP00000457062 -       A0A075B736.56   UPI000047C3D1   -       1       -       -       3/3     -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -       -

--individual_zyg all + plugins

chr10_47461_G/G chr10:47461     G       ENSG00000237297 ENST00000416477 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        577     1       -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000561967 Transcript      3_prime_UTR_variant     1009    -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        -       -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000562809 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        33      -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000563456 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        244     -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000564130 Transcript      synonymous_variant      1869    829     277     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000567466 Transcript      downstream_gene_variant -       -       -       -       -       -       401LF_S113:HOMREF       MODIFIER        117     -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568584 Transcript      synonymous_variant      989     931     311     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
chr10_47461_G/G chr10:47461     G       ENSG00000261456 ENST00000568866 Transcript      synonymous_variant      859     820     274     L       Cta/Cta -       401LF_S113:HOMREF       LOW     -       -1      -       -       -       -       -       0.0001  0.9998  0.9977  0.9998  0.9393  0.7212  0.4243  0.0359  0.9998  29.4540 28.0000 2111922.0000    -       -
dglemos commented 5 days ago

The variant is in all outputs with the correct value 401LF_S113:HOMREF. This indicates that the option --individual_zyg is behaving as expected.

For the different number of lines, the option --everything switches on --regulatory, which reports whether the variant overlaps regulatory regions.

Output example without --everything:

chr10_132898972_T/T chr10:132898972 T   ENSG00000176769 ENST00000368642 Transcript  intron_variant  -   -   -   -   -   -401LF_S113:HOMREF  MODIFIER    -   -1  -
chr10_132898972_T/T chr10:132898972 T   ENSG00000230098 ENST00000436942 Transcript  downstream_gene_variant -   -   -   -   -401LF_S113:HOMREF  MODIFIER    4932    1   -
chr10_132898972_T/T chr10:132898972 T   ENSG00000176769 ENST00000483040 Transcript  intron_variant,non_coding_transcript_variant    -   -401LF_S113:HOMREF  MODIFIER    -   -1  -

Output example with --everything:

chr10_132898972_T/T chr10:132898972 T   ENSG00000176769 ENST00000368642 Transcript  intron_variant  -   -   -   -   -   -401LF_S113:HOMREF  MODIFIER    -   -1  -   SNV TCERG1L HGNC    23533   protein_coding  YES -   -   -   -   -CCDS7662.2 ENSP00000357631 TCRGL_HUMAN -   UPI00004589C8   -   -   -   -   -   10/11   -   -   -   -   --
chr10_132898972_T/T chr10:132898972 T   ENSG00000230098 ENST00000436942 Transcript  downstream_gene_variant -   -   -   -   -401LF_S113:HOMREF  MODIFIER    4932    1   -   SNV TCERG1L-AS1 HGNC    49532   antisense   YES -   -   -   --
chr10_132898972_T/T chr10:132898972 T   ENSG00000176769 ENST00000483040 Transcript  intron_variant,non_coding_transcript_variant    -   -401LF_S113:HOMREF  MODIFIER    -   -1  -   SNV TCERG1L HGNC    23533   retained_intron -   -   -   -   -   -10/11  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   --
chr10_132898972_T/T chr10:132898972 T   -   ENSR00000993699 RegulatoryFeature   regulatory_region_variant   -   -   -   -401LF_S113:HOMREF  MODIFIER    -   -   -   SNV -   -   -   enhancer    -   -   -   -   -   -

As you can see, the last example has one more line because the variant overlaps a regulatory region.
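
To see how much of a line-count difference comes from regulatory rows, the data rows of each output can be tallied by feature type. The sketch below assumes the usual `--tab` layout, in which metadata lines start with `##` and the column header line starts with `#Uploaded_variation` and contains a `Feature_type` column; the function name is hypothetical.

```python
from collections import Counter


def feature_type_counts(vep_lines):
    """Count data rows per Feature_type in VEP --tab output.

    Metadata lines ('##...') and the column header are excluded, so the
    total is the number of annotation rows only.
    """
    counts = Counter()
    column = None
    for line in vep_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            column = fields.index("Feature_type")  # locate the column by name
            continue
        if column is None:
            raise ValueError("no column header line found before data rows")
        counts[fields[column]] += 1
    return counts


# Shortened, made-up rows modelled on the chr10:132898972 example.
demo = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR",
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tFeature_type\tConsequence",
    "v1\tchr10:132898972\tT\tENSG00000176769\tENST00000368642\tTranscript\tintron_variant",
    "v1\tchr10:132898972\tT\tENSG00000230098\tENST00000436942\tTranscript\tdownstream_gene_variant",
    "v1\tchr10:132898972\tT\t-\tENSR00000993699\tRegulatoryFeature\tregulatory_region_variant",
]
counts = feature_type_counts(demo)
print(dict(counts), "total rows:", sum(counts.values()))
```

Comparing the per-type counts of two runs shows whether the extra lines are all `RegulatoryFeature` rows or whether transcript rows differ as well.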

IanCodes commented 2 days ago

Thank you for your help. Unfortunately it doesn't solve my problem. The original run used --everything and plugins. The expectation was that adding '--individual_zyg all' would just add another field to the output. It does, but lines of output are missing. So, for some variants --individual_zyg must be causing some difference.

dglemos commented 2 days ago

I cannot reproduce the issue. Can you send an example of a variant with missing data or missing from the output?

With only --individual_zyg I get 49509 lines in the VEP file
With --individual_zyg all --everything I get 51653 lines
With --individual_zyg all + plugins I get 49527 lines

Using this example, what are the counts when you run --individual_zyg all + --everything + plugins?

IanCodes commented 2 days ago

These are the tallies of the various runs. The numbers are a little different from the previous ones, which included the header lines. I'll need to get back to you on the first part.

49467 --individual_zyg all
49467 --individual_zyg all --plugins
51551 --individual_zyg all --everything
51551 --individual_zyg all --everything --plugins
51587 --everything
51587 --everything --plugins
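
The tallies differ by 36 lines between the --everything runs with and without --individual_zyg all (51587 versus 51551). A possible way to identify those rows is sketched below: key each data row by its Location and Feature columns (the second and fifth columns in the default `--tab` layout) and list the keys found in only one file. Keying on these two columns rather than on the first column is an assumption made here, because the Uploaded_variation value is written as a genotype (for example chr10_47461_G/G) in the --individual_zyg outputs shown above. The function names are hypothetical.

```python
def row_keys(vep_lines):
    """Return the set of (Location, Feature) pairs in VEP --tab output."""
    keys = set()
    for line in vep_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank, metadata and header lines
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[1], fields[4]))
    return keys


def only_in_first(first_lines, second_lines):
    """Rows, as (Location, Feature), present in the first output but not the second."""
    return sorted(row_keys(first_lines) - row_keys(second_lines))


# Made-up miniature outputs; real use would pass open file handles instead.
without_zyg = [
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature",
    "chr10_47461_G/A\tchr10:47461\tA\tENSG00000237297\tENST00000416477",
    "chr10_47615_C/T\tchr10:47615\tT\tENSG00000237297\tENST00000416477",
]
with_zyg = [
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature",
    "chr10_47461_G/G\tchr10:47461\tG\tENSG00000237297\tENST00000416477",
]
for location, feature in only_in_first(without_zyg, with_zyg):
    print(location, feature)
```

Because the keys are stored in a set, rows that share a Location and Feature (for example several alternate alleles at one site) collapse into one key, so the number of reported keys can be smaller than the difference in line counts.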