Closed schreyers closed 21 hours ago
For extra info, this is the VCF data I'm using:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr2 48030639 rs267608087;rs587782425;rs748452299 A AC 27 q30;R3x6;LowVariantFreq DP=9288;clinvar=1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic;cosmic=1|COSM308681;CSQT=1|MSH6|NM_000179.2|frameshift_variant,1|FBXO11|NM_001190274.1|downstream_gene_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:27:9166,122:9288:0.013:20:-11.6920:0.0000 chr2 209113113 rs121913499 G A 100 PASS DP=5374;clinvar=1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|pathogenic;cosmic=1|COSM28747;phyloP=3.794;CSQT=1|IDH1|NM_005896.3|missense_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:100:4280,1092:5374:0.203:20:-100.0000:0.0037 chr3 38184515 . TTG T 23 q30;LowVariantFreq DP=3950;cosmic=1|COSN28015016;CSQT=1|MYD88|NM_001172567.1|downstream_gene_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:23:3893,57:3950:0.014:20:-12.2691:0.0000
I also do some custom changes to merge the FORMAT column into the INFO column
Thanks :)
Hi @schreyers,
The problem is in your input file.
In your VEP command you specify the individual ind
(--individual_zyg ind
) however, your file does not have the ind in the header.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
It should be:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind
Thanks :)
So does the ZYG column not get added in like other columns?
I tried to specify "all" but got the same error.
Is there a way to get the ZYG column without changing the input file? Or if I just add an "ind" column with no data, will that solve the issue?
The data is already in the input file, the only thing missing is the column's name ind
.
Using the second variant from your example:
CHROM: chr3
POS: 38184515
ID: .
REF: TTG
ALT: T
QUAL: 23
FILTER: q30;LowVariantFreq
INFO: DP=3950;cosmic=1|COSN28015016;CSQT=1|MYD88|NM_001172567.1|downstream_gene_variant
FORMAT: GT:GQ:AD:DP:VF:NL:SB:NC
<missing column name>: 0/1:23:3893,57:3950:0.014:20:-12.2691:0.0000
Ah ok, previously there was the case number there but i took that out for privacy.
Ok, I'll give that a go :)
Much appreciated!
@dglemos sorry, it is working (sometimes). Had some small issues before, but resolved now (sometimes).
I was getting this error:
10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720.
So the script to merge the FORMAT data into the INFO column, and also to add the "ind" column with the original data is:
$server_merge_format = 'gawk \'BEGIN { OFS="\t" }
{
if ($0 ~ /^#/) {
# Modify the header line
if ($0 ~ /^#CHROM/) {
n = split($0, headers, OFS);
headers[n] = "ind";
$0 = headers[1];
for (i = 2; i <= n; i++) {
$0 = $0 OFS headers[i];
}
print;
} else {
print;
}
} else {
# Capture INFO and FORMAT fields and merge them appropriately
info = $8;
format_header = $9;
format_values = $10;
format_values_orig = $10;
# Replace commas with hashtags in format_values, this stops data going missing.
gsub(",", "#", format_values);
# Ensure there are no line breaks or extra spaces in info
gsub(/[\r\n\t]/, "", info);
# Construct the new INFO field. The "|" is used instead of comma to stop errors.
new_info = info ";FORMAT=" format_header "|" format_values;
# Ensure there are no line breaks or extra spaces in new_info
gsub(/[\r\n\t]/, "", new_info);
# Construct the new line
new_line = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6 OFS $7 OFS new_info OFS format_header OFS format_values_orig;
print new_line;
}
}\' '. $i_input . ' > ' . $i_input_merged;
@dglemos Sorry to keep going back and forth with this.
I can't seem to see the difference in data that works and ones that throw an error.
This one throws an error:
chr1 115258730 . C . 100 PASS DP=3582;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/0:100:3579:3582:0.001:20:-100.0000:0.0031;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/0:100:3579:3582:0.001:20:-100.0000:0.0031 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:100:3579:3582:0.001:20:-100.0000:0.0031
This one works:
chr2 48030639 rs267608087;rs587782425;rs748452299 A AC 27 q30;R3x6;LowVariantFreq DP=9288;clinvar=1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic;cosmic=1|COSM308681;CSQT=1|MSH6|NM_000179.2|frameshift_variant,1|FBXO11|NM_001190274.1|downstream_gene_variant;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/1:27:9166#122:9288:0.013:20:-11.6920:0.0000 GT:GQ:AD:DP:VF:NL:SB:NC 0/1:27:9166,122:9288:0.013:20:-11.6920:0.0000
For your first variant chr1 115258730 C .
the genotype is 0/0
which means it is homozygous for the reference allele.
Maybe this can help you understand a bit more about why the file has variants with this genotype: https://www.biostars.org/p/349323/ VEP will skip 'variants' like this.
Ok, that does make sense thanks :)
So would adding the --quiet flag solve this issue?
If not thats ok, as that article you sent says that you could just get the ZYG value from the GT variable
Can you please try with the option --process_ref_homs
?
Gave that a go but still got the same error:
ANNOTATION failed: 255 ANNOTATION CMD: ./vep --cache --dir_cache /opt/.vep/ --no_stats --check_existing --sift b --polyphen b --symbol --numbers --keep_csq --canonical --use_transcript_ref --exclude_predicted --transcript_version --protein --pubmed --variant_class --gene_phenotype --individual_zyg ind --allele_number --show_ref_allele --uploaded_allele --use_given_ref --hgvsg_use_accession --process_ref_homs --hgvs --fasta /opt/ensembl-vep/FASTA/Homo_sapiens.GRCh37.dna.primary_assembly.fa --force_overwrite --species homo_sapiens --merged --assembly GRCh37 --tab -i /home/VEPTEST_CSQT_sorted.vcf -o /home/VEPTEST_CSQT.txt --custom file=/home/VEPTEST_CSQT.vcf.gz,format=vcf,short_name=EXTRA,fields=FILTER%FORMAT%DP%ANT%CSQT%cosmic%GMAF%AA%AF1000G%phyloP 2>&1 ERROR: 10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720, <__ANONIO__> line 5.
Also noticed now that the same error happens when the GT value is "0/."
Assuming that it would also happen with "./." and "./0", but those values weren't in the data.
Also getting that error with --individual_zyg all
with or without --process_ref_homs
Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720, <$fh> line 4410.
Thanks for reporting it. I can reproduce the error, I'll let you know when we have a fix in place.
The issue affects variants without alt alleles.
For example, this variant fails:
chr1 115258730 . C . 100 PASS . .
Alt allele: .
The code fix is available in this PR: https://github.com/Ensembl/ensembl-vep/pull/1709
@dglemos Have added this fix and tested and its all going through no errors now :)
Thanks!
I just updated to version 112 to make use of the --individual_zyg function, but i'm getting errors triggered when I run it.
My command is:
./vep --cache --dir_cache /opt/.vep/ --no_stats --check_existing --sift b --polyphen b --symbol --numbers --keep_csq --canonical --use_transcript_ref --exclude_predicted --transcript_version --protein --pubmed --variant_class --gene_phenotype --individual_zyg ind --allele_number --show_ref_allele --uploaded_allele --use_given_ref --hgvsg_use_accession --hgvs --fasta /opt/ensembl-vep/FASTA/Homo_sapiens.GRCh37.dna.primary_assembly.fa --force_overwrite --species homo_sapiens --merged --assembly GRCh37 --tab -i /input.vcf -o /output.txt
If I run it without the --individual_zyg flag, all runs as expected. But with it I get this error:
10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use an undefined value as an ARRAY reference at /opt/ensembl-vep/Bio/EnsEMBL/IO/Parser/BaseVCF4.pm line 784.
Any ideas what I missed?