Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 150 forks source link

--individual_zyg ind or all option triggers error. #1703

Closed schreyers closed 21 hours ago

schreyers commented 2 weeks ago

I just updated to version 112 to make use of the --individual_zyg function, but i'm getting errors triggered when I run it.

My command is:

./vep --cache --dir_cache /opt/.vep/ --no_stats --check_existing --sift b --polyphen b --symbol --numbers --keep_csq --canonical --use_transcript_ref --exclude_predicted --transcript_version --protein --pubmed --variant_class --gene_phenotype --individual_zyg ind --allele_number --show_ref_allele --uploaded_allele --use_given_ref --hgvsg_use_accession --hgvs --fasta /opt/ensembl-vep/FASTA/Homo_sapiens.GRCh37.dna.primary_assembly.fa --force_overwrite --species homo_sapiens --merged --assembly GRCh37 --tab -i /input.vcf -o /output.txt

If I run it without the --individual_zyg flag, all runs as expected. But with it I get this error:

10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use an undefined value as an ARRAY reference at /opt/ensembl-vep/Bio/EnsEMBL/IO/Parser/BaseVCF4.pm line 784.

Any ideas what I missed?

schreyers commented 2 weeks ago

For extra info, this is the VCF data I'm using:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr2 48030639 rs267608087;rs587782425;rs748452299 A AC 27 q30;R3x6;LowVariantFreq DP=9288;clinvar=1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic;cosmic=1|COSM308681;CSQT=1|MSH6|NM_000179.2|frameshift_variant,1|FBXO11|NM_001190274.1|downstream_gene_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:27:9166,122:9288:0.013:20:-11.6920:0.0000 chr2 209113113 rs121913499 G A 100 PASS DP=5374;clinvar=1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|likely_pathogenic,1|pathogenic;cosmic=1|COSM28747;phyloP=3.794;CSQT=1|IDH1|NM_005896.3|missense_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:100:4280,1092:5374:0.203:20:-100.0000:0.0037 chr3 38184515 . TTG T 23 q30;LowVariantFreq DP=3950;cosmic=1|COSN28015016;CSQT=1|MYD88|NM_001172567.1|downstream_gene_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:23:3893,57:3950:0.014:20:-12.2691:0.0000

I also do some custom changes to merge the FORMAT column into the INFO column

Thanks :)

dglemos commented 2 weeks ago

Hi @schreyers, The problem is in your input file. In your VEP command you specify the individual ind (--individual_zyg ind) however, your file does not have the ind in the header.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

It should be: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind

schreyers commented 2 weeks ago

Thanks :)

So does the ZYG column not get added in like other columns?

I tried to specify "all" but got the same error.

Is there a way to get the ZYG column without changing the input file? Or if I just add an "ind" column with no data, will that solve the issue?

dglemos commented 2 weeks ago

The data is already in the input file, the only thing missing is the column's name ind.

Using the second variant from your example:

CHROM: chr3
POS: 38184515
ID: .
REF: TTG
ALT: T
QUAL: 23
FILTER: q30;LowVariantFreq
INFO: DP=3950;cosmic=1|COSN28015016;CSQT=1|MYD88|NM_001172567.1|downstream_gene_variant
FORMAT: GT:GQ:AD:DP:VF:NL:SB:NC
<missing column name>: 0/1:23:3893,57:3950:0.014:20:-12.2691:0.0000
schreyers commented 2 weeks ago

Ah ok, previously there was the case number there but i took that out for privacy.

Ok, I'll give that a go :)

Much appreciated!

schreyers commented 1 week ago

@dglemos sorry, it is working (sometimes). Had some small issues before, but resolved now (sometimes).

I was getting this error:

10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720.

So the script to merge the FORMAT data into the INFO column, and also to add the "ind" column with the original data is:

$server_merge_format = 'gawk \'BEGIN { OFS="\t" } 
{
    if ($0 ~ /^#/) {
        # Modify the header line
        if ($0 ~ /^#CHROM/) {
            n = split($0, headers, OFS);
            headers[n] = "ind";
            $0 = headers[1];
            for (i = 2; i <= n; i++) {
                $0 = $0 OFS headers[i];
            }
            print;
        } else {
            print;
        }
    } else {
        # Capture INFO and FORMAT fields and merge them appropriately
        info = $8;
        format_header = $9;
        format_values = $10;
        format_values_orig = $10;

        # Replace commas with hashtags in format_values, this stops data going missing.
        gsub(",", "#", format_values);

        # Ensure there are no line breaks or extra spaces in info
        gsub(/[\r\n\t]/, "", info);

        # Construct the new INFO field. The "|" is used instead of comma to stop errors.
        new_info = info ";FORMAT=" format_header "|" format_values;

        # Ensure there are no line breaks or extra spaces in new_info
        gsub(/[\r\n\t]/, "", new_info);

        # Construct the new line
        new_line = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6 OFS $7 OFS new_info OFS format_header OFS format_values_orig;

        print new_line;
    }
}\' '. $i_input . ' > ' . $i_input_merged;
schreyers commented 1 week ago

@dglemos Sorry to keep going back and forth with this.

I can't seem to see the difference in data that works and ones that throw an error.

This one throws an error:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind

chr1 115258730 . C . 100 PASS DP=3582;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/0:100:3579:3582:0.001:20:-100.0000:0.0031;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/0:100:3579:3582:0.001:20:-100.0000:0.0031 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:100:3579:3582:0.001:20:-100.0000:0.0031

This one works:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind

chr2 48030639 rs267608087;rs587782425;rs748452299 A AC 27 q30;R3x6;LowVariantFreq DP=9288;clinvar=1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic,1|pathogenic;cosmic=1|COSM308681;CSQT=1|MSH6|NM_000179.2|frameshift_variant,1|FBXO11|NM_001190274.1|downstream_gene_variant;FORMAT=GT:GQ:AD:DP:VF:NL:SB:NC|0/1:27:9166#122:9288:0.013:20:-11.6920:0.0000 GT:GQ:AD:DP:VF:NL:SB:NC 0/1:27:9166,122:9288:0.013:20:-11.6920:0.0000

dglemos commented 1 week ago

For your first variant chr1 115258730 C . the genotype is 0/0 which means it is homozygous for the reference allele. Maybe this can help you understand a bit more about why the file has variants with this genotype: https://www.biostars.org/p/349323/ VEP will skip 'variants' like this.

schreyers commented 1 week ago

Ok, that does make sense thanks :)

So would adding the --quiet flag solve this issue?

If not thats ok, as that article you sent says that you could just get the ZYG value from the GT variable

dglemos commented 1 week ago

Can you please try with the option --process_ref_homs?

schreyers commented 1 week ago

Gave that a go but still got the same error:

ANNOTATION failed: 255 ANNOTATION CMD: ./vep --cache --dir_cache /opt/.vep/ --no_stats --check_existing --sift b --polyphen b --symbol --numbers --keep_csq --canonical --use_transcript_ref --exclude_predicted --transcript_version --protein --pubmed --variant_class --gene_phenotype --individual_zyg ind --allele_number --show_ref_allele --uploaded_allele --use_given_ref --hgvsg_use_accession --process_ref_homs --hgvs --fasta /opt/ensembl-vep/FASTA/Homo_sapiens.GRCh37.dna.primary_assembly.fa --force_overwrite --species homo_sapiens --merged --assembly GRCh37 --tab -i /home/VEPTEST_CSQT_sorted.vcf -o /home/VEPTEST_CSQT.txt --custom file=/home/VEPTEST_CSQT.vcf.gz,format=vcf,short_name=EXTRA,fields=FILTER%FORMAT%DP%ANT%CSQT%cosmic%GMAF%AA%AF1000G%phyloP 2>&1 ERROR: 10 Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm line 284. Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720, <__ANONIO__> line 5.

schreyers commented 1 week ago

Also noticed now that the same error happens when the GT value is "0/."

Assuming that it would also happen with "./." and "./0", but those values weren't in the data.

fozzzgate commented 1 week ago

Also getting that error with --individual_zyg all with or without --process_ref_homs

Can't use string ("1") as a HASH ref while "strict refs" in use at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 720, <$fh> line 4410.

dglemos commented 5 days ago

Thanks for reporting it. I can reproduce the error, I'll let you know when we have a fix in place.

dglemos commented 4 days ago

The issue affects variants without alt alleles. For example, this variant fails: chr1 115258730 . C . 100 PASS . . Alt allele: .

The code fix is available in this PR: https://github.com/Ensembl/ensembl-vep/pull/1709

schreyers commented 21 hours ago

@dglemos Have added this fix and tested and its all going through no errors now :)

Thanks!