Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

How to fetch JSON column header? #658

Open Hoeze opened 4 years ago

Hoeze commented 4 years ago

Hi, is there a way to get the VEP field description for a certain command-line call?

Example call: vep --no_check_variants_order --dir_cache [...] --assembly GRCh37 --format vcf --output_file STDOUT --sift b --polyphen s --af_gnomad --no_stats --cache --offline --json --merged

Is there a command to get the JSON field description? What I'd expect would be something like this:

> vep  --header-only [...] --af_gnomad --no_stats --json --merged
"assembly_name":"String",
    "allele_string":"String",
    "ancestral":"String",
    "colocated_variants":"Array"[
        "Struct"        {
            "aa_allele":"String",
            "aa_maf":Float64,
            ...
            "clin_sig":"Array"[
                "String"
            ],
            "end":Int32,
            ...
            "strand":Int32
        }
    ],
    "context":"String",
    "end":Int32,
    "id":"String",
    "input":"String",

System

bioconda ensembl-vep v98

aparton commented 4 years ago

Hi,

Thanks for your query. VEP cannot currently output a JSON schema description like this. As a more general answer: VEP attempts to 'numberify' every JSON output field whose value looks like a number, except for the following fields:

seq_region_name id gene_id gene_symbol transcript_id

Additionally, you can find information on all VEP output fields here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output
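
The 'numberify' rule described above can be illustrated with a short Python sketch. This is not VEP's own code (VEP is written in Perl): the function name `numberify` and the regular expression for "looks like a number" are assumptions made for illustration, and only the list of excluded fields comes from the reply above.

```python
import re

# Fields listed in the reply as excluded from numeric conversion.
EXCLUDED = {"seq_region_name", "id", "gene_id", "gene_symbol", "transcript_id"}

# Assumption: "looks like a number" means an optionally signed integer or
# decimal with an optional exponent. VEP's actual test may differ.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")


def numberify(value, key=None):
    """Recursively convert number-like strings, skipping excluded field names."""
    if isinstance(value, dict):
        return {k: numberify(v, k) for k, v in value.items()}
    if isinstance(value, list):
        # List elements inherit the field name of the list itself.
        return [numberify(v, key) for v in value]
    if isinstance(value, str) and key not in EXCLUDED and NUMBER_RE.fullmatch(value):
        # Strings with a decimal point or exponent become floats, the rest ints.
        return float(value) if any(c in value for c in ".eE") else int(value)
    return value


record = {
    "seq_region_name": "1",
    "start": "12345",
    "id": "42",
    "colocated_variants": [{"aa_maf": "0.25", "id": "rs1"}],
}
result = numberify(record)
```

In this sketch `start` and `aa_maf` become numbers, while `seq_region_name` and `id` stay strings even though their values look numeric.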

Kind Regards, Andrew


Hoeze commented 4 years ago

Thank you for your answer, Andrew. Would it be possible to add this feature?

Without knowing in advance which fields VEP will return, one always has to read through the whole VEP output twice: once to infer the columns, and a second time to actually read the data.

The case where this matters most is probably Hail, where the field description (vep_json_schema) has to be hand-coded.
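
To make the cost of the inference pass concrete, here is a minimal Python sketch of it. It assumes the `--json` output contains one JSON object per line; the helper names `merge` and `infer_schema` and the type labels are made up for this illustration and are not part of VEP or Hail.

```python
import io
import json


def type_name(value):
    """Map a JSON scalar to one of the type labels used in this thread."""
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Int32"
    if isinstance(value, float):
        return "Float64"
    return "String"


def merge(schema, value):
    """Fold one parsed value into the schema inferred so far."""
    if value is None:
        return schema  # a null carries no type information
    if isinstance(value, dict):
        fields = schema if isinstance(schema, dict) else {}
        for key, item in value.items():
            fields[key] = merge(fields.get(key), item)
        return fields
    if isinstance(value, list):
        element = schema[0] if isinstance(schema, list) else None
        for item in value:
            element = merge(element, item)
        return [element]
    name = type_name(value)
    if schema is None or schema == name:
        return name
    if {schema, name} == {"Int32", "Float64"}:
        return "Float64"  # widen integers to floats
    return "String"  # assumption: fall back to String on any other conflict


def infer_schema(lines):
    """Read every line once; the schema is only complete after the last one."""
    schema = {}
    for line in lines:
        if line.strip():
            schema = merge(schema, json.loads(line))
    return schema


# Two made-up records standing in for a VEP output file.
lines = io.StringIO(
    '{"id": "rs1", "start": 10, "colocated_variants": [{"aa_maf": 0.1, "strand": 1}]}\n'
    '{"id": "rs2", "start": 20, "most_severe_consequence": "missense_variant"}\n'
)
schema = infer_schema(lines)
```

Because a field such as `most_severe_consequence` may first appear in the last record, the inferred schema is only reliable once the whole file has been scanned, which is why a header or schema emitted up front would save a full pass.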

I already created a description for a number of VEP fields:

VEP field description

```json
"Struct"{
    "assembly_name":"String",
    "allele_string":"String",
    "ancestral":"String",
    "colocated_variants":"Array"[
        "Struct" {
            "aa_allele":"String",
            "aa_maf":Float64,
            "afr_allele":"String",
            "afr_maf":Float64,
            "allele_string":"String",
            "amr_allele":"String",
            "amr_maf":Float64,
            "clin_sig":"Array"[
                "String"
            ],
            "end":Int32,
            "eas_allele":"String",
            "eas_maf":Float64,
            "ea_allele":"String",
            "ea_maf":Float64,
            "eur_allele":"String",
            "eur_maf":Float64,
            "exac_adj_allele":"String",
            "exac_adj_maf":Float64,
            "exac_allele":"String",
            "exac_afr_allele":"String",
            "exac_afr_maf":Float64,
            "exac_amr_allele":"String",
            "exac_amr_maf":Float64,
            "exac_eas_allele":"String",
            "exac_eas_maf":Float64,
            "exac_fin_allele":"String",
            "exac_fin_maf":Float64,
            "exac_maf":Float64,
            "exac_nfe_allele":"String",
            "exac_nfe_maf":Float64,
            "exac_oth_allele":"String",
            "exac_oth_maf":Float64,
            "exac_sas_allele":"String",
            "exac_sas_maf":Float64,
            "id":"String",
            "minor_allele":"String",
            "minor_allele_freq":Float64,
            "phenotype_or_disease":Int32,
            "pubmed":"Array"[
                Int32
            ],
            "sas_allele":"String",
            "sas_maf":Float64,
            "somatic":Int32,
            "start":Int32,
            "strand":Int32
        }
    ],
    "context":"String",
    "end":Int32,
    "id":"String",
    "input":"String",
    "intergenic_consequences":"Array"[
        "Struct" {
            "allele_num":Int32,
            "consequence_terms":"Array"[
                "String"
            ],
            "impact":"String",
            "minimised":Int32,
            "variant_allele":"String"
        }
    ],
    "most_severe_consequence":"String",
    "motif_feature_consequences":"Array"[
        "Struct" {
            "allele_num":Int32,
            "consequence_terms":"Array"[
                "String"
            ],
            "high_inf_pos":"String",
            "impact":"String",
            "minimised":Int32,
            "motif_feature_id":"String",
            "motif_name":"String",
            "motif_pos":Int32,
            "motif_score_change":Float64,
            "strand":Int32,
            "variant_allele":"String"
        }
    ],
    "regulatory_feature_consequences":"Array"[
        "Struct" {
            "allele_num":Int32,
            "biotype":"String",
            "consequence_terms":"Array"[
                "String"
            ],
            "impact":"String",
            "minimised":Int32,
            "regulatory_feature_id":"String",
            "variant_allele":"String"
        }
    ],
    "seq_region_name":"String",
    "start":Int32,
    "strand":Int32,
    "transcript_consequences":"Array"[
        "Struct" {
            "allele_num":Int32,
            "amino_acids":"String",
            "appris":"String",
            "biotype":"String",
            "canonical":Int32,
            "ccds":"String",
            "cdna_start":Int32,
            "cdna_end":Int32,
            "cds_end":Int32,
            "cds_start":Int32,
            "codons":"String",
            "consequence_terms":"Array"[
                "String"
            ],
            "distance":Int32,
            "domains":"Array"[
                "Struct" {
                    "db":"String",
                    "name":"String"
                }
            ],
            "exon":"String",
            "gene_id":"String",
            "gene_pheno":Int32,
            "gene_symbol":"String",
            "gene_symbol_source":"String",
            "hgnc_id":"String",
            "hgvsc":"String",
            "hgvsp":"String",
            "hgvs_offset":Int32,
            "impact":"String",
            "intron":"String",
            "lof":"String",
            "lof_flags":"String",
            "lof_filter":"String",
            "lof_info":"String",
            "minimised":Int32,
            "polyphen_prediction":"String",
            "polyphen_score":Float64,
            "protein_end":Int32,
            "protein_start":Int32,
            "protein_id":"String",
            "sift_prediction":"String",
            "sift_score":Float64,
            "strand":Int32,
            "swissprot":"String",
            "transcript_id":"String",
            "trembl":"String",
            "tsl":Int32,
            "uniparc":"String",
            "variant_allele":"String"
        }
    ],
    "variant_class":"String"
}
```

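
A hand-written description like this can also be used to check parsed records against it. The sketch below is only an illustration: `SCHEMA` is a small hypothetical subset of the fields listed above, translated into Python types, and `mismatches` is a made-up helper name that is not part of VEP, Hail, or firefly.

```python
# Hypothetical subset of the field description, expressed with Python types.
# A dict describes a struct, a one-element list describes an array.
SCHEMA = {
    "id": str,
    "start": int,
    "end": int,
    "colocated_variants": [{"id": str, "aa_maf": float, "strand": int}],
    "transcript_consequences": [
        {"gene_id": str, "sift_score": float, "consequence_terms": [str]}
    ],
}


def mismatches(value, schema, path="record"):
    """Return the paths at which a parsed record does not fit the schema."""
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [path]
        found = []
        for key, item in value.items():
            if key not in schema:
                found.append(f"{path}.{key}")  # field unknown to the schema
            else:
                found.extend(mismatches(item, schema[key], f"{path}.{key}"))
        return found
    if isinstance(schema, list):
        if not isinstance(value, list):
            return [path]
        return [
            p
            for i, item in enumerate(value)
            for p in mismatches(item, schema[0], f"{path}[{i}]")
        ]
    if isinstance(value, bool):
        return [path]  # no boolean fields in this subset
    # Accept an integer where a float is declared (e.g. a frequency of 0).
    if isinstance(value, schema) or (schema is float and isinstance(value, int)):
        return []
    return [path]


# Made-up record: "end" has the wrong type, "variant_class" is not in SCHEMA.
record = {
    "id": "rs1",
    "start": 10,
    "end": "11",
    "colocated_variants": [{"id": "rs1", "aa_maf": 0, "strand": 1}],
    "variant_class": "SNV",
}
problems = mismatches(record, SCHEMA)
```

Fields that are declared but absent from a record are deliberately not reported here, since VEP only emits the fields that apply to a given variant.
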
aparton commented 4 years ago

Hi,

I've had a discussion with the team about this request and it's something that we're going to look into. Thank you for the suggestion.

Kind Regards, Andrew

Hoeze commented 4 years ago

Thanks a lot, Andrew!

Hoeze commented 4 years ago

@aparton I created a package which does the job of defining a schema by hand: https://github.com/Hoeze/firefly/blob/5b9dbed0b81ba3a84cda7ac767a43665570f5357/firefly/vep.py#L41 Maybe the YAML schema in firefly/resources/vep/ helps to move this issue forward :)

aparton commented 4 years ago

Hi @Hoeze,

Thank you for this - I'm glad you were able to resolve your issue, even if it did take a significant amount of work!

I've added your schema to the associated ticket, I'm sure it'll be useful when we look closer at developing this feature.

Kind Regards, Andrew