brentp / slivar

genetic variant expressions, annotation, and filtering for great good.
MIT License
247 stars 23 forks source link

VEP annotations other than impact? #146

Open karynne7 opened 1 year ago

karynne7 commented 1 year ago

Is there a way to pull other annotations from the CSQ field other than the built in INFO.impactful? I see that some of the flags need to be integers or flags, but I have a loftee annotations with "HC" or "LC" I'd like to filter on. (Yes, it could be turned into a flag, but there are other annotations less binary too)

I was thinking about bcftools view -i 'INFO/CSQ[70]=="HC"' but that didn't work, and I'm not sure it would capture the whole annotation if there were multiple transcripts. Happy to write my own work around, but figured I'd double check with you first! Thanks!

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|NEAREST|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|existing_InFrame_oORFs|existing_OutOfFrame_oORFs|existing_uORFs|five_prime_UTR_variant_annotation|five_prime_UTR_variant_consequence|REVEL|SpliceRegion|AF_gnomadall|ALT|AUG_eventAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakvar_frameAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakvar_stop_locAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakvar_kozakAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakref_frameAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakref_stop_locAUG_event|var_frame|var_stop_loc|var_kozak|ref_frame|ref_stop_loc|ref_kozakref_kozak|CADD_PHRED|CLNSIG|CLNSIGINCL|CLNVC|CLNVI|Delta_G4|Delta_dsRNA|FREQ_exomes|FREQ_genomes|FREQ_gnomadall|G4|GENEINFO|Literature_source|MC|NCboost_chr_rank_perc|PMID|REF|ReMM_probability|STOP_eventSTOP_event|var_preceeding_start_loc|ref_preceeding_start_locvar_preceeding_start_locSTOP_event|var_preceeding_start_loc|ref_preceeding_start_locref_preceeding_start_loc|TE|TE_log2fold|canonical|chr|distance|dsRNA|gene_name|pos|ref|strand|transcript|transcript_ver|type|var">
1   138852  .   C   T   .   .   AC=1;AN=700;HPO_CT=1;GENE=LOC729737;MRNA=NR_039983.2;FXN=non-coding-exon;HGVS_CDNA=n.1395G>A;HGVS_PROT=.;ESP_AF=0;GNOMAD_AF=0.000106689;G1K_AF=0;CADD=27.2;nhet_aff=1;nhet_unaff=0;nhomalt_aff=0;nhomalt_unaff=0;nhet_female_aff=0;nhet_female_unaff=0;nhomalt_female_aff=0;nhomalt_female_unaff=0;nhet_male_aff=1;nhet_male_unaff=0;nhomalt_male_aff=0;nhomalt_male_unaff=0;maxAAF=0.00173611;bravo_AF=0.000615815;CSQ=T|stop_gained|HIGH|AL627309.1|ENSG00000237683|Transcript|ENST00000423372|protein_coding|1/2||ENST00000423372.3:c.458G>A|ENSP00000473460.1:p.Trp153Ter|528/2661|458/780|153/259|W/*|tGg/tAg|rs540391832||-1||SNV|Clone_based_ensembl_gene||YES||||ENSP00000473460||R4GN28&B7Z7W4|UPI0002C88512||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||HC|||GERP_DIST:0&BP_DIST:1223&PERCENTILE:0.587179487179487179&DIST_FROM_LAST_EXON:374&50_BP_RULE:PASS&PHYLOCSF_TOO_SHORT|||||||||||||||||||||||||||||||||||||||||||,T|downstream_gene_variant|MODIFIER|CICP27|ENSG00000233750|Transcript|ENST00000442987|processed_pseudogene||||||||||rs540391832|4016|1||SNV|HGNC|48835|YES|||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|downstream_gene_variant|MODIFIER|RP11-34P13.13|ENSG00000241860|Transcript|ENST00000484859|antisense||||||||||rs540391832|2622|-1||SNV|Clone_based_vega_gene||YES|||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|downstream_gene_variant|MODIFIER|RP11-34P13.13|ENSG00000241860|Transcript|ENST00000490997|antisense||||||||||rs540391832|3956|-1||SNV|Clone_based_vega_gene|||||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|downstream_gene_variant|MODIFIER|RP11-34P13.14|ENSG00000239906|Transcript|ENST00000493797|antisense||||||||||rs540391832|938|-1||SNV|Clone_based_vega_gene||YES|||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|RP11-34P13.15|ENSG00000268903|Transcript|ENST00000494149|processed_pseudogene||||||||||rs540391832|2957|-1||SNV|Clone_based_vega_gene||YES|||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|RP11-34P13.16|ENSG00000269981|Transcript|ENST00000595919|processed_pseudogene||||||||||rs540391832|887|-1||SNV|Clone_based_vega_gene||YES|||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000249765|CTCF_binding_site||||||||||rs540391832||||SNV||||||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000918299|TF_binding_site||||||||||rs540391832||||SNV||||||||||||AL627309.1||||||0.0008|0.003|0|0|0|0|||0.0001265|0.002325|0|0|0|0|0|0|9.197e-05|0.003|AFR|||||||||||||||||||||||||||||||||||||||||||||||||||||||;SpliceAI=T|AL627309.1|0.00|0.00|0.00|0.00|-2|-14|-2|-5
brentp commented 1 year ago

For now, I think it's better to do this type of thing with vembrane from @tedil as it can do this directly already.

For slivar, it's not yet implemented. This is more than trivial in part because CSQ is actually a nested array--an array for each transcript. So you'd need something like (this is not implemented, just brainstorming):

INFO.CSQ[0].LoF == "HC" || INFO.CSQ[1].LoF == "HC"

there's not information about what's a string and what's not, so for numeric, you'd still need to do:

parseFloat(INFO.CSQ[0].gnomAD_AMR_AF) < 0.01 ...

but,

karynne7 commented 1 year ago

Thank you, Brent! I'll look into it!

jxchong commented 1 year ago

Hey @brentp just checking in if you've ever implemented CSQ filtration in slivar or will in the near future. We're at a point where we need to move away from GEMINI but it's been a bit of a headache to figure out how to port over all our annotation filters to a slivar-based pipeline.

brentp commented 1 year ago

Hi Jessica. I'm looking into this now. I have started a branch here to explore: https://github.com/brentp/slivar/tree/csq I should have progress in the next 3 weeks to know at least if it's feasible, if not have it completed.

brentp commented 1 year ago

Hi, can you try out this file: https://raw.githubusercontent.com/brentp/slivar/83591adf2d154c45ecf5e64c87e4ba1635e874b6/js/csq.js ?

then you can do commands like:

./slivar expr \
    --js js/csq.js \
    -v tests/test.vcf \
    --pass-only \
    --info "CSQs(INFO.CSQ, VCF.CSQ, ['PolyPhen', 'SIFT']).some(function(csq) { return csq.SIFT > 10 })" \
   -o high-sift.vcf

so the expression is: CSQs(INFO.CSQ, VCF.CSQ, ['PolyPhen', 'SIFT']).some(function(csq) { return csq.SIFT > 10 })

note that CSQs is a function that returns a javascript array of CSQ objects. With keys always capitalized. The first and 2nd arg to that function will always be the same (unless your field is e.g. ANN from snpEff or BCSQ from bcftools), the final argument is a list of fields in the CSQ string that will be treated as numeric, so should include and allele frequency fields.

Because this gives an array, you can use methods like .some() which accepts a closure for each CSQ object. Here's an example getting only missense variants from BCSQ:

'BCSQ' in INFO && CSQs(INFO.BCSQ, VCF.BCSQ, []).some(function(csq) { return csq.CONSEQUENCE == 'missense' })

Note also that duktape javascript engine used by slivar does not support "fat arrow" functions like .some(csq => csq.CONSEQUENCE == "missense").

Please let me know anything that's missing from this. I'm sure there's more we can do to make it ergonomic. If you want to use js/slivar-functions.js and js/csq.js, for now, you'll have to concatenate them yourself.

jxchong commented 1 year ago

Thanks. We're definitely slowly struggling our way through the new syntax. :)

Can you provide an example of how we would, e.g. filter for variants that both pass INFO.genic AND have a SIFT <0.05?

brentp commented 1 year ago

Can you provide an example of how we would, e.g. filter for variants that both pass INFO.genic AND have a SIFT <0.05?

--info "INFO.genic && CSQs(INFO.CSQ, VCF.CSQ, ['SIFT']).some(function(csq) { return csq.SIFT < 0.05 })"

so for CSQ, it will always be:

CSQs(INFO.CSQ, VCF.CSQ, array_of_numeric_fields).some(function(csq) { return [expression here that uses *csq*] })

so you'll change:

The some is like python's any it just checks if the expression is true for any transcript.

brentp commented 1 year ago

I documented this in more detail here: https://github.com/brentp/slivar/wiki/CSQ

wwgordon commented 1 year ago

Hi @brentp ,

Thanks so much for all your help and clarification. I'm trying to sort out some of the nuances of how these queries work and am hoping you could clear something up for me. As an example:

--info "(INFO.highest_impact_order < ImpactOrder.synonymous) && variant.FILTER == 'PASS'" 

This will work for me, I presume because "synonymous_variant" is a Consequence present in my VCF and Slivar is smart enough to match "synonymous" to "synonymous_variant."

However, this will not work for me:

--info "(INFO.highest_impact_order < ImpactOrder.5_prime_UTR) && variant.FILTER == 'PASS'" \

...producing the following error:

Error: unhandled exception: SyntaxError: invalid number literal (line 1)

This is surprising to me, because "5_prime_UTR_variant" is present as a Consequence in my VCF and I would expect the matching to work similarly as the "synonymous" example.

Finally, this also does not work for me:

--info "(INFO.highest_impact_order < ImpactOrder.TFBS_ablation) && variant.FILTER == 'PASS'"

Producing many of the following errors:

[slivar] error evaluating info expression (this can happen if a field is missing):
error from duktape: unknown attribute:TFBS_ablation for expression:(INFO.highest_impact_order < ImpactOrder.TFBS_ablation) && variant.FILTER == 'PASS'

I presume because there are no "TFBS" related Consequences in my VCF. I include this example because I would expect the query to look for all of the Consequence terms above "TFBS_ablation" in the default-order.txt file, many of which are present. Is it expected behavior that the "threshold Consequence" must be present in the VCF? This is a secondary question, I am primarily interested in understanding how Slivar handles the "_variant" suffix.

Thanks again!

brentp commented 1 year ago

Hi, for this one:

--info "(INFO.highest_impact_order < ImpactOrder.5_prime_UTR) && variant.FILTER == 'PASS'" 

This is because ImpactOrder is a javascript object and a property can't start with a number. We can see the same problem in the javascript console for: obj = {'b': 1, '5_prime': 22} where obj.b is fine, but obj.5_prime gives an error. You can get around this with obj['5_prime'], or in your case:

"(INFO.highest_impact_order < ImpactOrder['5_prime_utr']) ... "

Note that lower-case "utr"

for TFBS_ablation try tfbs_ablation instead. All impacts will be lower-cased. I didn't document this anywhere. I'll do so now!

wwgordon commented 1 year ago

I seem to be having trouble with the named indices. "(INFO.highest_impact_order < ImpactOrder['5_prime_utr']) ... " produces the Error: unhandled exception: SyntaxError: expected identifier (line 1) error. Notably, I get the same error when trying to use "synonymous" as a threshold, which was working before with the original notation:

./slivar expr \
    --vcf $vcf --ped $ped \
    --trio "${moi}:kid.alts == 2 && kid.GQ >= 20 \
                && mom.alts == 1 && mom.GQ >= 20 \
                && dad.alts == 1 && dad.GQ >= 20" \
    --info "variant.FILTER == 'PASS' \
                 && INFO.highest_impact_order < ImpactOrder.synonymous" \
    --pass-only > $intermediate 2> ./slivar_stderr.txt

... works, but

./slivar expr \
    --vcf $vcf --ped $ped \
    --trio "${moi}:kid.alts == 2 && kid.GQ >= 20 \
                && mom.alts == 1 && mom.GQ >= 20 \
                && dad.alts == 1 && dad.GQ >= 20" \
    --info "variant.FILTER == 'PASS' \
                 && INFO.highest_impact_order < ImpactOrder.['synonymous']" \
    --pass-only > $intermediate 2> ./slivar_stderr.txt

...does not work, producing the same error as above.

The lower case ImpactOrder.tfbs_ablation works beautifully, thanks!

brentp commented 1 year ago

instead of ImpactOrder.['synonymous'], you want ImpactOrder['synonymous']

wwgordon commented 1 year ago

Aha, perfect. Thanks!