Using Manual String Field as 'Gene' for comp-het

MattWellie commented 2 years ago

I have a VCF which was annotated in Hail using VEP, and select annotation/CSQ fields were exported as independent INFO fields. Header extract (vep_gene_ids is a String field representing a single GeneID (ENSG), the name shouldn't have been a plural):

##INFO=<ID=SOR,Number=1,Type=Float,Description="">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="">
##INFO=<ID=VarDP,Number=1,Type=Integer,Description="">
##INFO=<ID=culprit,Number=1,Type=String,Description="">
##INFO=<ID=vep_csq,Number=1,Type=String,Description="">
##INFO=<ID=vep_gene_ids,Number=1,Type=String,Description="">
##INFO=<ID=vep_mane_select,Number=1,Type=String,Description="">
##INFO=<ID=vep_transcript_id,Number=1,Type=String,Description="">
##INFO=<ID=vep_canonical,Number=1,Type=Integer,Description="">
##INFO=<ID=vep_exon,Number=1,Type=String,Description="">
##INFO=<ID=vep_biotype,Number=1,Type=String,Description="">
##INFO=<ID=vep_hgvsc,Number=1,Type=String,Description="">
##INFO=<ID=vep_hgvsp,Number=1,Type=String,Description="">

I would like to use the comp-het functionality of Slivar, pointing at the vep_gene_ids as the Gene ID. I was flicking through the Slivar code and found the use of getEnv('CSQ_FIELD') to indicate a custom CSQ field. I re-headered the VCF with

##INFO=<ID=vep_gene_ids,Number=1,Type=String,Description="GENE">

and used export CSQ_FIELD="vep_gene_ids" to see if Slivar could find the field and treat the single entry as if it were a single CSQ field, but no luck.

Do you know a way to support the use of a custom field as a comp-het Gene field? I guess the same manual selection would also apply to using

bcftools +split-vep -d -c Consequence

To break out the CSQ mono-string into separate consequences, that could all be filtered individually before being run through comp-het.

An example use-case I have is exploding the rows based on consequence, and latching onto only those consequences which are against MANE transcripts. In this situation I don't think impactful is a sufficient representation of the clinical relevance of the variant, so splitting the single CSQ block out into individual consequences, and removing non-MANE CSQs is required.

MattWellie commented 2 years ago

I had a better look at the code and instead reheadered with

##INFO=<ID=vep_gene_ids,Number=1,Type=String,Description="Format: GENE">

and used CSQ_FIELD="vep_gene_ids". This was slightly more successful, and the gene field was recognised, but the process still failed:

[slivar] unable to find consequence field in vep_gene_ids

So it looks like if a custom field is being used as a custom CSQ field, it must contain at least one GENE, CONSEQUENCE, and TRANSCRIPT value in order for parsing to succeed, meaning that the process can't currently accept single-string Gene value fields.

MattWellie commented 2 years ago

Looking into this a bit more, the gene/transcript/consequence all factor into the decision about whether to retain a variant as a candidate for the C-H process, so it's probably asking for a bit much to allow for all/some of these fields to be accessed in novel locations.

brentp commented 2 years ago

Yes, you're correct that it needs all 3. You could construct a field that has:

MYCSQ=$GENE|$TRANSCRIPT|$IMPACT

and then make the appropriate header with Description field. Beyond that, I think it will be hard to do what you want with slivar.

MattWellie commented 2 years ago

Thanks! I think it should be possible to do that, so I'll have a crack tomorrow.

brentp commented 2 years ago

ok. then in the header, you'd need something like:

##INFO=<ID=CSQ,Number=.,Type=String,Description="MyAnno: 'Gene | Transcript | Consequence' ">

so slivar knows how to parse it.

MattWellie commented 2 years ago

@brentp (and any else who might find this), this worked for me:

created an aggregate annotation field per variant, e.g. COMPOUND_CSQ=ENSG00000103197|ENST00000219476|missense_variant

reheadered the VCF


bcftools view -h "${file_in}" | \
sed 's/##INFO=<ID=COMPOUND_CSQ,Number=1,Type=String,Description="">/##INFO=<ID=COMPOUND_CSQ,Number=1,Type=String,Description="Format: 'Gene\|Transcript\|Consequence'">/' > new_header

bcftools reheader -h new_header --threads 4 -o "${file_out}" "${file_in}"



- ran a compound het check using the custom field

`CSQ_FIELD="COMPOUND_CSQ" slivar compound-hets -p INPUT.ped -o OUT.vcf -v INPUT.vcf.bgz -a`

One note, this compound field can't be used for `impactful`? As far as I can see in the src there is no equivalent to `for f in ["ANN", "CSQ", "BCSQ", getEnv("CSQ_FIELD")]:` used in getting the Consequence in the main `expr` mode; this only exists for compound-hets. Then again this is the first time I've ever really read JS, so I could be wrong there.

Thanks for the help!

brentp / slivar

Using Manual String Field as 'Gene' for comp-het #120