exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Add support for STR prioritisation from ExpansionHunter calls #563

Open julesjacobsen opened 3 weeks ago

julesjacobsen commented 3 weeks ago

ExpansionHunter is used at Genomics England to detect short tandem repeat (STR) expansions from short-read sequencing data. This is the example output from its documentation: https://github.com/Illumina/ExpansionHunter/blob/master/docs/06_OutputVcfFiles.md#example

The following VCF entry describes the state of the C9orf72 repeat in a sample with name/barcode LP6005616-DNA_A03.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005616-DNA_A03
chr9    27573526        .       C       <STR2>,<STR349> .       PASS    SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459

This line tells us that the first allele spans 2 repeat units while the second allele spans 349 repeat units. The repeat unit is GGCCCC (RU INFO field), so the sequence of the first allele is GGCCCCGGCCCC and the sequence of the second allele is GGCCCC x 349. The repeat spans 3 repeat units in the reference (REF INFO field), which matches the reference repeat length of 18 bp (RL INFO field). The length of the short allele was estimated from spanning reads (SPANNING) while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (323, 376). There are 19 spanning and 3 flanking reads consistent with the repeat allele of size 2 (that is, 19 reads fully contain the repeat of size 2 and 3 flanking reads overlap at most 2 repeat units). Also, there are 6 flanking and 459 in-repeat reads consistent with the repeat allele of size 349.
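To make the field layout concrete, here is a minimal Python sketch of how a record like the one above could be unpacked into per-allele values. It is only an illustration, not Exomiser code: the names `StrAllele`, `StrCall` and `parse_str_record` are made up for this example, and it assumes a single-sample, tab-delimited record in which every per-allele FORMAT value is present and separated by `/`.

```python
from dataclasses import dataclass


@dataclass
class StrAllele:
    repeat_count: int    # CN: estimated number of repeat units
    ci_low: int          # CI: lower bound of the confidence interval
    ci_high: int         # CI: upper bound of the confidence interval
    support: str         # SO: SPANNING, FLANKING or INREPEAT
    spanning_reads: int  # AD_SP
    flanking_reads: int  # AD_FL
    inrepeat_reads: int  # AD_IR


@dataclass
class StrCall:
    chrom: str
    pos: int
    end: int
    repeat_unit: str       # RU
    ref_repeat_count: int  # REF (INFO field, not the REF column)
    repeat_id: str         # REPID
    alleles: list


def parse_str_record(line):
    """Parse one single-sample ExpansionHunter VCF data line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    info = {}
    for item in cols[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    # Pair each FORMAT key with the matching sample value.
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    keys = ("SO", "CN", "CI", "AD_SP", "AD_FL", "AD_IR")
    per_allele = {key: sample[key].split("/") for key in keys}
    alleles = []
    for i, count in enumerate(per_allele["CN"]):
        low, high = per_allele["CI"][i].split("-")
        alleles.append(StrAllele(
            repeat_count=int(count),
            ci_low=int(low),
            ci_high=int(high),
            support=per_allele["SO"][i],
            spanning_reads=int(per_allele["AD_SP"][i]),
            flanking_reads=int(per_allele["AD_FL"][i]),
            inrepeat_reads=int(per_allele["AD_IR"][i]),
        ))
    return StrCall(
        chrom=cols[0],
        pos=int(cols[1]),
        end=int(info["END"]),
        repeat_unit=info["RU"],
        ref_repeat_count=int(info["REF"]),
        repeat_id=info["REPID"],
        alleles=alleles,
    )


# The example record from the ExpansionHunter documentation, rebuilt with tabs.
RECORD = "\t".join([
    "chr9", "27573526", ".", "C", "<STR2>,<STR349>", ".", "PASS",
    "SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS",
    "GT:SO:CN:CI:AD_SP:AD_FL:AD_IR",
    "1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459",
])

call = parse_str_record(RECORD)
for allele in call.alleles:
    print(call.repeat_id, call.repeat_unit, allele.repeat_count,
          (allele.ci_low, allele.ci_high), allele.support)
```

A real implementation would also need to handle missing values (`.`), hemizygous calls with a single allele, and multi-sample files, none of which this sketch covers.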

PanelApp has information on the pathogenicity of STRs, e.g. https://panelapp.genomicsengland.co.uk/panels/entities/C9orf72_GGGGCC
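As a rough sketch of how such thresholds might feed into prioritisation, the confidence interval of a called allele could be compared against a normal and a pathogenic repeat-count cut-off. The function name `classify_repeat`, its parameters, and the category labels are all hypothetical, and the cut-off values used in the example calls are placeholders chosen for illustration, not values taken from PanelApp.

```python
def classify_repeat(ci_low, ci_high, normal_max, pathogenic_min):
    """Classify a repeat-count confidence interval against two cut-offs.

    normal_max and pathogenic_min are supplied by the caller, for example
    from a curated source; nothing here is specific to any one locus.
    """
    if ci_low >= pathogenic_min:
        return "pathogenic"    # whole interval is at or above the pathogenic cut-off
    if ci_high <= normal_max:
        return "normal"        # whole interval is at or below the normal cut-off
    if ci_low > normal_max and ci_high < pathogenic_min:
        return "intermediate"  # whole interval sits between the two cut-offs
    return "uncertain"         # interval straddles a cut-off


# Placeholder cut-offs, NOT real PanelApp values.
NORMAL_MAX = 20
PATHOGENIC_MIN = 60

print(classify_repeat(2, 2, NORMAL_MAX, PATHOGENIC_MIN))
print(classify_repeat(323, 376, NORMAL_MAX, PATHOGENIC_MIN))
```

Using the confidence interval rather than the point estimate keeps calls whose interval straddles a cut-off from being reported as confidently normal or pathogenic.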