Revisit processing variant without names in the repeat expansion pipeline

tskir commented 3 years ago

Follow up from #247.

The repeat expansion pipeline has two ways to determine whether a variant is a repeat expansion event:

By looking at the variant's name and trying to decode the HGVS-like notation
By looking at the difference of provided REF and ALT alleles

This means that variants without a name, in general, could and should still be processed by the repeat expansion pipeline. In particular, the variant cited in the original issue contains allele sequences, which means it should be treated as a repeat expansion variant:

<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="9" Accession="NC_000009.12" start="117724889" stop="117724890" display_start="117724889" display_stop="117724890" variantLength="12" positionVCF="117724889" referenceAlleleVCF="T" alternateAlleleVCF="TACACACACACAC" />

apriltuesday commented 3 years ago

@tskir I've done some more digging on this, my sense is that we can use the REF and ALT, but I don't know a good algorithm to determine the repeating unit, which will make it hard to distinguish between trinucleotide_repeat_expansion and short_tandem_repeat_expansion. (It's entirely possible this is included somewhere else and I just didn't see it, in which case... whoops, my bad.) If there's a reliable way of doing so then it could be a good option to have available.

A viable possibility for the variants mentioned in the original ticket is to use HGVS identifiers that are in the measure but not provided as variant names, which can be parsed in the same way we do variant names. There are generally multiple of these in a measure and I'm guessing we need to use a RefSeq one, but I'm not sure beyond that.

<AttributeSet>
  <Attribute Accession="LRG_320" Type="HGVS, genomic, LRG">LRG_320:g.25489CA[28]</Attribute>
</AttributeSet>
<AttributeSet>
  <Attribute Accession="NG_011475" Version="2" Change="g.25489CA[28]" Type="HGVS, genomic, RefSeqGene">NG_011475.2:g.25489CA[28]</Attribute>
</AttributeSet>
<AttributeSet>
  <Attribute Accession="NC_000009" Version="12" Change="g.117724891CA[28]" Type="HGVS, genomic, top level" integerValue="38">NC_000009.12:g.117724891CA[28]</Attribute>
</AttributeSet>
<AttributeSet>
  <Attribute Accession="NC_000009" Version="11" Change="g.120487169CA[28]" Type="HGVS, genomic, top level, previous" integerValue="37">NC_000009.11:g.120487169CA[28]</Attribute>
</AttributeSet>
<AttributeSet>
  <Attribute Type="HGVS, protein" />
</AttributeSet>

I'm writing up a PR with this second approach, but I'm curious what your thoughts are on the first approach, as well as which HGVS I should be using.

apriltuesday commented 3 years ago

In case concrete examples are helpful, here are the seven in question in chr_pos_ref_alt form:

11_88291569_T_TGAAAGAAAGAAAGAAAGAAA
15_72340841_T_TCACACACACACA
15_72340841_T_TCACACACACACACACA
9_117724889_T_TACACACACACAC
7_101247300_A_AAAAAAAAATATAT
4_38792853_T_TTATATATATATATATATATATATATATATATA
20_31663045_T_TATATATATATATACACACACATATATATACATATATACATATATATACACACACATATACACACACACACACACACAC

M-casado commented 3 years ago

Hi, @apriltuesday.

Regarding the mentioned issue:

Making use of REF and ALT alleles would probably doable. The issue is that inferring a repeated event involves an additional step that requires, most likely, an external tool that does so. I found resources like the Tandem Repeats Finder (TRF) that, for what I've seen, revolves around multiple sequence alignments (MSA) shifting the sequence to find 1-6 nucleotide k-mers. These tools, nevertheless, tend to work well taking into account the context (adjacent sequences). I have not used them with the alternative allele alone, but with a considerable sequence length, they could do us a favour.
Using HGSV identifiers seems a bit more straightforward and easier. Regarding what type of HGSV to use, I would first take a look at what identifiers are consistently given within all ClinVar XMLs (at least those without the Name node). As far as I have seen, most are standard and interchangeable (to uniquely identify a gene), so a priori it doesn't seem to be a big issue. If it serves of inspiration, NCBI only accepts sequence identifiers, CCDS identifiers (unique version and number for annotated genes), or LRG sequence identifiers (created specifically for clinical reporting by manual curation). We could make use of regular expressions to match short tandems (e.g. LRG_320:g.25489CA[10]). Nevertheless, if the tandems are slightly different or interrupted (e.g. LRG_320:g.25489CA[5]CG[2]CA[3]) we would end up again having to resort to MSAs.

apriltuesday commented 3 years ago

Thanks @M-casado, this is useful input! I agree that probably using REF and ALT properly is something best left as a last resort.

For parsing HGSV, I probably should've mentioned for context that we do have regexes to parse these already in the code, though I don't think it handles things like LRG_320:g.25489CA[5]CG[2]CA[3] flawlessly. If the types are essentially interchangeable I'm inclined to just choose one arbitrarily, among the variants in question it looks like HGVS, genomic, top level is always present, and LRG and RefSeqGene occasionally present (see ClinVar nomenclature here).

tcezard commented 3 years ago

I had a look at this paper describing an algorithm for SSR detection. It has the advantage of having a python implementation that we could leverage but unfortunately it seems to work from precomputed list of repeat which would not work in our case.

EBIvariation / CMAT

Revisit processing variant without names in the repeat expansion pipeline #262