Closed Relequestual closed 7 years ago
Thanks for taking a look! Hmm, perhaps I'm mistaken, but I think deletions are represented in VCF with a longer reference than alt allele. An alt of "." actually corresponds to no variant: http://www.internationalgenome.org/wiki/Analysis/vcf4.0/
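To make the convention concrete, here is a rough sketch of how a ref/alt pair could be read under VCF-style rules. The `classify` helper and the example alleles are hypothetical, made up for illustration; it only handles a single alt allele and ignores symbolic alleles.

```python
def classify(ref: str, alt: str) -> str:
    """Classify a VCF-style ref/alt pair (simplified: one alt allele only)."""
    if alt == "." or alt == ref:
        # In VCF, an alt of "." means no alternate allele was observed.
        return "no variant"
    if len(ref) > len(alt):
        # Deletions keep an anchoring base, so ref is longer than alt.
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "substitution"


# Deleting "CG" after an anchoring "A": ref "ACG", alt "A".
print(classify("ACG", "A"))
print(classify("A", "."))
```

So a deletion would be sent as something like ref "ACG" / alt "A", rather than with "." or "-" in the alt field.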
Right you are. I think I probably hadn't realised the spec said VCF, and/or realised that we were storing these differently. Currently we replace dots with dashes, but that doesn't seem to be part of the spec either.
I guess this means DECIPHER is also returning variants with deletions in the wrong format. Although I'm not sure any other system CURRENTLY uses ref and alt to calculate similarity scores. I'll add a ticket to fix this.
As I mentioned briefly on our MME call yesterday, we have been talking internally about variant normalisation. I think generally we feel the VCF representation is normalised. We are investigating whether we should normalise our database or not, AND whether we normalise on input. Quite a tricky one!
Variant normalization is such an important and often-neglected thing. We're still running into a lot of issues with AF lookups for indels failing and causing common variants to appear novel.
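As a rough illustration of how such a lookup can miss, here is a minimal sketch. The `trim` helper, the `af_table` dictionary, and every value in it are hypothetical. It only trims shared bases; full normalization would also left-align against the reference sequence, which this does not attempt.

```python
from typing import Tuple


def trim(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove redundant shared bases from a ref/alt pair (no left-alignment)."""
    # Drop shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical allele-frequency table keyed by (pos, ref, alt); values are made up.
af_table = {(100, "ACG", "A"): 0.12}

# The same deletion, written with one extra shared trailing base.
query = (100, "ACGT", "AT")

print(query in af_table)         # exact lookup on the raw representation
print(trim(*query) in af_table)  # lookup after trimming
```

The raw query misses the table entry even though it describes the same deletion; after trimming, both sides use the same key.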
P.S. Going to close this issue for now. If you see anything else in the schema that looks off, please do holler.
I didn't know until recently how confusing and misleading non-normalised variation can be! I actually think that a standard means of normalisation might be a really good use case for the gateway server. (Although saying that, not everyone is sending allele-level data anyway.)
Variant data will almost certainly become increasingly common, and normalization is going to be more-and-more important. One challenge with normalization is that, unless the database is also normalized using the exact same method, the normalized query still might miss. Perhaps the approach to take is that of enumerating all synonyms? I believe that is what BRCA exchange is doing. They might even have a service for it already.
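One way to think about synonyms, sketched here under my own assumptions (this is not a description of what BRCA Exchange does): two representations are equivalent if applying each of them to the reference yields the same sequence. The `apply_variant` helper and the toy reference string below are hypothetical.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply a variant (1-based pos) to a reference string and return the result."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


# Toy reference containing an "AC" repeat; not real genomic sequence.
reference = "GCACACAT"

# Two ways of writing "delete one AC unit from the repeat".
left_placed = (2, "CAC", "C")
right_placed = (4, "CAC", "C")

same = apply_variant(reference, *left_placed) == apply_variant(reference, *right_placed)
print(same)
```

Both placements produce the same resulting sequence, so an exact match on (pos, ref, alt) would treat one variant as two unless both sides pick the same placement.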
Would be interesting to hear from the BRCA exchange people on this matter. As for normalisation, once a variation is normalised, it can be changed to a different normalisation structure / method if required. We are trying to work out if / how we should / can normalise our variants... but then what about variants from external systems? Clinvar, exac, etc. Even if we DID normalise those externally sourced variants, when someone then goes back to the source, it's going to look different... =/
At https://github.com/MatchmakerExchange/reference-server/blob/master/mme_server/schemas/api.json#L136-L138, the alternative base should include deletion as a possibility. As such, the regex should be
^([ATCG\\.]+)$
Otherwise, looks great. Took me a while to remember why pattern properties is set in the way you have, but that makes sense! Additional properties must start with an underscore =]
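For anyone wanting to check what the proposed pattern accepts, here is a quick sketch. It assumes the pattern sits in a JSON string as written above (so the doubled backslash decodes to a single one); the `schema_fragment` wrapper and the `accepts` helper are hypothetical, not the actual layout of api.json.

```python
import json
import re

# The pattern as it would appear inside a JSON schema file (hypothetical wrapper object).
schema_fragment = json.loads(r'{"pattern": "^([ATCG\\.]+)$"}')
pattern = re.compile(schema_fragment["pattern"])


def accepts(allele: str) -> bool:
    """Return True if the allele string matches the proposed pattern."""
    return pattern.match(allele) is not None


for allele in ["A", "ACG", ".", "A.C", "-", ""]:
    print(repr(allele), accepts(allele))
```

Note that because "." sits inside the character class, the pattern also accepts mixed strings such as "A.C", while "-" and the empty string are rejected.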