ACEnglish / truvari

Structural variant toolkit for VCFs
MIT License

Question: meaning of the parameter MINHAPLEN #224

Closed: leone93 closed this 2 months ago

leone93 commented 2 months ago

Following the title, I'm not sure what this parameter means, how it works, or whether it is related to SIZEMIN. Could you please explain it in a bit more detail, with an example?

Thank you, Adam. Have a nice day.
Leo

ACEnglish commented 2 months ago

This is a parameter I wouldn't recommend you use. I've included a description of what it does below if you're curious. In general, as long as you don't provide --reference to bench or collapse, truvari will not use the --minhaplen parameter. The 'unroll' sequence comparison technique (i.e. not using --reference) is a more accurate measure of sequence similarity.

Originally, truvari created a 'shared reference context' between variants to help improve the measurement of sequence similarity. However, this would inflate the sequence similarity between variants. For example, two non-overlapping deletions that are 100bp apart would have 100bp of reference sequence shared between their two haplotypes. This approach was designed to help deal with tandem repeat sequence contexts.

The minhaplen parameter would ensure that the sequences being compared were of a minimum length. For example, consider these two deletions:

REF   ATCATCATC
D1    AT---CATC
D2    ATCAT---C

D1 Hap: CAT
D2 Hap: CAT

So each deletion leaves the same shared reference sequence. But we found that if little sequence was pulled from the reference (in this example, just 3bp), the haplotype similarity became overly sensitive to small differences. So minhaplen would ensure that a minimum length of the variants' sequence context was considered. With --minhaplen 5, the above example would expand from considering only 3bp to 5bp and become:

D1 Hap: TCATC
D2 Hap: TCATC

Which doesn't make a difference for this contrived example, but for many typical cases it would.
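For anyone who wants to reproduce the example above, here is a minimal Python sketch. It is not Truvari's actual code: the function names, the 0-based half-open coordinates, and the choice to split the padding evenly on both sides of the window are all assumptions made for illustration, picked because they reproduce the haplotypes shown above.

```python
import math


def deletion_haplotype(ref, deletion, window_start, window_end, minhaplen=0):
    """Apply one deletion to ref[window_start:window_end].

    `deletion` is a (start, end) pair, 0-based and half-open.
    If fewer than `minhaplen` bases would remain, the window is padded
    on both sides (even split: an assumption of this sketch).
    """
    del_start, del_end = deletion
    kept = (window_end - window_start) - (del_end - del_start)
    if kept < minhaplen:
        pad = math.ceil((minhaplen - kept) / 2)
        window_start = max(0, window_start - pad)
        window_end = min(len(ref), window_end + pad)
    # Reference before the deletion + reference after the deletion
    return ref[window_start:del_start] + ref[del_end:window_end]


def shared_context_haplotypes(ref, del_a, del_b, minhaplen=0):
    """Build both haplotypes over the window spanning the two deletions."""
    start = min(del_a[0], del_b[0])
    end = max(del_a[1], del_b[1])
    return (
        deletion_haplotype(ref, del_a, start, end, minhaplen),
        deletion_haplotype(ref, del_b, start, end, minhaplen),
    )


REF = "ATCATCATC"
D1 = (2, 5)  # AT---CATC
D2 = (5, 8)  # ATCAT---C

print(shared_context_haplotypes(REF, D1, D2))               # ('CAT', 'CAT')
print(shared_context_haplotypes(REF, D1, D2, minhaplen=5))  # ('TCATC', 'TCATC')
```

Both calls return identical haplotypes for this contrived input, so the padding only changes how much sequence is compared, not the outcome.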

The reference context method of variant sequence comparison (--reference) will be deprecated in a future version of Truvari and has only been kept for backwards compatibility.

leone93 commented 2 months ago

Thanks Adam!