Open bartcharbon opened 1 month ago
Hi @bartcharbon
Thanks for checking out the VCF output. Could you provide an example of how symbolic alleles are specified so that I can implement it correctly?
Hi @readmanchiu
Attached is a zip file with an example VCF with symbolic alleles; I added the Straglr TSV for the same data as well. example.zip
Thanks @bartcharbon for the example. Is the motivation for outputting symbolic alleles mostly to avoid the messiness of outputting the sequence? I ask because the information they convey is already reported in the genotype column.
I guess the number reported in the symbolic allele will just echo the number reported in the GT column. Interruptions will be ignored.
Originally my concern was that there would be lots of <ALT>s reported, but I guess it's OK.
Anyway, I will include such an option in the next release.
I think that some of the older tools around (mostly short-read tools, for example ExpansionHunter) report it like this; as a result, some of our tooling for downstream analysis expects this kind of output.
The analysts in the lab are also used to this notation.
But you also make a valid case for using the actual sequence. I'll take a new look with that in mind, to see if this might actually fit in our pipeline.
I'm close to finishing the implementation of this option to output symbolic alleles. My assumption is that this will only be feasible for cases where the detected alleles have the same motif as the reference. For cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does that sound alright, or is there a symbolic-allele way to deal with this?
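A minimal sketch of the fallback described above, not Straglr's actual code: emit a symbolic allele when the allele motif matches the reference motif, and keep the literal sequence otherwise. The function name `choose_alt`, its parameters, and the `<STRn>` tag format are assumptions for illustration. The motif comparison is a plain case-insensitive string match and does not handle rotated motifs or interruptions.

```python
def choose_alt(allele_seq: str, allele_motif: str, ref_motif: str, copy_number: float) -> str:
    """Pick the ALT string for one allele (hypothetical helper).

    Returns a symbolic allele such as "<STR20>" when the allele motif equals
    the reference motif; otherwise returns the literal allele sequence.
    """
    if allele_motif.upper() == ref_motif.upper():
        # Assumed tag format: "<STR" + rounded copy number + ">"
        return f"<STR{round(copy_number)}>"
    # Different motif (e.g. an RFC1-like expansion): keep the sequence
    return allele_seq


# Same motif as the reference: symbolic allele
same_motif_alt = choose_alt("CAG" * 20, "CAG", "CAG", 20.0)

# Motif differs from the reference: literal sequence is kept
diff_motif_alt = choose_alt("AAGGG" * 3, "AAGGG", "AAAAG", 3.0)
```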
Great that you are implementing this feature, thanks!
Looking at the VCF 4.2 spec, symbolic alleles are meant for "imprecise structural variants". Based on that, I think symbolic alleles may be used even with a non-reference motif: since the ALT repeat motif is also in the output FORMAT fields, all the information is still available in the VCF file.
Disclaimer: I'm no expert on STRs in VCF; this issue is based on differences we noticed compared to some other tools we use or used before. We haven't seen many cases where the motif differs from the reference, because they are not present in the control samples we use for testing (either healthy or with a repeat that has the same motif as the reference). Therefore, I'm not 100% sure how other tools address these cases.
OK, I guess the copy number reported in symbolic alleles will then refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated, with interruptions and whatnot; these will be handled in a later release. I will update the documentation accordingly.
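Under that reading, a downstream consumer would pair the copy number in the symbolic allele with the motif reported for that allele, not with the reference motif. The sketch below shows one way this could look. The `<STRn>` pattern and the helper name `approximate_sequence` are assumptions, and the reconstructed sequence ignores interruptions, so it is only an approximation of the real allele.

```python
import re

# Assumed symbolic allele format "<STRn>"; the actual tag name may differ.
_SYMBOLIC = re.compile(r"^<STR(\d+)>$")


def approximate_sequence(alt: str, allele_motif: str) -> str:
    """Rebuild an approximate allele sequence from a symbolic ALT (hypothetical helper).

    The copy number is applied to the motif reported for the allele itself.
    A literal (non-symbolic) ALT is returned unchanged.
    """
    match = _SYMBOLIC.match(alt)
    if match is None:
        return alt  # already a literal sequence
    return allele_motif * int(match.group(1))


# Copy number 3 applied to the allele's own motif
rebuilt = approximate_sequence("<STR3>", "AAGGG")
```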
Hi @readmanchiu
For VCF output of STRs I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.
Is there a reason Straglr is outputting the full sequence as the ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead? That would greatly help us with integrating the tool into our pipeline.