bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Symbolic alleles in VCF output #40

Open bartcharbon opened 1 month ago

bartcharbon commented 1 month ago

Hi @readmanchiu

For VCF output of STRs, I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.

Is there a reason Straglr is outputting the full sequence as the ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead? That would greatly help us integrate the tool into our pipeline.

readmanchiu commented 1 month ago

Hi @bartcharbon

Thanks for checking out the VCF output. Could you provide an example of how symbolic alleles are specified so that I can implement it correctly?

bartcharbon commented 2 weeks ago

Hi @readmanchiu

Attached is a zip file with an example VCF that uses symbolic alleles; I added the Straglr TSV for the same data as well. example.zip

readmanchiu commented 2 weeks ago

Thanks @bartcharbon for the example. Is the motivation for outputting symbolic alleles mostly to avoid the messiness of outputting the sequence? I ask because the information it conveys is already reported in the genotype column. I guess the number reported in the symbolic allele will just echo the number reported in the GT column, and interruptions will be ignored. Originally my concern was that there would be lots of <ALT>s reported, but I guess that's OK. Anyway, I will include such an option in the next release.

bartcharbon commented 3 days ago

I think some of the older tools around (mostly short-read ones, for example ExpansionHunter) report it like this. As a result, some of our downstream analysis tooling expects this kind of output, and the analysts in the lab are used to this notation.

But you also make a valid case for using the actual sequence. I'll take another look with that in mind, to see whether it might actually fit in our pipeline.

readmanchiu commented 2 days ago

I'm close to finishing the implementation of this option to output symbolic alleles. My assumption is that this will only be feasible for cases where the detected alleles have the same motif as the reference. For cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does that sound alright, or is there a symbolic-allele way to deal with this?
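A minimal sketch of the fallback described above, not Straglr's actual code. The function name `choose_alt`, the `<STRn>` tag format (borrowed from the ExpansionHunter convention), and rounding the copy number to a whole unit count are all assumptions made for illustration.

```python
def choose_alt(ref_motif: str, allele_motif: str,
               copy_number: float, allele_seq: str) -> str:
    """Pick the ALT string for one detected allele (hypothetical helper).

    Symbolic allele when the allele's motif equals the reference motif,
    literal sequence otherwise (the RFC1-like case).
    """
    if allele_motif.upper() == ref_motif.upper():
        # Assumed tag format; rounding to whole units is also an assumption.
        return f"<STR{round(copy_number)}>"
    # Motif differs from the reference: fall back to the actual sequence.
    return allele_seq.upper()


# Same motif as the reference: symbolic allele.
print(choose_alt("AAAAG", "AAAAG", 20.0, "AAAAG" * 20))
# Different motif from the reference: literal sequence.
print(choose_alt("AAAAG", "AAGGG", 3.0, "AAGGG" * 3))
```

The comparison here is a plain string match; whether rotated forms of the same motif should count as equal is left open.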

bartcharbon commented 2 days ago

Great that you are implementing this feature, thanks!

Looking at the VCF 4.2 spec, symbolic alleles are meant for "imprecise structural variants". Based on that, I think symbolic alleles can be used even with a non-reference motif: since the ALT repeat motif is also in the output FORMAT fields, all the information is still available in the VCF file.

Disclaimer: I'm no expert on STRs in VCF; this issue is based on differences we noticed compared to some other tools we use or have used before. We haven't seen many cases where the motif differs from the reference, because they are not present in the control samples we use for testing (either healthy or with a repeat that has the same motif as the reference). Therefore I'm not 100% sure how other tools address these cases.

readmanchiu commented 1 day ago

OK, I guess then the copy number reported in symbolic alleles will refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated, with interruptions and whatnot; those will be handled in a later release. I will update the documentation accordingly.
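The agreed behaviour could look roughly like the sketch below, in which the count in the symbolic allele is taken in units of the allele's own motif and the motif is returned alongside so it can be written to a separate field. The helper name `symbolic_allele`, the `<STRn>` tag format, and deriving the copy number as sequence length divided by motif length are assumptions, not Straglr's implementation; interruptions are ignored.

```python
from typing import Tuple


def symbolic_allele(allele_seq: str, motif: str) -> Tuple[str, str]:
    """Return (symbolic ALT, motif) for one allele (hypothetical helper).

    The copy number is counted in units of the allele's own motif,
    whether or not that motif matches the reference.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    # Assumption: copy number = allele length / motif length, rounded.
    copies = round(len(allele_seq) / len(motif))
    return f"<STR{copies}>", motif.upper()


# An RFC1-like allele whose motif (AAGGG) differs from the reference (AAAAG).
alt, unit = symbolic_allele("AAGGG" * 400, "AAGGG")
print(alt, unit)
```

Because the tag carries only a count, two alleles with different motifs but equal copy numbers would produce the same ALT string, so the motif has to be reported elsewhere in the record.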