Open bartcharbon opened 5 months ago
Hi @bartcharbon
Thanks for checking out the VCF output. Could you provide an example of how symbolic alleles are specified so that I can implement it correctly?
Hi @readmanchiu
attached is a zip file with an example vcf with symbolics, I added the straglr tsv for the same data as well. example.zip
Thanks @bartcharbon for the example. Is the motivation for outputting symbolics mostly to avoid the messiness of outputting the sequence? 'cuz the information it conveys is already reported in the genotype column.
I guess the number reported in the symbolic allele will just echo the number reported in the GT column. Interruptions will be ignored.
Originally my concern is there will be lots of <ALT>
s reported, but I guess it's OK
Anyways, I will include such an options for the next release.
I think that some of the older tools around (mostly short read like for example expansion hunter) report it like this, as a result some of our tooling used for downstream analysis expects this kind of output.
As well as the analysts in the lab being used to this notation.
But you also make a valid case for the use of the actual sequence, I'll take a new look with that in mind, to see if this might actually fit in our pipeline.
I'm close to finish implementing this option to output symbolic alleles. My assumption is that this will only be plausible for cases where the alleles detected have the same motif as the reference. But for cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does it sound alright or is there a symbolic-allele way to deal with this?
Great that you are implementing this feature, thanks!
Looking at the vcf 4.2 spec symbolics are meant for "imprecise structural variants", based on that I think that even with a non reference motif symbolics are allowed to be used, since the ALT repeat motif is also in the output FORMAT fields all the information is still available in the VCF file.
Disclaimer: I'm no expert on STR's in VCF, this issue is based on differences we noticed compared to some other tools we use or used before. The cases where the motif differs from the reference are the ones we haven't seen much, due to the fact that they will not be present in control samples (either healthy or with a repeat that has the same motif as the reference) we use for testing, therefor I'm not 100% sure how other tools adress these cases.
ok, I guess then the copy number reported in symbolic alleles will refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated with interruptions and what not, which will be handled in a later release. I will update the documentation accordingly.
Is there anywhere that the actual sequence is shown in the output. WIth the new vcf format in Straglr 1.5.2 I now see
Thanks @ljohansson for showing me an actual example, really appreciate it. The CNV:TR notation should be able to take care of complex motifs like this, and its purpose is to not report the sequence, making the file more readable. I understand that right now Straglr does not handle complex motifs well, and probably the only way to let the user know about the existence of a complex locus is by reporting the actual sequence. I'm currently working on reporting complex motifs in cases exactly like this (a release can be hopefully made before the end of the year), but before this happens maybe I should provide an option to report the actual sequence in the VCF if the user prefers?
Hi @readmanchiu. Thank you for your quick answer and suggestions. Do I understand correctly that in my example the sequence is not displayed correctly, but that is could be? For instance if RUS:GAA,GAA would be RUS:GAA,GAAGGA with the necessary adaptations in the associated numbers. Could that also account for the GAAGAA sequences at that start and end of the sequence and for sequence interruptions? I am not sure if the old method of showing the sequence in the vcf is not according to standard that it is a good idea to use that method (I'm a big fan of standards). Perhaps it is an idea to (optionally) show the sequences in the tsv file as an extra column. This would also allow to see the separate reads instead of only the sequence in the read with the median repeat length.
Ideally for the second allele a (GAA)n(GGAGAA)n(GAA)n genotype would be called. Right now my code is reporting: (GAA)5(GGAGAA)44(gaa)1(GGAGAA)67(cga)1(gaa)1(GGAGAA)27(GAA)21 it's already able to identify the "major" motifs, but it would be perfect to treat the interruptions (lowercased) as noise and not real interruption within the GGAGAA tract (which I think the second one can be re-classified as an erroneous copy, but not the first one with just a GAA) The standard does allow sequence to be reported in the VCF record, I just need to conceive some logic to decide when to do it
Hi @readmanchiu, This seems very nice already. Beware of polishing away interruptions. I have no examples, but I heard that for some repeats interruptions may affect the classification of a repeat. E.g. if you have a CGA in the middle it is not pathogenic, but if it is not it is. I am not sure about this, but it might be the case. Is the sequence shown by your code the sequence of the median read, or is it a consensus betweeen all reads in the cluster? If it would be some kind of consensus read it may be possible to distinguish sequencing errors from real interruptions.
And a different question. In the new 1.5.2 format shows the CIRUC (confidence intervals), while the 1.5.1 format showed the (ACR allelic copy ranges). How do these compare? Is it RUC +- CIRUS = ACR?
In addition, is there an option to get the AL and ALR in the output? Then I can check if the output is correct. For instance in the example when the the GAAGGA pattern is called as a GAA pattern and RUC is 300, I guess this means that there are in fact 150 GAAGGA RU? If AL would be 900 I could confirm this (in contrast to an AL of 1800).
Thank you so much for your insights/input. I will definitely let you know when the code is ready for your testing.
Thank you for all your efforts and your great tool.
Hi @readmanchiu
For VCF output of STR's I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.
Is there a reason Straglr is outputing the full sequence as ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead? That would greatly help us with integrating the tool in our pipeline.