bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
69 stars 8 forks source link

Symbolic alleles in VCF output #40

Open bartcharbon opened 5 months ago

bartcharbon commented 5 months ago

Hi @readmanchiu

For VCF output of STR's I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.

Is there a reason Straglr is outputing the full sequence as ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead? That would greatly help us with integrating the tool in our pipeline.

readmanchiu commented 5 months ago

Hi @bartcharbon

Thanks for checking out the VCF output. Could you provide an example of how symbolic alleles are specified so that I can implement it correctly?

bartcharbon commented 5 months ago

Hi @readmanchiu

attached is a zip file with an example vcf with symbolics, I added the straglr tsv for the same data as well. example.zip

readmanchiu commented 5 months ago

Thanks @bartcharbon for the example. Is the motivation for outputting symbolics mostly to avoid the messiness of outputting the sequence? 'cuz the information it conveys is already reported in the genotype column. I guess the number reported in the symbolic allele will just echo the number reported in the GT column. Interruptions will be ignored. Originally my concern is there will be lots of <ALT>s reported, but I guess it's OK Anyways, I will include such an options for the next release.

bartcharbon commented 4 months ago

I think that some of the older tools around (mostly short read like for example expansion hunter) report it like this, as a result some of our tooling used for downstream analysis expects this kind of output.

As well as the analysts in the lab being used to this notation.

But you also make a valid case for the use of the actual sequence, I'll take a new look with that in mind, to see if this might actually fit in our pipeline.

readmanchiu commented 4 months ago

I'm close to finish implementing this option to output symbolic alleles. My assumption is that this will only be plausible for cases where the alleles detected have the same motif as the reference. But for cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does it sound alright or is there a symbolic-allele way to deal with this?

bartcharbon commented 4 months ago

Great that you are implementing this feature, thanks!

Looking at the vcf 4.2 spec symbolics are meant for "imprecise structural variants", based on that I think that even with a non reference motif symbolics are allowed to be used, since the ALT repeat motif is also in the output FORMAT fields all the information is still available in the VCF file.

Disclaimer: I'm no expert on STR's in VCF, this issue is based on differences we noticed compared to some other tools we use or used before. The cases where the motif differs from the reference are the ones we haven't seen much, due to the fact that they will not be present in control samples (either healthy or with a repeat that has the same motif as the reference) we use for testing, therefor I'm not 100% sure how other tools adress these cases.

readmanchiu commented 4 months ago

ok, I guess then the copy number reported in symbolic alleles will refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated with interruptions and what not, which will be handled in a later release. I will update the documentation accordingly.

ljohansson commented 1 week ago

Is there anywhere that the actual sequence is shown in the output. WIth the new vcf format in Straglr 1.5.2 I now see

, where previously the sequences were shown. I have a case (FGF14) where one allele has a GAAGAA sequence and the other a GAAGGA sequence (the first is pathogenic if too long, the second not). The vcf file suggests that both alleles have the GAA pattern, which is not true. > chr13 102161576 . A , . PASS LOCUS:FGF14;RUS_REF:GAA;SVLEN:150,150;RN:1,1;RUS:GAA,GAA;RUC:302.0,351.3;CIRUC:-29.3,12.7,-5.3,12.4 GT:DP:AD 1/2:40:18/14 Allele1 (random read) > ecc85cf8-957c-4a16-b13d-894c41d9f008 > CTTTCTGTGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAAAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA Allele2 (random read) > 62b4b3f2-ec9d-4af8-b7c7-a46a396d700f > CTTTCTGTGAAGAAGAAGAAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGAAGGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAACGAGAAAGGAGAAGGAGAAGGAGAAGGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA Is there a workaround? The vcf files now feel a bit 'broken' to me not knowing the actual variant.
readmanchiu commented 1 week ago

Thanks @ljohansson for showing me an actual example, really appreciate it. The CNV:TR notation should be able to take care of complex motifs like this, and its purpose is to not report the sequence, making the file more readable. I understand that right now Straglr does not handle complex motifs well, and probably the only way to let the user know about the existence of a complex locus is by reporting the actual sequence. I'm currently working on reporting complex motifs in cases exactly like this (a release can be hopefully made before the end of the year), but before this happens maybe I should provide an option to report the actual sequence in the VCF if the user prefers?

ljohansson commented 6 days ago

Hi @readmanchiu. Thank you for your quick answer and suggestions. Do I understand correctly that in my example the sequence is not displayed correctly, but that is could be? For instance if RUS:GAA,GAA would be RUS:GAA,GAAGGA with the necessary adaptations in the associated numbers. Could that also account for the GAAGAA sequences at that start and end of the sequence and for sequence interruptions? I am not sure if the old method of showing the sequence in the vcf is not according to standard that it is a good idea to use that method (I'm a big fan of standards). Perhaps it is an idea to (optionally) show the sequences in the tsv file as an extra column. This would also allow to see the separate reads instead of only the sequence in the read with the median repeat length.

readmanchiu commented 5 days ago

Ideally for the second allele a (GAA)n(GGAGAA)n(GAA)n genotype would be called. Right now my code is reporting: (GAA)5(GGAGAA)44(gaa)1(GGAGAA)67(cga)1(gaa)1(GGAGAA)27(GAA)21 it's already able to identify the "major" motifs, but it would be perfect to treat the interruptions (lowercased) as noise and not real interruption within the GGAGAA tract (which I think the second one can be re-classified as an erroneous copy, but not the first one with just a GAA) The standard does allow sequence to be reported in the VCF record, I just need to conceive some logic to decide when to do it

ljohansson commented 3 days ago

Hi @readmanchiu, This seems very nice already. Beware of polishing away interruptions. I have no examples, but I heard that for some repeats interruptions may affect the classification of a repeat. E.g. if you have a CGA in the middle it is not pathogenic, but if it is not it is. I am not sure about this, but it might be the case. Is the sequence shown by your code the sequence of the median read, or is it a consensus betweeen all reads in the cluster? If it would be some kind of consensus read it may be possible to distinguish sequencing errors from real interruptions.

And a different question. In the new 1.5.2 format shows the CIRUC (confidence intervals), while the 1.5.1 format showed the (ACR allelic copy ranges). How do these compare? Is it RUC +- CIRUS = ACR?

In addition, is there an option to get the AL and ALR in the output? Then I can check if the output is correct. For instance in the example when the the GAAGGA pattern is called as a GAA pattern and RUC is 300, I guess this means that there are in fact 150 GAAGGA RU? If AL would be 900 I could confirm this (in contrast to an AL of 1800).

readmanchiu commented 19 hours ago

Thank you so much for your insights/input. I will definitely let you know when the code is ready for your testing.

ljohansson commented 10 hours ago

Thank you for all your efforts and your great tool.