PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

AL value is not within the range (ALLR) #48

Open jakobht96 opened 1 week ago

jakobht96 commented 1 week ago

Hi,

We are interested in RFC1 in human samples for the diagnosis of CANVAS, and we have a patient with a clinical CANVAS diagnosis. After running the analysis, we get an AL value that is lower than the ALLR range, and we have been discussing how to interpret this. We first thought AL was an average, but we now understand that AL should perhaps be construed as an "educated guess"/consensus, where all the reads add to the evidence for the length. Is this understanding correct?

We can see that every read in this region is a bit funky and very heterogeneous.

The allele whose length is outside ALLR: [image]

Excerpt from the VCF file:

GT:AL:ALLR:SD:MC:MS:AP:AM 1/2:2938,3534:2511-3181,3587-3944:29,6:0_594_0_0_0_0_0_0,40_611_46_0_2_0_9_10:1(0-2938),1(0-1174)_2(1174-1201)_1(1201-1500)_7(1500-1541)_1(1541-2545)_7(2545-2551)_2(2551-2595)_6(2595-2618)_1(2618-2690)_4(2690-2700)_2(2700-2828)_0(2828-2846)_7(2846-2856)_6(2856-2873)_1(2873-2888)_0(2888-2942)_0(2958-2971)_2(2971-2979)_1(2979-3077)_0(3077-3165)_1(3184-3514)_0(3514-3534):0.952189,0.879505:0.52,0.42

Please share some thoughts.

//Jakob

pbsena commented 1 week ago

Hello,

Thank you for sharing this very interesting RFC1 example. All the information shown in the VCF file refers to the consensus sequence, which is built from the multiple sequence alignment of the reads assigned to each allele. This is the top, higher-contrast sequence shown in TRGT plot outputs. When reads are aligned to this consensus, the regions in individual reads are always colored according to the motif segmentation of the consensus, even if the reads are more heterogeneous and an individual read sequence does not resemble the consensus motifs.

I hope this helps with interpreting the output and how the plots are colored. We'd be happy to discuss these results further.

egor-dolzhenko commented 1 week ago

Thank you for sharing this example, Jakob. Just to add to Guilherme’s reply, yes your understanding is correct. ALLR is currently just the range of repeat sizes observed in reads and so, in rare cases, it might be inconsistent with the consensus allele sequence. We will add better size intervals to the list of future TRGT improvements.
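
For readers who want to check this on their own calls, here is a minimal sketch in plain Python (no VCF library) that splits the FORMAT and sample columns quoted earlier in this thread and reports whether each AL falls inside its ALLR interval. The function names `parse_sample` and `al_vs_allr` are hypothetical helpers, not part of TRGT, and the MC and MS values are replaced by `.` placeholders purely to keep the example short.

```python
def parse_sample(format_keys: str, sample_values: str) -> dict:
    """Pair the colon-separated FORMAT keys with the sample's values."""
    keys = format_keys.split(":")
    values = sample_values.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample columns have different lengths")
    return dict(zip(keys, values))


def al_vs_allr(fields: dict) -> list:
    """Compare each allele length (AL) with its read-length range (ALLR).

    Assumes both fields are present and numeric (no '.' missing values).
    """
    lengths = [int(x) for x in fields["AL"].split(",")]
    ranges = [tuple(int(v) for v in r.split("-"))
              for r in fields["ALLR"].split(",")]
    report = []
    for length, (low, high) in zip(lengths, ranges):
        if length < low:
            status = "below"
        elif length > high:
            status = "above"
        else:
            status = "within"
        report.append((length, low, high, status))
    return report


# Values copied from the record in this thread; MC and MS are shortened to
# '.' placeholders here (an edit made for this example, not TRGT output).
FORMAT = "GT:AL:ALLR:SD:MC:MS:AP:AM"
SAMPLE = "1/2:2938,3534:2511-3181,3587-3944:29,6:.:.:0.952189,0.879505:0.52,0.42"

for length, low, high, status in al_vs_allr(parse_sample(FORMAT, SAMPLE)):
    print(f"AL={length} ALLR={low}-{high} -> {status}")
```

For the record above, the first allele (2938) lies within 2511-3181, while the second (3534) falls below 3587-3944, which is the discrepancy being discussed.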

Are you using ALLR to assess the confidence of an allele call? Or are you interested in profiling repeat heterogeneity / mosaicism?

Best wishes, Egor

jakobht96 commented 1 week ago

Hi both,

Thank you for your input, that is great. We are currently trying to validate the method for diagnostic purposes, and we would like to have some "confidence interval" or similar to say how certain we are that the length is correct. We also look at the purity, which, as I understand it, is a measure of how well the reads fit the consensus, or how heterogeneous the reads are, right? I can add that this example is actually a PureTarget run where the sample has been loaded twice, because we see that RFC1 only gets a few reads in pathogenic repeats. Here we actually find that the consensus changes and more of the "grey" regions are resolved. But we want to find some measures that can tell our Clinical Laboratory Geneticists whether a result should be trusted or interpreted with uncertainty.

One more suggestion is to add some of the data from the VCF file as a legend to the plot figure (maybe you are working on that), and maybe a separate figure with a histogram could be added?

To answer whether we look for heterogeneity: we have a case of that as well in FMR1, but there we look at both the allele and waterfall plots.

Once again, thank you. I will look forward to future updates. This tool has great potential!

egor-dolzhenko commented 1 week ago

Good to know, thank you! We can help with defining some additional measures / visualizations that might help your geneticists assess a given repeat expansion call. Would you be open to moving this conversation to email so that we can discuss your data in a bit more detail?

To answer your other question, the purity score just measures how close your consensus allele sequence is to a perfect repeat. A purity score of 1.0 means that the allele is a perfect repeat, while a purity score close to 0.0 means that you are dealing with mostly non-repetitive sequence.
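
As a rough illustration of that idea only (this is not TRGT's actual scoring, which this thread does not spell out), the toy function below compares an allele sequence with a perfect repeat of a single motif of the same length and converts the edit distance into a score between 0 and 1. The names `edit_distance` and `toy_purity`, the single-motif assumption, and the normalization are all assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def toy_purity(allele: str, motif: str) -> float:
    """Toy purity: 1.0 for a perfect repeat of `motif`, lower otherwise.

    Hypothetical measure for illustration; assumes one motif and that the
    repeat starts in phase at the first base of the allele.
    """
    if not allele or not motif:
        raise ValueError("allele and motif must be non-empty")
    copies = -(-len(allele) // len(motif))  # ceiling division
    perfect = (motif * copies)[:len(allele)]
    return 1.0 - edit_distance(allele, perfect) / len(allele)


print(toy_purity("AAGGGAAGGG", "AAGGG"))            # perfect repeat
print(round(toy_purity("AAGGGAAAGG", "AAGGG"), 3))  # one mismatch in ten bases
```

A perfect repeat scores 1.0 and each interruption lowers the score. Unlike the real purity score, this toy version will not necessarily approach 0.0 for non-repetitive sequence; it is only meant to show the direction of the measure.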

jakobht96 commented 4 days ago

Hi Egor,

That would be very helpful. I have sent you an email.

// Jakob

egor-dolzhenko commented 3 days ago

Thank you!