PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

Large expansions without reads covering the entire repeat region #44

Open gabeng opened 6 days ago

gabeng commented 6 days ago

Hi,

Thank you for this great tool and for diligently answering questions here. If I understand correctly, TRGT only analyzes reads that span the entire repeat region. It therefore fails to report any haplotype where the repeat expansion exceeds the read length. Is that correct? Are there any plans to also report evidence for these large repeats in the future? Even a lower bound on the repeat size would be useful.

Regards, Ben

egor-dolzhenko commented 6 days ago

Hi Ben,

Thanks for a great question. That's right, TRGT only uses reads that span the entire repeat region. We are definitely planning to add support for reads that partially overlap the repeat in future versions of TRGT.

Do you by any chance have a sample with a known pathogenic repeat expansion exceeding HiFi read length? Originally we planned to add support for very long repeats much earlier, but then it turned out that all the very long expansions we had access to were detectable with the current TRGT approach. Perhaps long repeat expansions tend to be highly mosaic, which allows us to fully capture the expanded alleles within 15 kb+ reads? (This of course applies to known pathogenic repeats and not to other very long repeats in the human genome.)

Best wishes, Egor
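
To make the distinction between spanning and partially overlapping reads concrete, here is a minimal sketch of how reads might be sorted into the two groups and how a lower bound on the repeat size could be taken from the partially overlapping ones. This is not TRGT's implementation. The `Read` record, the `repeat_bases` field (assumed to be precomputed from the alignment, for example from soft-clipped or inserted sequence), the `min_flank` threshold, and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    """Hypothetical, simplified view of one aligned read."""
    name: str
    ref_start: int     # 0-based start of the aligned reference interval
    ref_end: int       # exclusive end of the aligned reference interval
    repeat_bases: int  # read bases attributed to the repeat (assumed precomputed)


def classify(read: Read, repeat_start: int, repeat_end: int,
             min_flank: int = 50) -> str:
    """Label a read as 'spanning', 'flanking', or 'other'.

    A read counts as anchored on a side when it covers at least
    `min_flank` reference bases beyond that end of the repeat.
    """
    left_anchor = read.ref_start <= repeat_start - min_flank
    right_anchor = read.ref_end >= repeat_end + min_flank
    if left_anchor and right_anchor:
        return "spanning"
    if (left_anchor or right_anchor) and read.repeat_bases > 0:
        return "flanking"
    return "other"


def lower_bound_from_flanking(reads: List[Read], repeat_start: int,
                              repeat_end: int,
                              min_flank: int = 50) -> Optional[int]:
    """Longest stretch of repeat seen in any flanking read, or None.

    The allele that produced such a read must contain at least this
    many repeat bases, so the value is a lower bound, not a genotype.
    """
    sizes = [r.repeat_bases for r in reads
             if classify(r, repeat_start, repeat_end, min_flank) == "flanking"]
    return max(sizes) if sizes else None


# Toy example: reference repeat at [1000, 1030)
reads = [
    Read("span", 900, 1200, 30),     # anchored on both sides
    Read("left", 800, 1030, 2500),   # anchored left, runs into the repeat
    Read("right", 1000, 1400, 1800), # anchored right, starts in the repeat
    Read("far", 100, 500, 0),        # does not reach the repeat
]
labels = [classify(r, 1000, 1030) for r in reads]
bound = lower_bound_from_flanking(reads, 1000, 1030)
```

In this toy data only the `span` read would be usable by a spanning-reads-only approach, while the two flanking reads suggest an allele of at least 2500 repeat bases. A real implementation would also need to check that the flanking sequence actually matches the repeat motif.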

gabeng commented 5 days ago

Hi Egor,

Thanks for the quick response. I was hoping that you'd add this functionality. I am looking at hybrid capture data with an average read length of 3-4 kb. Maybe if I extract the reads around the repeat I can share the data. I have to check. I also noticed that there are very, very few reads supporting the presence of a large expansion (compared to the other allele). My first guess was selection bias in the library prep/capture process. But I cannot rule out mosaicism. It's interesting that you see that correlation in whole genome data. I do not know the exact size of the expansions, just a lower limit.

I am going to shelve my validation data for now, but will be happy to pick this up when you make modifications to the algorithm.

Thanks again! Ben

egor-dolzhenko commented 5 days ago

Hi Ben,

I see, thanks. Does your hybrid capture protocol involve PCR amplification? In my experience, PCR can lead to complete or nearly complete dropout of the expanded alleles. If you’d like, we could create a one-off version of TRGT that uses flanking reads to help evaluate your data. Let’s connect by email if this is something you’d like to explore.

Best wishes, Egor

gabeng commented 3 days ago

Hi Egor,

Yes, like probably every hybrid capture protocol, this one includes a few cycles of PCR. Under-representation of expanded alleles is to be expected, you are right. That is why we are looking at cranking up sensitivity as much as possible. We will have to monitor the effect on precision. I'd be happy to test any development version that you can throw at me. Thanks!

Regards, Ben