PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

about trgt plot on complex tr #43

Open WeiCSong opened 2 months ago

WeiCSong commented 2 months ago

Hi, I ran trgt on the version 1.0 trgt annotation (https://github.com/PacificBiosciences/trgt/issues/37#issuecomment-2274126461), and one of the complex TRs:

chr19 53958673 53959528 ID=chr19_53958673_53959528;MOTIFS=CCCCACCCCTC,CCAGGTACCTTCTACCAT,CA,TC,CCTCCCCCAATTTCTC,TCTCCCTCCC,CCCT,CCCTCCCCCTCCCT,TCTCTCCCTC,GTCCCT,CT,TCTCTCTCTGGATA,TG,TC,TCTCTGGTCTC;STRUC=GGGCTTCCTTCGGGTGCATCCCCAG(CCCCACCCCTC)n(CCAGGTACCTTCTACCAT)n(CA)n(TC)n(CCTCCCCCAATTTCTC)n(TCTCCCTCCC)n(CCCT)n(CCCTCCCCCTCCCT)n(TCTCTCCCTC)n(GTCCCT)n(CT)n(TCTCTCTCTGGATA)n(TG)n(TC)n(TCTCTGGTCTC)nCAGCTCCGCACTTTACCCAGCGACA

is quite confusing in the trgt plot. Below is the genotype:

4_4_5_3_3_12_3_0_18_4_39_4_0_0_10,4_4_8_5_3_3_0_7_18_4_40_4_0_0_10
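To make the genotype easier to read against the annotation, the counts can be paired with their motifs programmatically. The following is a minimal sketch, not part of trgt: it assumes the genotype string lists one underscore-separated count per motif for each allele (alleles separated by commas), in the same order as the MOTIFS field. The function name `pair_motif_counts` is hypothetical.

```python
def pair_motif_counts(motifs_field, counts_field):
    """Pair each motif with its count, per allele.

    Assumption: counts are listed in the same order as the MOTIFS field,
    with "_" between motifs and "," between alleles.
    """
    motifs = motifs_field.split(",")
    alleles = []
    for allele in counts_field.split(","):
        counts = [int(c) for c in allele.split("_")]
        if len(counts) != len(motifs):
            raise ValueError(
                f"expected {len(motifs)} counts, got {len(counts)}"
            )
        alleles.append(list(zip(motifs, counts)))
    return alleles


# Values copied from the annotation and genotype in this post.
MOTIFS = (
    "CCCCACCCCTC,CCAGGTACCTTCTACCAT,CA,TC,CCTCCCCCAATTTCTC,TCTCCCTCCC,"
    "CCCT,CCCTCCCCCTCCCT,TCTCTCCCTC,GTCCCT,CT,TCTCTCTCTGGATA,TG,TC,"
    "TCTCTGGTCTC"
)
COUNTS = "4_4_5_3_3_12_3_0_18_4_39_4_0_0_10,4_4_8_5_3_3_0_7_18_4_40_4_0_0_10"

alleles = pair_motif_counts(MOTIFS, COUNTS)

# Print one row per motif with the count from each allele side by side.
for (motif, count_1), (_, count_2) in zip(alleles[0], alleles[1]):
    print(f"{motif:<20} {count_1:>3} {count_2:>3}")
```

Note that TC appears twice in the MOTIFS list, so the pairs are kept as an ordered list rather than a dictionary keyed by motif.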

and below is the plot: image

It is not easy to match each TR count to its motif in the plot by eye, and the order does not seem to match the STRUC in the annotation (the first and second motifs on the left do not appear to repeat four times?). I also retrieved the hg38 reference sequence of this region:

chr19:53958673-53959528 TGGGCTTCCTTCGGGTGCATCCCCAGTCTCTGTGTCTGCCTCTGTTTCTCTGGATCTCTC TCTCTTCCTGTCTCAGTCCCTCTCCGTCTGTCTCTCTCTGGGTCTCCCTCTTTTCCTGCA CTTGCCTTTCTTTCCCCAGGTACCTTCTGCCATCCAGGCCCTTCTACCCTCCATTTCTTT CATTTACTATCTTGCTCCCCCGCTCCCTCTCCGCATCTTCTTCTCTCTTTAAAGCTTCCT TCTCTCTAGCAAGACCTTGCCCCCATCCCCAATTTCTCTCCCTACCCGCTCTCCATTTCA TTCCCTCTCCCCCCTCTCTCCCCCATCCTCTCAGTCTCTCTCTTTCTGTCTGTCTCTCCC CATCTGTCCCTCCTCCCTCCTCTCTGGATGACTCTCTCCCTCTCTCTTCTTTCCTTCTGT TTCCCAGATCCTGACCCCCCCCACACACACACTTACCCCAGCCCTCCCCCACCCCCTCCC CCCCAGCCCCTGCGTTTCTCTCTTTGAGTCTCTGTCCCTGTCCCCTTCTCTCTCTGGATA TTTCTCTGTGTGTATCTCTCCAACTTCCTTCTGCTTCCAGCCGCTGCCTCCCCCAAATTT CTCCCTCCCCCATTTCTGTTTCGCGGTCTCTGGGTCTCTCTCTTTCCGTTTCTCCTCGTC TCTCTCTGTCTCTCTCCCTCCCTCTCTGGATCTCTCTCTTCTCCTCCGGCTTCCTTCTGC CACTCGACCCTGCCCCCCTCTTTCCCTTCCCCCATCCATCTCCTCTCCGAGGCTCCCCAT CCCTCAGCAGCTCCCCTCCCCCTCCCTCCCTACTCCCTCCCTCTCGTCCTCCAGCTCCGC ACTTTACCCAGCGACA

It was also not easy to match each motif and its order to the reference sequence. I am not sure whether my interpretation of the plot is wrong or my sequence data is problematic. Hope to learn from you!
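One rough way to check where a motif sits in a sequence such as the reference above is to scan for its longest run of back-to-back exact copies. The sketch below is an illustration only and is not how trgt segments reads: it uses exact string matching, so imperfect copies are missed, and the helper name `longest_tandem_run` is hypothetical.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return (start, copies) for the longest run of exact, consecutive
    copies of `motif` in `sequence`; (-1, 0) if the motif never occurs.

    A lookahead is used so that runs starting at every position are
    considered, including ones that overlap an earlier, shorter run.
    """
    if not motif:
        raise ValueError("motif must be non-empty")
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    best_start, best_copies = -1, 0
    for match in pattern.finditer(sequence):
        copies = len(match.group(1)) // len(motif)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Toy sequence for illustration (not the reference above).
start, copies = longest_tandem_run("GGCACACATT", "CA")
print(start, copies)
```

Running this for each annotated motif against the reference or an allele sequence gives a quick, if crude, picture of which motifs occur as clean runs and in what order.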

I also have two suggestions for the trgt plot function: 1) Is it possible to plot the reference TR at the bottom for easier comparison? 2) The color legend seems broken when there are too many motifs, and some of them fall outside the visible range. Can they be shown in multiple rows?

egor-dolzhenko commented 2 months ago

Thank you for the detailed feedback @WeiCSong! I agree that current TRGT plots can be confusing for very complex repeats like this. We are working on better support for complex repeats and hopefully will have something closer to the end of the year. Can we send you some prototype plots for feedback? Also, we are about to deprecate the STRUC field (the latest versions of TRGT will report a segmentation that doesn't follow the order of motifs specified in the STRUC field if they find such a segmentation).

WeiCSong commented 2 months ago

Great, @egor-dolzhenko! Please send them to me at song628196@gmail.com

egor-dolzhenko commented 2 months ago

Thanks, will do!