Question about repeat size

HLHsieh commented 1 year ago

Hi there,

I am currently analyzing some samples and came across this result from one of my analyses:

##RepeatRegion=chr11-639647-640306-CCCCGCGCCCGGCCTTCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCC
#Read_Name  Allele_ID   Phasing_Confidence  Repeat_Size
DRD4-c2_620249_30430_R_14_29029_0   1   HIGH    0.0
DRD4-c2_634223_28127_F_7_27160_7    2   HIGH    9.0

I would like to confirm whether the "Repeat_Size" column is an absolute estimate or a size relative to the reference. Additionally, I am curious as to why the algorithm reported a read with a repeat size of 0. Based on these results, the variance between reads appears to be somewhat large.

Any comments or suggestions would be greatly appreciated.

Best, Hsin

fangli80 commented 1 year ago

The repeat size is an absolute value and is not relative to the reference genome. If the repeat size is 0, it means that the read contains less than half (if any) of a repeat unit. If the results seem incorrect, could you please send the two reads to me (fangli2718@gmail.com) so that I can check them and improve the tool?
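To spot such reads quickly, the per-read output can be parsed and filtered for a repeat size of 0. The following is only a minimal sketch: it assumes the whitespace-separated column layout shown in the sample above, and `parse_repeat_sizes` is a hypothetical helper, not part of the tool.

```python
def parse_repeat_sizes(text):
    """Parse per-read output text into (region, rows).

    Assumes a '##...=region' line, a '#'-prefixed header line, and
    whitespace-separated data rows, as in the sample in this thread.
    """
    region, header, rows = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):
            region = line[2:].split("=", 1)[1]
        elif line.startswith("#"):
            header = line[1:].split()
        else:
            row = dict(zip(header, line.split()))
            row["Repeat_Size"] = float(row["Repeat_Size"])
            rows.append(row)
    return region, rows


sample = """\
##RepeatRegion=chr11-639647-640306-CCCCGCGCCCGGCCTTCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCC
#Read_Name  Allele_ID   Phasing_Confidence  Repeat_Size
DRD4-c2_620249_30430_R_14_29029_0   1   HIGH    0.0
DRD4-c2_634223_28127_F_7_27160_7    2   HIGH    9.0
"""

region, rows = parse_repeat_sizes(sample)
# Reads that contain less than half of a repeat unit are reported as 0.
zero_reads = [r["Read_Name"] for r in rows if r["Repeat_Size"] == 0.0]
print(zero_reads)
```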

Thanks, Li

HLHsieh commented 1 year ago

Hi Li,

I have sent you an email and attached the reads along with extra information. I am wondering whether you have any suggestions on this issue.

Thanks, Hsin

fangli80 commented 1 year ago

Your email was in my spam folder, so I didn't see it. I will check it today.

fangli80 commented 1 year ago

I can see that your repeat region is chr11:639647-640306. I used Tandem Repeats Finder (TRF) to detect repeats in this region, and only part of it (chr11:639988-640194) is actually a tandem repeat.

In the repeat bed file, the region between start_position and end_position should only contain the tandem repeats and should not include the nearby non-repeat regions. This was stated in the IMPORTANT NOTICE section of the README.md file.

If you replace your bed file with the following one, you should get the correct output.

repeat_region-modified.zip

$ cat hsin.chr11-639988-640194-CGCCCCCCGCGCCCGGCCTC....ACTGTG.repeat_size.txt 
##Repeat_Region=chr11-639988-640194-CGCCCCCCGCGCCCGGCCTCCCCCAGGACCCCTGCGGCCCCGACTGTG
#Read_Name  Repeat_Size
DRD4-c2_626869_aligned_28938_R_0_31434_15   3.0
DRD4-c2_629373_aligned_182296_F_0_18129_1   3.0
DRD4-c2_631764_aligned_375705_F_12_8926_2   3.0
DRD4-c2_635146_aligned_144211_R_10_14389_7  3.0
DRD4-c2_636227_aligned_94657_F_19_15523_20  3.0
DRD4-c2_636260_aligned_310683_R_19_20398_5  3.0
DRD4-c2_637545_aligned_49735_F_7_3867_6 3.0
DRD4-c2_638893_aligned_465385_R_1_5345_1    3.0

The repeat size is three, whereas the reference genome has four copies. This is consistent with the IGV plot of the aligned BAM file, in which we can see a 40-50 bp deletion in the alignments (roughly one 48 bp repeat unit).
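The numbers above can be sanity-checked with some quick arithmetic. This is only a rough sketch: it assumes the region span is end minus start, and `implied_copies` is a hypothetical helper rather than anything the tool provides.

```python
def implied_copies(start, end, unit):
    """Region span divided by the repeat-unit length (span assumed to be end - start)."""
    return (end - start) / len(unit)


# Repeat units copied from the two region names in this thread.
original_unit = "CCCCGCGCCCGGCCTTCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCC"
trimmed_unit = "CGCCCCCCGCGCCCGGCCTCCCCCAGGACCCCTGCGGCCCCGACTGTG"

print(len(trimmed_unit))                                        # 48
print(round(implied_copies(639647, 640306, original_unit), 1))  # 13.7
print(round(implied_copies(639988, 640194, trimmed_unit), 1))   # 4.3
```

The trimmed region spans about four units, which matches the four reference copies, and losing one 48 bp unit matches the 40-50 bp deletion. The original region spans far more units than the repeat actually has, which suggests it includes flanking non-repeat sequence.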

I may add a warning message in the future if the bed file contains non-repeat regions. But for now, please make sure that the bed file only includes the repeat that you want to quantify.

Thanks, Li

fangli80 commented 1 year ago

TRF: https://github.com/lh3/TRF-mod

HLHsieh commented 1 year ago

Hi Li,

Thank you for your assistance. I have made the necessary corrections to the repeat bed file, ensuring that it only contains tandem repeats. This adjustment has resolved the issue and improved the performance.

However, I have an additional question regarding the identification of regions. As you mentioned, the region between the start_position and end_position should exclusively consist of tandem repeats and should not include nearby non-repeat regions. Consider the case raised in issue #7. The GGCCCC repeat detected by RepeatMasker in hg38 is located at chr9:27573485-27573546, and its sequence is as follows:

>hg38_rmsk_(GCCCCG)n range=chr9:27573485-27573546 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GCCCCGCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCC

It appears that this sequence does not solely consist of tandem repeats. As a result, I tried using chr9 27573528 27573546 GGCCCC, which contains only three consecutive repeats.
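For reference, those coordinates can be reproduced from the sequence above by locating the longest run of exact GGCCCC copies. This is only a sketch: it assumes the range in the FASTA header is 1-based and inclusive (the sequence length, 62, equals end - start + 1) and that the bed start is 0-based. `longest_pure_run` is a hypothetical helper.

```python
import re


def longest_pure_run(seq, unit):
    """Return (start, end) 0-based offsets of the longest run of exact unit copies, or None."""
    best = None
    for m in re.finditer(f"(?:{re.escape(unit)})+", seq):
        if best is None or (m.end() - m.start()) > (best[1] - best[0]):
            best = (m.start(), m.end())
    return best


seq = "GCCCCGCCCCGGGCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCC"
range_start = 27573485  # 1-based start taken from the FASTA header
unit = "GGCCCC"

run = longest_pure_run(seq, unit)
copies = (run[1] - run[0]) // len(unit)
bed_start = range_start - 1 + run[0]  # convert to a 0-based bed start
bed_end = range_start - 1 + run[1]

print(copies)              # 3
print(bed_start, bed_end)  # 27573528 27573546
```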

My question is whether the regions identified by TRF or RepeatMasker, even if they do not exclusively contain pure repeats, are still applicable to the algorithm. I would appreciate your thoughts on this matter.

Best regards, Hsin

fangli80 commented 1 year ago

You can include regions that are imperfect repeats. The region doesn't need to consist of pure repeats.
You can directly use the repeat region identified by RepeatMasker or TRF. I personally prefer TRF over RepeatMasker because TRF can sometimes find tandem repeats that RepeatMasker misses.

Thanks, Li

WGLab / NanoRepeat

Question about repeat size #8