bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
61 stars 9 forks source link

Straglr could not find some TR that clearly existed #3

Closed LiShuhang-gif closed 2 years ago

LiShuhang-gif commented 2 years ago

Hi, I'm using Straglr to detect TR based on my PacBio HiFi data. My parameters are shown below:

straglr.py M1.sort.filter.bam hg38_22_XYM.fa straglr_scan_min_ins50.tsv \
 --min_ins_size 50 \
 --genotype_in_size \
 --max_num_clusters 2 \
 --min_support 2 \
 --nprocs 23

However, I found that there were some TR that I could clearly see on igV that I couldn't find with Straglr, and I wondered if I was setting the parameters incorrectly.I set the insert length to be greater than 50bp and need two reads to support it, but I think the insert sequence meets both conditions.

微信图片编辑_20211028100525

As shown in the figure, there is a tandem repeat expansion exists in both C1 and M1 and is greater than 50bp supported by 2 or more reads. Oddly, I was able to find the this tandem repeat expansion with Straglr in C1, but not in M1. I wonder why? Is there any way to improve it? Do you have any suggestions?Thank you very much!

readmanchiu commented 2 years ago

Thanks @LiShuhang-gif for reporting this. Is the insertion sequence in M1 the same as C1? Is it 'CAC' as the neighboring reference sequence?

LiShuhang-gif commented 2 years ago

Yes, as shown by igV, the insertion sequence in both samples is roughly the same. 微信图片_20211029112106 In addition, I have another question, as shown in the figure below:

3

I wonder if the size in the third to last column refers to the size of the inserted repeat sequence (or size of tandem repeat expansion) or the size of the entire tandem repeat (Including the same parts as reference and the expansion parts). Thanks!

readmanchiu commented 2 years ago

The "size" (third to last) column refers to the size of the repeat observed for that specific read.

Can you post the entire output (all the reads) of M1? because I'm seeing something strange

  1. for the first event, the copy number is 3 and size 9 bp, but the allele is 289.7, this is weird; I wonder what do the rest of the reads look like in the output;
  2. for the second event, the larger allele is shown (between 3-400bp), but the insertion size in the IGV is like 1.5kb, so the tandem repeat is only less than 1/3 of the insertion. Straglr should not report events when the novel/expansion sequence contains a significant portion that is non-repeat (in this case >2/3).

To conclude, from what I see the insertion sequence likely contains some repeat, but it doesn't look like a pure repeat expansion event.
If you want to investigate further, and don't mind sharing the data, I am happy to look into it for you. I can be reached at rchiu@bcgsc.ca

LiShuhang-gif commented 2 years ago

Thank you for your reply. In fact, Figure 2 I show is the result of C1, and this range of tandem repeat is not found in M1. I will post the results of all reads of both C1 and M1 in this region in order to investigate further.

C1:

image

M1: (The tandem repeat region which I show above in C1 is not reported in M1, but can be observed in M1 through igv)

image

As shown in the figure, in the result of M1, the coordinate jumped directly from chr11:1095111 to chr11:1097413.

readmanchiu commented 2 years ago

This is a difficult region with many complex tandem repeats. At this point, I can only do a better assessment if I can take a look at the sequences. I have put my email address above if you have no problem sending me the sequences to look at.

LiShuhang-gif commented 2 years ago

Hi, sorry for my late reply. I have already sent the problematic sequence of C1 and M1 that I mentioned above to your email address. Thank you for your help. If there is any progress, please feel free to contact me at any time. I will reply as soon as possible.

readmanchiu commented 2 years ago

Hi @LiShuhang-gif I've e-mailed the results back to you. It seems to me the insertions are captured in the results, please double check. As I said, the region is populated with complex repeats. Reads seem to contain multiple insertions/expansions. The repeats reported may also not be "as expected". The commands used to generate the results:

python straglr.py c1_chr11_1096542-1096744.bam hg38.fa c1.tsv --min_str_len 2 --max_str_len 100 --min_ins_size 50 --genotype_in_size --min_support 2
python straglr.py m1_chr11_1096542-1096744.bam hg38.fa m1.tsv --min_str_len 2 --max_str_len 100 --min_ins_size 50 --genotype_in_size --min_support 2

The code used is the latest in the repository, which I haven't tagged yet for a new release because I'm still doing some testing. Let me know if you have further questions.

LiShuhang-gif commented 2 years ago

Hi, Thank you for your prompt reply. I have tried your parameter setting. Indeed, the insertions are captured in the results. After comparing our parameter setting, the main difference I found was that I have set --max_num_clusters 2. And obviously, as shown below, in the M1 sample, the insertion which I could not find according to my parameter setting, while could be found according to your parameter setting were clustered into 3 categories (as shown in column 5).

chr11   1096551 1096744 CA  2910.4(31);1748.3(3);244.5(8)   m64061_210711_093307/168888103/ccs  1459.0  2918    9914    2910.4
chr11   1096551 1096744 CA  2910.4(31);1748.3(3);244.5(8)   m64061_210711_093307/122422002/ccs  1459.0  2918    5999    2910.4
chr11   1096551 1096744 CA  2910.4(31);1748.3(3);244.5(8)   m64061_210711_093307/79824312/ccs   1458.5  2917    9093    2910.4

I wonder if it is because I have set this clustering parameter, but the results of these loci cannot be clustered into 2 categories, so they have been removed from the results of M1? Thank you very much!

readmanchiu commented 2 years ago

setting --max_num_clusters 2 only forces the 3 1.7kb reads to be grouped with the 2.9kb ones, and subsequently decreases the reported size of the big allele from 2.9 to 2.8 kb. Overall, changing the number of clusters shouldn't really affect the number of supporting reads reported but it will change the genotype as the number of alleles, a result of clustering, might have changed.

LiShuhang-gif commented 2 years ago

Ok, thank you for your timely help. I have no further questions for the time being.