Illumina / ExpansionHunter

A tool for estimating repeat sizes

About locus length #115

Open kumara3 opened 3 years ago

kumara3 commented 3 years ago

Hello,

I have a question about the repeat locus coordinates. When I extend the repeat coordinates by even 2 bp upstream and downstream relative to the coordinates given in your variant catalog, I see a change of around 20-30 repeat units for some samples. Could you please help me interpret these results?

Regards,

egor-dolzhenko commented 3 years ago

Thank you for the question. The program can be sensitive to the accuracy of repeat coordinates. This is especially true when the sequence surrounding the repeat is very similar to the repeat itself. Could you please share coordinates of any such repeat?

kumara3 commented 3 years ago

Hello,

Thank you for your reply.

Coordinate: 18:53253386-53253458 (CAG repeat; this is from the variant catalog file in the EH GitHub repository)

Changed coordinate: 18:53253384-53253460 (this is from the UCSC Table Browser)

Around the CAG repeat there is another repeat, GGA. In such cases, which one should be considered the true repeat size?

Regards, Ashwani

egor-dolzhenko commented 3 years ago

Thanks for providing an example. In this case, the repeat definition from the EH catalog is the better one to use. Note that the region 18:53253386-53253458 corresponds to the sequence

CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG

which is a perfect repetition of CAGs. On the other hand, the sequence corresponding to 18:53253384-53253460 is

AGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA

which starts with "AG" and ends with "CA". This repeat is defined as "(CAG)*" in the EH catalog, and the former sequence fits this definition much better than the latter. The less exact the repeat definition, the harder it is for EH to genotype the repeat.
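The comparison above can be illustrated with a small sketch. It is not how EH scores alignments; it only checks how cleanly a reference sequence decomposes into whole copies of a motif. The function name `describe_motif_fit` and the shortened stand-in sequences (5 CAG units instead of the full reference sequences) are assumptions made for illustration.

```python
def describe_motif_fit(sequence, motif):
    """Report how a sequence decomposes into whole, consecutive motif copies.

    Illustrative only: this is NOT ExpansionHunter's algorithm. It tries every
    possible frame offset and keeps the one giving the longest run of units.
    """
    best_units, best_offset = 0, 0
    for offset in range(len(motif)):
        units, pos = 0, offset
        # Count consecutive exact motif copies starting at this offset.
        while sequence.startswith(motif, pos):
            units += 1
            pos += len(motif)
        if units > best_units:
            best_units, best_offset = units, offset
    end = best_offset + best_units * len(motif)
    return {
        "units": best_units,
        "leading": sequence[:best_offset],   # bases before the first whole unit
        "trailing": sequence[end:],          # bases after the last whole unit
        "perfect": best_units > 0 and best_offset == 0 and end == len(sequence),
    }


# Shortened stand-ins for the two reference sequences discussed in the thread.
catalog_like = "CAG" * 5
shifted_like = "AG" + "CAG" * 5 + "CA"

print(describe_motif_fit(catalog_like, "CAG"))
print(describe_motif_fit(shifted_like, "CAG"))
```

For the shifted stand-in, the leftover "AG" and "CA" at the ends are the bases that do not fit the "(CAG)*" definition.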

Did I answer your question? Please let me know if you have any follow up questions.

egor-dolzhenko commented 3 years ago

Also, very good point about the additional repeat! We will investigate if incorporating this repeat into the catalog improves accuracy. (FYI @yjqiu, this satellite repeat might be relevant for your work).

We are working on tools for annotating coordinates of novel repeats and improving existing annotations. This work is taking a little longer than expected, but we hope to release something next year. Finally, we just released a new tool for visualizing reads supporting EH genotype calls: https://github.com/Illumina/REViewer. Perhaps it could be useful for your work.

kumara3 commented 3 years ago

Hello,

Your explanation makes sense. However, in my case, when I run EH with the other coordinates, I get results that match the PCR results more closely than those from the variant catalog coordinates.

a) Also, only the start of the repeat locus differs by 2 bases. Does the algorithm apply a penalty if the start of the repeat in the sample does not exactly match the repeat definition?

b) The other thing is that EH uses in-repeat and spanning reads to estimate the repeat size. Shouldn't a change of 2 bp leave the number of in-repeat and spanning reads, and hence the estimated number of repeat units, unchanged?

Please let me know your thoughts.

Regards, Ashwani

egor-dolzhenko commented 3 years ago

Yes, modifying the reference coordinates of a repeat would result in less accurate, lower-scoring alignments. To help me look into this further, would you be able to generate visualizations for the modified and unmodified repeats and share them with me? You are welcome to send the plots by email.

You can download a Linux binary of the visualization tool from here: https://github.com/Illumina/REViewer/releases. Please note that before running the tool, the BAMlets generated by EH need to be sorted and indexed.
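The sorting and indexing step mentioned above could be scripted roughly as follows. This is a minimal sketch that assumes `samtools` is installed and on the PATH; the helper names and the input file name `sample_realigned.bam` are hypothetical, so substitute the BAMlet path that your EH run actually produced.

```python
import shutil
import subprocess
from pathlib import Path


def sort_and_index_commands(bamlet):
    """Build the samtools commands that sort and then index a BAMlet.

    The "_sorted.bam" output naming is an assumption made for this sketch.
    """
    bamlet = Path(bamlet)
    sorted_bam = bamlet.with_name(bamlet.stem + "_sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bamlet)],
        ["samtools", "index", str(sorted_bam)],
    ]


def sort_and_index(bamlet):
    """Run the commands in order, stopping at the first failure."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for command in sort_and_index_commands(bamlet):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file name; print the commands rather than running them.
    for command in sort_and_index_commands("sample_realigned.bam"):
        print(" ".join(command))
```

Calling `sort_and_index(...)` on a real BAMlet would then leave a sorted BAM and its index next to the input file, ready to pass to the visualization tool.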

kumara3 commented 3 years ago

Hello,

Thank you for your reply. I will get back on this.

Regards, Ashwani

egor-dolzhenko commented 3 years ago

This sounds good!