Illumina / ExpansionHunter

A tool for estimating repeat sizes

Is it possible to count the reads of breakends of STR with EH outputs? #150

Open a7420174 opened 2 years ago

a7420174 commented 2 years ago

Hi, I'm running EH and doing QC on the genotypes. I checked the VCF filter "LowDepth", but I don't think LowDepth alone is enough for filtering. I found that REViewer works very well for visualizing STR regions, and some STR regions (possibly false positives) show biased read counts at the two breakends, as in the figure below.

[Figure: NA21144 chr7_155099360_155099444]

So I think checking the read counts at the breakends would be a useful QC step, but counting those reads does not seem easy. Can you give some advice?

Thanks, JaeHyun

egor-dolzhenko commented 2 years ago

Hi JaeHyun,

Glad to hear that REViewer is working well for you.

We just started to implement QC metrics in REViewer. The idea is that REViewer will generate a text file with quality metrics for each analyzed repeat in addition to the read pileup plot. We just added the first quality metric, allele depth, which reports the sequencing depth of each repeat allele. This pull request contains a small example: https://github.com/Illumina/REViewer/pull/32

Does allele depth sound like a useful metric for your analysis? If yes, I can create a binary with the new version of REViewer that reports this metric for you to test.

Best wishes, Egor

a7420174 commented 2 years ago

Hi Egor,

Thanks for your reply. It would help a lot if you could create that binary for me. I also looked at the pull request and have a question.

VariantId       AlleleDepths
BEAN1_chr16:66490398-66490458   0.00/0.40
BEAN1_chr16:66490399-66490453   18.93/20.24

Does this show one STR rather than two? And does 0.00/0.40 mean the allelic depths at the two breakends of one allele?

egor-dolzhenko commented 2 years ago

That's right, this example shows two STRs. The notation 0.00/0.40 means that the sequencing depth of one allele (both breakpoints) is 0 while the other allele's depth is 0.40. We'll try to implement a reasonable output file format and add some documentation tomorrow.
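The notation described above can be read programmatically. Below is a minimal Python sketch that parses the two-column `VariantId` / `AlleleDepths` text shown in the pull request example. It assumes whitespace-separated columns and `/`-separated per-allele depths; the output format was still being designed at this point in the thread, so the layout may differ in a released version. The helper names and the `min_depth` threshold of 10 are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def parse_allele_depths(text: str) -> Dict[str, Tuple[float, ...]]:
    """Map each VariantId to a tuple of per-allele depths."""
    depths = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a two-column record.
        if len(fields) != 2 or fields[0] == "VariantId":
            continue
        variant_id, allele_depths = fields
        depths[variant_id] = tuple(float(d) for d in allele_depths.split("/"))
    return depths


def low_depth_variants(
    depths: Dict[str, Tuple[float, ...]], min_depth: float = 10.0
) -> List[str]:
    """Return variants where at least one allele falls below min_depth.

    min_depth is an illustrative cutoff, not a recommended value.
    """
    return [v for v, d in depths.items() if min(d) < min_depth]


# Example text copied from the thread.
example = """\
VariantId       AlleleDepths
BEAN1_chr16:66490398-66490458   0.00/0.40
BEAN1_chr16:66490399-66490453   18.93/20.24
"""

parsed = parse_allele_depths(example)
flagged = low_depth_variants(parsed)
```

With the example above, only the first variant (depths 0.00 and 0.40) is flagged; the second has both alleles well above the illustrative cutoff.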

a7420174 commented 2 years ago

Oh, I see. I misunderstood. Then, besides the per-allele sequencing depth, is it possible to report the depth at both the start point and the end point of an STR? I want to find STRs that show a big difference between the read counts at the two points. Sorry for the confusion.

Thanks, JaeHyun
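One way to picture the breakend comparison requested here is sketched below in Python. It is not REViewer's or ExpansionHunter's method: it assumes reads are already available as hypothetical `(start, end)` half-open reference intervals, counts a read toward a breakend only if it extends at least `min_flank` bases on both sides of it, and summarizes the difference as a simple imbalance ratio. The function names, the `min_flank` default, and the toy coordinates are all illustrative assumptions.

```python
from typing import Iterable, Tuple


def count_breakend_reads(
    reads: Iterable[Tuple[int, int]],
    repeat_start: int,
    repeat_end: int,
    min_flank: int = 5,
) -> Tuple[int, int]:
    """Count reads spanning the left and right breakends of a repeat.

    A read spans a breakend at position p when it covers at least
    min_flank bases on each side of p.
    """
    left = right = 0
    for start, end in reads:
        if start <= repeat_start - min_flank and end >= repeat_start + min_flank:
            left += 1
        if start <= repeat_end - min_flank and end >= repeat_end + min_flank:
            right += 1
    return left, right


def breakend_imbalance(left: int, right: int) -> float:
    """Return |left - right| / (left + right), or 0.0 when there are no reads."""
    total = left + right
    return abs(left - right) / total if total else 0.0


# Toy reads around a repeat occupying [100, 150).
toy_reads = [(60, 110), (80, 130), (90, 140), (120, 170), (96, 146)]
left, right = count_breakend_reads(toy_reads, repeat_start=100, repeat_end=150)
imbalance = breakend_imbalance(left, right)
```

In this toy case three reads span the left breakend and one spans the right, giving an imbalance of 0.5. For long repeats, coordinates taken directly from a linear-reference BAM may not reflect how reads are placed against the repeat, so this should be treated as a rough screen at best.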

bharatij commented 2 years ago

Hi Egor, I am quite interested in the QC metrics that you are currently implementing in REViewer. They could be helpful for the EH analysis in one of my projects. Would it be possible to share the binaries of this new version? Thanks, Bharati

bharatij commented 2 years ago

Hi Egor, one more thing. REViewer also estimates the fragment length for the locus while creating the plot. Is this fragment length a rough estimate of the insert size, or of the insert size plus adapters? If so, would it be possible to add the estimated fragment length to the QC metrics file? Thanks, Bharati

egor-dolzhenko commented 2 years ago

Hi Bharati, JaeHyun,

JaeHyun: Sure, we can implement such a metric.

Bharati: Sure. I will create a REViewer binary that outputs the allele depth metric and will also start working on adding a fragment length estimate. (REViewer calculates fragment length as the length between the first aligned position of the first/upstream mate and the last aligned position of the second/downstream mate.)

Would you mind creating the corresponding issue in the REViewer repository? We could use it to collect feedback on metrics and share binaries.
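The fragment length definition given above (first aligned position of the upstream mate to last aligned position of the downstream mate) can be illustrated with a short Python sketch. It assumes each mate is represented as a hypothetical `(start, end)` pair in 0-based, half-open reference coordinates; REViewer's internal coordinate convention is not stated in the thread, so an off-by-one difference is possible.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open; an assumed convention


def fragment_length(mate_a: Interval, mate_b: Interval) -> int:
    """Length from the upstream mate's first aligned base to the
    downstream mate's last aligned base."""
    # Sorting by start identifies the upstream mate regardless of input order.
    upstream, downstream = sorted((mate_a, mate_b))
    return downstream[1] - upstream[0]


# Toy pair: mates of 150 bp separated by a 150 bp gap.
length = fragment_length((1000, 1150), (1300, 1450))
```

For the toy pair the result is 450, and swapping the argument order gives the same value.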

a7420174 commented 2 years ago

Sure, I'll make an issue.