ETCHING-team / ETCHING

Ultra-fast, high-performing structural variation (SV) detector
http://big.hanyang.ac.kr/ETCHING
MIT License

Which supporting read count should be used for frequency? #6

Open gudeqing opened 1 year ago

gudeqing commented 1 year ago

Hi, there seem to be three kinds of supporting read counts provided, but which one, or which combination, should be used to calculate the frequency given the depth at the breakpoint? Suppose depth=1000, CR=2, SR=5, PE=13. Is freq = (2+5+13)/1000?

##FORMAT=<ID=CR,Number=1,Type=Integer,Description="Number of clipped reads supporting the variant.">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant.">
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant.">

Best regards!

sohnjangil commented 1 year ago

Hi.

I appreciate your interest.

ETCHING does not report a supporting-evidence frequency, although it reports CR, SR, PE, and other evidence. Instead, it reports a score for each SV in the sixth column of its output VCF file. The score is calculated by machine learning models based on random forest (or XGBoost) algorithms.
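As a rough illustration of where these values sit in a record, here is a minimal Python sketch that pulls the score from the sixth column and the CR, SR, and PE counts from the FORMAT/sample columns. The function name `parse_vcf_record` and the example record are hypothetical and made up for this sketch (not real ETCHING output), and it assumes a single sample column and a numeric score.

```python
def parse_vcf_record(line):
    """Split one tab-delimited VCF data line into (score, sample FORMAT values)."""
    fields = line.rstrip("\n").split("\t")
    # Column 6 (index 5) is where the SV score is reported.
    score = None if fields[5] == "." else float(fields[5])
    # Column 9 lists the FORMAT keys; column 10 holds the first sample's values.
    keys = fields[8].split(":")
    values = fields[9].split(":")
    return score, dict(zip(keys, values))


# Hypothetical record, invented for illustration only.
example = "\t".join([
    "chr1", "10000", "BND0", "N", "N]chr2:20000]", "0.95", "PASS",
    "SVTYPE=BND", "CR:SR:PE", "2:5:13",
])

score, sample = parse_vcf_record(example)
support = {key: int(sample[key]) for key in ("CR", "SR", "PE")}
print(score, support)
```

For real files, a dedicated VCF parser is likely a safer choice than splitting lines by hand.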

If you want to define the supporting read (or paired-end) frequency as you suggested, (2+5+13)/1000, you can. However, the three numbers (CR, SR, and PE) measure different things, although they tend to be proportional to one another.

If I had to pick one essential feature, I would choose SR (the number of supporting split reads), because it directly supports the SV at base-pair resolution. PE (the number of supporting read pairs), on the other hand, may cause segment-bypassing issues for complex SVs (please see ref. [Nature 578, 112-121 (2020)]); for simple SVs, PE works well. CR is the number of clipped reads, but a clipped read carries no information about the mate breakpoint (an SV consists of two breakpoints: a breakpoint and its mate).

For these reasons, many studies use either SR or PE, some researchers use their sum (SR+PE), and others use all of them (SR, PE, and SR+PE).

The definition of supporting read frequency depends on what you want to see. However, if you normalize any of these counts by read depth, the normalized number may represent an allele frequency rather than the SV itself.
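To make the alternatives concrete, here is a small Python sketch that computes the frequency under each of the definitions mentioned above, using the numbers from the question (depth=1000, CR=2, SR=5, PE=13). The helper `support_frequency` and its `use` parameter are hypothetical names for this illustration; ETCHING itself does not provide such a function.

```python
def support_frequency(cr, sr, pe, depth, use=("SR",)):
    """Return the summed supporting counts named in `use`, divided by depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    counts = {"CR": cr, "SR": sr, "PE": pe}
    return sum(counts[name] for name in use) / depth


# Numbers taken from the question in this thread.
cr, sr, pe, depth = 2, 5, 13, 1000

freq_sr = support_frequency(cr, sr, pe, depth)                            # 5/1000
freq_sr_pe = support_frequency(cr, sr, pe, depth, use=("SR", "PE"))       # 18/1000
freq_all = support_frequency(cr, sr, pe, depth, use=("CR", "SR", "PE"))   # 20/1000

print(freq_sr, freq_sr_pe, freq_all)
```

Which of these is appropriate depends on the analysis; the sketch only shows that the choice changes the resulting number by a factor of up to four here.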

Best regards,
Jang-il Sohn