Which supporting reads number should be used for frequency

Hi.

I appreciate your interest.

ETCHING does not report a supporting-evidence frequency, although it does report CR, SR, PE, and other evidence counts. Instead, it reports a score for each SV in the sixth column of its output VCF file. The score is calculated by machine learning models that implement random forest (or XGBoost) algorithms.
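As a rough illustration of reading that score, here is a minimal Python sketch that pulls the sixth column (QUAL in the standard VCF layout) from each record. The function name `read_sv_scores` and the example record are hypothetical and made up for this illustration; they are not real ETCHING output.

```python
def read_sv_scores(vcf_lines):
    """Yield (chrom, pos, sv_id, score) for each VCF record.

    The score is taken from the sixth tab-separated column (QUAL).
    A missing value (".") is returned as None.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip empty lines and header lines (## meta lines and the #CHROM line).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, sv_id = fields[0], int(fields[1]), fields[2]
        qual = fields[5]
        score = None if qual == "." else float(qual)
        yield chrom, pos, sv_id, score


# Hypothetical record, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\tBND_1\tN\tN[chr2:67890[\t0.87\tPASS\t.",
]
records = list(read_sv_scores(example))
```

For a real file you could pass an open file handle instead of the `example` list, since the function only iterates over lines.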

If you want to define the supporting read (or paired-end) frequency as you suggested, (2+5+13)/1000, you can do so. However, the three numbers (CR, SR, and PE) measure different things, though they are mutually proportional. If I had to select one essential feature, I would choose SR (the number of supporting split reads), because it directly supports the SV at base-pair resolution. PE (the number of supporting paired-end reads), on the other hand, may cause segment-bypassing issues for complex SVs (please see ref. [Nature 578, 112-121 (2020)]). For simple (non-complex) SVs, PE works well. CR is the number of clipped reads, but it carries no information about the mate breakpoint. (An SV consists of two breakpoints: a breakpoint and its mate.) For these reasons, many studies use either SR or PE, some researchers use their sum (SR+PE), and others use all of them (SR, PE, and SR+PE).
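To compare these alternatives side by side, here is a small Python sketch that normalizes each choice of evidence count by read depth. The function name `supporting_read_frequencies` is hypothetical, and assigning 2, 5, and 13 to CR, SR, and PE in that order is only an assumption made for illustration, as is the depth of 1000.

```python
def supporting_read_frequencies(cr, sr, pe, depth):
    """Return depth-normalized frequencies for several evidence definitions.

    cr, sr, pe: clipped-read, split-read, and paired-end supporting counts.
    depth: read depth used as the denominator.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return {
        "SR": sr / depth,                    # split reads only
        "PE": pe / depth,                    # paired-end reads only
        "SR+PE": (sr + pe) / depth,          # sum of split and paired-end reads
        "CR+SR+PE": (cr + sr + pe) / depth,  # all three, as in (2+5+13)/1000
    }


# Illustrative values only; the mapping of 2, 5, 13 to CR, SR, PE is assumed.
freqs = supporting_read_frequencies(cr=2, sr=5, pe=13, depth=1000)
for name, value in freqs.items():
    print(f"{name}: {value:.3f}")
```

Which of these definitions is appropriate depends on the question being asked; the sketch only makes the arithmetic explicit.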

The definition of supporting read frequency depends on what you want to see. However, if you normalize any of these counts by read depth, the normalized number may represent the allele frequency rather than the SV itself.

Best regards,
Jang-il Sohn
