friend1ws / nanomonsv

SV detection tool for nanopore sequence reads
GNU General Public License v3.0

Question over code #35

Closed RenzoTale88 closed 1 year ago

RenzoTale88 commented 1 year ago

Hi, I was having a look at the code in the script nanomonsv/count_sread_by_alignment.py. At lines 320-323 there is the following code:

        supporting_reads = [rname for rname in supporting_reads if \
            (alignment_info_var_1[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres and \
            alignment_info_var_1[rname][0] >= alignment_info_ref_2[rname][0] + self.var_ref_margin_thres) or \
            (alignment_info_var_2[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres and \
             alignment_info_var_2[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres)]

Lines 322 and 323 seem to perform the same check twice. Is this a mistake? Should the code instead be:

        supporting_reads = [rname for rname in supporting_reads if \
            (alignment_info_var_1[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres and \
            alignment_info_var_1[rname][0] >= alignment_info_ref_2[rname][0] + self.var_ref_margin_thres) or \
            (alignment_info_var_2[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres and \
             alignment_info_var_2[rname][0] >= alignment_info_ref_2[rname][0] + self.var_ref_margin_thres)]

Otherwise, perhaps line 323 should be dropped.

Thanks Andrea

friend1ws commented 1 year ago

First, I appreciate your interest in Nanomonsv and your detailed examination of the code. I agree that the code at line 323 should be modified as you suggest. After making this change, I noticed a minor difference in the results.

Simultaneously, I'm considering whether a more comprehensive reassessment of the entire logic is necessary. Please wait a moment for my next response. Anyway, thank you very much!

RenzoTale88 commented 1 year ago

@friend1ws thanks for confirming this. Would it be possible to have an explanation of what these thresholds filter for? Thanks in advance for any help :)

friend1ws commented 1 year ago

Hi, sorry for the very late response. We have now thoroughly investigated this point for version 0.7.0.

During the validation step, we extract segments (200bp each) from around the breakpoints of structural variants (SVs) to check whether they exist in the sequence data of both the tumor and normal samples. However, some SVs include insertions. For those, we generate two variant segments, using either the point just before or just after the insertion as the starting point, and perform alignment. This information is stored in alignment_info_var_1 and alignment_info_var_2.

At the same time, we also extract segments of approximately 200bp from the reference sequence around each breakpoint (reference segments), and check whether these reference segments exist in the sequence data of both the tumor and normal samples. This information is stored in alignment_info_ref_1 and alignment_info_ref_2.
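As a rough, toy illustration of the segment construction described above (all coordinates, sequences, and the exact windowing here are made up for illustration; the real code derives these from the reference genome and the SV call):

```python
# Toy sketch of extracting ~200bp segments around the breakpoints of an SV
# with an inserted sequence at the junction. All values are made up; this is
# not the actual nanomonsv extraction code.
segment_size = 200
reference = "A" * 1000            # toy reference sequence
bp1, bp2 = 400, 600               # toy breakpoint positions
inserted_seq = "GATTACA" * 10     # toy inserted sequence at the junction

# variant segment 1: sequence leading into bp1, then the start of the insertion
variant_segment_1 = (reference[bp1 - segment_size // 2 : bp1] + inserted_seq)[:segment_size]
# variant segment 2: the end of the insertion, then sequence resuming at bp2
variant_segment_2 = (inserted_seq + reference[bp2 : bp2 + segment_size // 2])[-segment_size:]

# reference segments: ~200bp windows around each breakpoint, unchanged
ref_segment_1 = reference[bp1 - segment_size // 2 : bp1 + segment_size // 2]
ref_segment_2 = reference[bp2 - segment_size // 2 : bp2 + segment_size // 2]

print(len(ref_segment_1), len(ref_segment_2))  # 200 200
```

Each of these four segments is then aligned against the reads, producing the score tuples stored in the alignment_info_* dictionaries.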

For example, the condition

        alignment_info_var_1[rname][0] >= alignment_info_ref_1[rname][0] + self.var_ref_margin_thres

means "variant segment 1 aligned to the read with ID rname with a score higher than that of reference segment 1 by a sufficient margin (self.var_ref_margin_thres)".
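Put together, the full filter from the snippet above can be illustrated with a minimal, self-contained sketch (the scores are made up; the only assumption is that each alignment_info_* dict maps a read name to a tuple whose first element is an alignment score):

```python
# Minimal sketch of the supporting-read filter with made-up alignment scores.
# Assumption: each alignment_info_* dict maps read name -> (score, ...).
var_ref_margin_thres = 25

alignment_info_var_1 = {"read_a": (180,), "read_b": (90,)}
alignment_info_var_2 = {"read_a": (175,), "read_b": (95,)}
alignment_info_ref_1 = {"read_a": (120,), "read_b": (100,)}
alignment_info_ref_2 = {"read_a": (110,), "read_b": (100,)}

supporting_reads = ["read_a", "read_b"]
supporting_reads = [
    rname for rname in supporting_reads
    if (alignment_info_var_1[rname][0] >= alignment_info_ref_1[rname][0] + var_ref_margin_thres and
        alignment_info_var_1[rname][0] >= alignment_info_ref_2[rname][0] + var_ref_margin_thres) or
       (alignment_info_var_2[rname][0] >= alignment_info_ref_1[rname][0] + var_ref_margin_thres and
        alignment_info_var_2[rname][0] >= alignment_info_ref_2[rname][0] + var_ref_margin_thres)
]
print(supporting_reads)  # ['read_a']
```

Here read_a clears the margin against both reference segments, so it is kept as a supporting read, while read_b scores no better against the variant segments than against the references and is dropped.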

If we strictly implement the conditions described above, I believe the fix you pointed out is correct.

However, we discovered that this condition can produce higher matching rates in unrelated, distant regions around the breakpoint, particularly when the breakpoint involves repeats, which in turn lowers sensitivity. On the other hand, for smaller deletions and insertions, the condition often proved necessary.

Thus, as a final decision, in version 0.7.0 we apply this condition only for small insertions and deletions (<= 300bp) and remove it entirely for other cases. We may refine these conditions further in the future, but with this approach I believe we have achieved a slight increase in sensitivity compared to before.
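In rough pseudocode, the version 0.7.0 decision looks like the following (a hedged sketch only; the function, its name, and its parameters are illustrative, not the actual nanomonsv identifiers):

```python
# Hedged sketch of the v0.7.0 gating: the variant-vs-reference margin
# condition is applied only for small insertions and deletions (<= 300bp),
# and skipped entirely for other SV types or larger events.
# sv_type, sv_size, and this helper are illustrative names.
def should_apply_margin_filter(sv_type, sv_size, size_threshold=300):
    return sv_type in ("deletion", "insertion") and sv_size <= size_threshold

print(should_apply_margin_filter("deletion", 150))     # True: small deletion
print(should_apply_margin_filter("insertion", 500))    # False: above 300bp
print(should_apply_margin_filter("translocation", 0))  # False: not an indel
```

When the helper returns False, all candidate supporting reads bypass the margin comparison, which avoids the repeat-induced sensitivity loss described above.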

Anyway, thank you very much again!