KolmogorovLab / Severus

A tool for somatic structural variant calling using long reads
Other
92 stars 4 forks source link

Error in version 0.1.2 #8

Open proteinosome opened 7 months ago

proteinosome commented 7 months ago

Hi @fenderglass and @aysegokce , this is Khi Pin. I've been trying to implement v0.1.2 in my WDL workflow (set to go public sometimes this month), but I've been running into errors where v0.1.1 had no issue:

[2024-02-08 09:17:40] INFO: Starting Severus 0.1.1
[2024-02-08 09:17:40] INFO: Parsing reads from COLO829.tumor.aligned.hiphase.bam
[2024-02-08 09:27:12] INFO:     Total read length: 196360947730
[2024-02-08 09:27:12] INFO:     Total aligned length: 196397433408 (1.00)
[2024-02-08 09:27:12] INFO:     Read N50 / N90: 20063 / 12917
[2024-02-08 09:27:12] INFO:     Alignments N50 / N90: 19557 / 11599
[2024-02-08 09:27:12] INFO:     Read error rate (Q25 / Q50 / Q75): 0.0018 / 0.0035 / 0.0067
[2024-02-08 09:27:12] INFO:     Read mismatch rate (Q25 / Q50 / Q75): 0.0004 / 0.0010 / 0.0018
[2024-02-08 09:27:14] INFO: Parsing reads from COLO829.normal.aligned.hiphase.bam
[2024-02-08 09:37:33] INFO:     Total read length: 204693733243
[2024-02-08 09:37:33] INFO:     Total aligned length: 204676920510 (1.00)
[2024-02-08 09:37:33] INFO:     Read N50 / N90: 21131 / 15095
[2024-02-08 09:37:33] INFO:     Alignments N50 / N90: 20621 / 13732
[2024-02-08 09:37:33] INFO:     Read error rate (Q25 / Q50 / Q75): 0.0017 / 0.0033 / 0.0060
[2024-02-08 09:37:33] INFO:     Read mismatch rate (Q25 / Q50 / Q75): 0.0004 / 0.0010 / 0.0017
[2024-02-08 09:37:35] INFO: Computing read quality
[2024-02-08 09:42:05] INFO: Annotating reads
[2024-02-08 09:47:06] INFO: Computing coverage histogram
[2024-02-08 09:50:17] INFO:     Median coverage by PASS reads for COLO829.tumor.aligned.hiphase.bam (H1 / H2 / H0): 19.0 / 20.0 / 0.0
[2024-02-08 09:50:18] INFO:     Median coverage by PASS reads for COLO829.normal.aligned.hiphase.bam (H1 / H2 / H0): 31.0 / 31.0 / 0.0
[2024-02-08 09:50:18] INFO: Resolving overlaps
[2024-02-08 09:50:18] INFO: Extracting split alignments
[2024-02-08 09:50:30] INFO: Extracting clipped reads
[2024-02-08 09:50:57] INFO: Clustering unmapped insertions
[2024-02-08 09:51:05] INFO: Starting breakpoint detection
[2024-02-08 09:51:39] INFO: Starting match_long_ins
[2024-02-08 09:51:39] INFO: Starting compute_bp_coverage
[2024-02-08 10:02:18] INFO: Filtering breakpoints
Traceback (most recent call last):
  File "/home/kpin/miniconda3/envs/severus/bin/severus", line 33, in <module>
    sys.exit(load_entry_point('severus==0.1.1', 'console_scripts', 'severus')())
  File "/home/kpin/miniconda3/envs/severus/lib/python3.10/site-packages/severus/main.py", line 205, in main
    double_breaks = call_breakpoints(segments_by_read, ref_lengths, coverage_histograms, bam_files, genome_ids, control_genomes, thread_pool, args)
  File "/home/kpin/miniconda3/envs/severus/lib/python3.10/site-packages/severus/breakpoint_finder.py", line 2036, in call_breakpoints
    double_breaks = double_breaks_filter(double_breaks, args.bp_min_support, cont_id)
  File "/home/kpin/miniconda3/envs/severus/lib/python3.10/site-packages/severus/breakpoint_finder.py", line 598, in double_breaks_filter
    dbs2.bp_1.spanning_reads[genome_id][ind_phased[0]] += db0.bp_1.spanning_reads[genome_id][0]
IndexError: list index out of range

real    45m39.039s
user    199m59.976s
sys 11m8.395s

Do you know why that is happening? The BAM files were aligned with pbmm2 and phased with hiphase using variants from clair3.

Thank you!

aysegokce commented 7 months ago

Hello @proteinosome,

That part is updated in the new version. Can you share how you haplotag the bams?

Ayse

proteinosome commented 7 months ago

Hi @aysegokce , Clair3 was used to call germline variants on the tumor, then used as an input to HiPhase to haplotag the BAMs.

aysegokce commented 6 months ago

Hi @proteinosome, There is a compatibility issue with HiPhase, primarily due to the difference in supplementary alignment phasing. We'll include it in the next releases. Ayse

willhooper commented 6 months ago

Hi, I'm seeing a similar issue with v01.2. My CRAMs were aligned with minimap2, germline variants called with clair3, and phasing/haplotagging was done with longphase. v0.1.1 works fine:

[2024-02-14 20:55:10] INFO: Starting Severus 0.1.1
[2024-02-14 20:55:10] INFO: Parsing reads from COLO829-T.10X.haplotagged.cram
[2024-02-14 21:01:17] INFO:     Total read length: 28095583258
[2024-02-14 21:01:17] INFO:     Total aligned length: 27992370914 (1.00)
[2024-02-14 21:01:17] INFO:     Read N50 / N90: 12194 / 6901
[2024-02-14 21:01:17] INFO:     Alignments N50 / N90: 10909 / 5562
[2024-02-14 21:01:17] INFO:     Read error rate (Q25 / Q50 / Q75): 0.0080 / 0.0121 / 0.0229
[2024-02-14 21:01:17] INFO:     Read mismatch rate (Q25 / Q50 / Q75): 0.0027 / 0.0043 / 0.0080
[2024-02-14 21:01:17] INFO: Parsing reads from COLO829-N.5X.haplotagged.cram
[2024-02-14 21:04:10] INFO:     Total read length: 8622146850
[2024-02-14 21:04:10] INFO:     Total aligned length: 8576689196 (0.99)
[2024-02-14 21:04:10] INFO:     Read N50 / N90: 10730 / 6269
[2024-02-14 21:04:10] INFO:     Alignments N50 / N90: 9425 / 5052
[2024-02-14 21:04:10] INFO:     Read error rate (Q25 / Q50 / Q75): 0.0079 / 0.0120 / 0.0227
[2024-02-14 21:04:10] INFO:     Read mismatch rate (Q25 / Q50 / Q75): 0.0027 / 0.0043 / 0.0079
[2024-02-14 21:04:10] INFO: Computing read quality
[2024-02-14 21:06:29] INFO: Annotating reads
[2024-02-14 21:08:58] INFO: Computing coverage histogram
[2024-02-14 21:09:43] INFO:     Median coverage by PASS reads for COLO829-T.10X.haplotagged.cram (H1 / H2 / H0): 2.0 / 1.0 / 1.0
[2024-02-14 21:09:45] INFO:     Median coverage by PASS reads for COLO829-N.5X.haplotagged.cram (H1 / H2 / H0): 1.0 / 1.0 / 0.0
[2024-02-14 21:09:45] INFO: Resolving overlaps
[2024-02-14 21:09:45] INFO: Extracting split alignments
[2024-02-14 21:09:50] INFO: Extracting clipped reads
[2024-02-14 21:10:04] INFO: Clustering unmapped insertions
[2024-02-14 21:10:08] INFO: Starting breakpoint detection
[2024-02-14 21:12:04] INFO: Starting match_long_ins
[2024-02-14 21:12:04] INFO: Starting compute_bp_coverage
[2024-02-14 21:16:50] INFO: Filtering breakpoints
Traceback (most recent call last):
  File "//Severus/severus.py", line 30, in <module>
    main()
  File "//Severus/severus.py", line 26, in main
    sys.exit(main())
  File "/Severus/severus/main.py", line 205, in main
    double_breaks = call_breakpoints(segments_by_read, ref_lengths, coverage_histograms, bam_files, genome_ids, control_genomes, thread_pool, args)
  File "/Severus/severus/breakpoint_finder.py", line 2036, in call_breakpoints
    double_breaks = double_breaks_filter(double_breaks, args.bp_min_support, cont_id)
  File "/Severus/severus/breakpoint_finder.py", line 601, in double_breaks_filter
    to_keep.remove(ind0)
ValueError: list.remove(x): x not in list
linsindian commented 6 months ago

Hello,

I am Sin-Dian, a member of the Longphase team.Thank you for your contributions to somatic SV calling. Recently, while testing the COLO829 data using your software, we encountered a situation similar to the one described above. According to the error report, this issue seems to originate from lines 590 to 601 under the double_breaks_filter function. From my rough understanding of this segment, it primarily deals with a special case of unphased alignment on one side of a double break. Due to Longphase considering supplementary alignments during phasing, it's possible for the primary and supplementary alignment to appear on different haplotypes after haplotagging. This consequently increases the likelihood of entering this function.I've noticed a few potential errors as follows:

  1. In the pairs stored in val, 'a' represents the phase tag, and 'b' represents the corresponding index in the to_keep list. It seems to aim to handle unphased break points, so should it be written as:
    
    #original:
    ind_unphased = [b for (a,b) in val if a]
    ind_phased = [b for (a,b) in val if  not a]

modified:

ind_unphased = [b for (a,b) in val if not a] ind_phased = [b for (a,b) in val if a]


2. In line 589, in certain situations, an "index out of range" error may occur. After comparing with the surrounding code, it seems that it should be modified to:

original:

dbs2.bp_1.spanning_reads[genome_id][ind_phased[0]] += db0.bp_1.spanning_reads[genome_id][0]

modified:

dbs2.bp_1.spanning_reads[genome_id][haplotype2[ind_phased[0]]] += db0.bp_1.spanning_reads[genome_id][0]


3. In line 601, a `list.remove(x): x not in list` error may occur. Since ind0 is a list and list.remove() cannot directly remove the list, and ind0 will only have one element, should it be written as to_keep.remove(ind0[0])?

original:

ind0 = [i for i, db in enumerate(dbs) if db == db0] to_keep.remove(ind0)

modified:

ind0 = [i for i, db in enumerate(dbs) if db == db0] to_keep.remove(ind0[0])


>I think `to_keep.remove(ind0[0])` and `to_keep.remove(ind_unphase[0])` can achieve the same result. If changed to the latter, there's no need to generate `ind0` separately.

Here's the code after summarizing the above suggestions:

ind_unphased = [b for (a,b) in val if not a] ind_phased = [b for (a,b) in val if a] db0 = dbs[ind_unphased[0]] dbs2 = dbs[ind_phased[0]] dbs2.supp_read_ids += db0.supp_read_ids dbs2.supp = len(set(dbs2.supp_read_ids)) db0.supp = 0 span_bp1 = db0.bp_1.spanning_reads[genome_id][0] dbs2.bp_1.spanning_reads[genome_id][haplotype2[ind_phased[0]]] += span_bp1 db0.bp_1.spanning_reads[genome_id][0] = 0 ind0 = [i for i, db in enumerate(dbs) if db == db0] to_keep.remove(ind_unphased[0])



Hope these suggestions can be helpful.

Best regards,
Sin-Dian
aysegokce commented 6 months ago

Hello @linsindian, Thank you for using Severus and your suggestions! We are working on improving SV phasing and extending its use with other phasing tools. It will be a part of the next release! Ayse