eldariont / svim

Structural Variant Identification Method using Long Reads
GNU General Public License v3.0

ERROR when using assembly-vs-assembly BAM as input #34

Closed biozzq closed 4 years ago

biozzq commented 4 years ago

Dear all,

Based on the literature, I decided to align with minimap2 and then call SVs with both Sniffles and SVIM. The variants detected by both callers will be retained. However, when running SVIM, I found that BAM files from chromosome-level assembly vs. chromosome-level assembly alignments always give me an error like the following. I hope you can help me, thank you.

2020-05-02 03:19:04,661 [ERROR  ]  value too large to convert to uint32_t
Traceback (most recent call last):
  File "svim", line 165, in <module>
    sys.exit(main())
  File "svim", line 87, in main
    sv_signatures = analyze_alignment_file_coordsorted(aln_file, options)
  File "SVIM_COLLECT.py", line 138, in analyze_alignment_file_coordsorted
    supplementary_alignments = retrieve_other_alignments(current_alignment, bam)
  File "SVIM_COLLECT.py", line 82, in retrieve_other_alignments
    a.cigarstring = cigar
  File "pysam/libcalignedsegment.pyx", line 1296, in pysam.libcalignedsegment.AlignedSegment.cigarstring.__set__
  File "pysam/libcalignedsegment.pyx", line 2217, in pysam.libcalignedsegment.AlignedSegment.cigartuples.__set__
OverflowError: value too large to convert to uint32_t

Sincerely, Zheng Zhuqing

eldariont commented 4 years ago

Hi Zheng,

Thank you for reporting this issue, and sorry for the long delay. I was out of the office for a couple of days.

The error you encounter happens in the code of pysam, a library that SVIM uses for SAM/BAM file parsing. The problem seems to be that one element of a CIGAR string (the length of one operation to be precise) in your BAM file exceeds the maximum value of uint32_t, i.e. it is larger than 4 billion. In a human genome (which is shorter than 4Gb) this usually does not happen. Do you have an alignment in your BAM file that is larger than 4Gb?
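One way to look for the offending CIGAR operation is to scan the CIGAR strings, including those inside `SA` tags, for unusually long operations. The sketch below is a hedged illustration and not part of SVIM or pysam; the function names are hypothetical. It also rests on an assumption taken from the BAM format rather than from this thread: each CIGAR operation is packed into one 32-bit integer as `op_len << 4 | op`, which would leave only 28 bits for the length. If that holds, the overflow could already occur for a single operation of about 268 million bases, well below 4 billion. This has not been verified against the BAM file in question.

```python
import re

# Assumption (BAM format, not confirmed in this thread): one CIGAR operation
# is stored as a 32-bit integer, op_len << 4 | op_code, so 28 bits hold the length.
BAM_CIGAR_SHIFT = 4
MAX_CIGAR_OP_LEN = (1 << (32 - BAM_CIGAR_SHIFT)) - 1  # 268,435,455

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def oversized_ops(cigar, limit=MAX_CIGAR_OP_LEN):
    """Return (length, op) pairs in a CIGAR string whose length exceeds limit."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if int(n) > limit]


def oversized_ops_in_sa_tag(sa_tag, limit=MAX_CIGAR_OP_LEN):
    """Check the CIGAR field of every entry in an SA tag value.

    SA entries have the form 'rname,pos,strand,CIGAR,mapQ,NM' and are
    separated by ';'.
    """
    hits = []
    for entry in sa_tag.split(";"):
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        fields = entry.split(",")
        rname, pos, cigar = fields[0], fields[1], fields[3]
        for length, op in oversized_ops(cigar, limit):
            hits.append((rname, int(pos), length, op))
    return hits
```

With pysam, the inputs could come from `alignment.cigarstring` and, where `alignment.has_tag("SA")` is true, from `alignment.get_tag("SA")` while iterating over a `pysam.AlignmentFile`. Since the traceback points at `retrieve_other_alignments`, the CIGAR strings in the `SA` tags are probably the first place to look.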

You wrote that you are comparing chromosome-level assemblies. For this purpose, I created a fork of SVIM that you can check out here: https://github.com/eldariont/svim-asm. SVIM-asm uses the same method as SVIM but is optimized for SV calling in assemblies. That means it also uses pysam and will likely produce the same error as above for veeeeery long alignments. But once you fix that problem it will produce better results than SVIM for assemblies.

Cheers, David

biozzq commented 4 years ago

Dear David,

Thank you for your response. I worked around this error by splitting the chromosome assemblies at gaps and then aligning only the resulting contigs rather than scaffolds or chromosomes. This also prevents false positives that arise when the number of Ns in a scaffolded sequence does not exactly match the corresponding distance in the reference. I think this is the best solution at the moment.
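The thread does not say which tool was used for the splitting. As a minimal sketch of the idea, assuming gaps are represented as runs of `N` characters, a scaffold could be cut into contigs as follows. The function name, the `min_gap` parameter, and the contig naming scheme are hypothetical.

```python
import re


def split_at_gaps(name, sequence, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap N bases.

    Returns (contig_name, start, end, contig_sequence) tuples, where start
    and end are 0-based, half-open coordinates on the original scaffold.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    spans = []
    start = 0
    for match in gap.finditer(sequence):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()  # the next contig begins after the gap
    if start < len(sequence):
        spans.append((start, len(sequence)))
    return [
        ("%s_contig%d" % (name, i), s, e, sequence[s:e])
        for i, (s, e) in enumerate(spans, 1)
    ]


if __name__ == "__main__":
    for contig in split_at_gaps("chr1", "ACGTNNNNACNGT", min_gap=2):
        print(contig)
```

Keeping the original scaffold coordinates with each contig makes it possible to map calls back to chromosome positions afterwards. A larger `min_gap` leaves isolated ambiguous bases inside contigs instead of splitting on them.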

Best regards, Zheng Zhuqing

eldariont commented 4 years ago

Dear Zheng,

I see, that makes a lot of sense. It's definitely better to use contigs rather than scaffolds. I'm still not sure why you got this OverflowError but if the issue is fixed for you now it's maybe not that important.

If you want to compare SVIM with Sniffles on contig alignments, I would still recommend using SVIM-asm instead of SVIM, because SVIM was designed for reads.

Cheers, David