DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 112 forks source link

Integer overflow(?) in TLENs when insert should be reasonable #365

Closed IanSudbery closed 2 years ago

IanSudbery commented 2 years ago

Hi,

Thanks for the software! I'm encountering a very strange error. I am getting alignments where the TLEN is set to 4,294,967,078 which is (2^32) + 218. The alignments are:

A00917:718:HK73HDSX2:4:2477:2112:27414  83      lcl|Ctg0232     199488  60      10S46M  =       199320  4294967078      CTCGTCTTCGTTGCTGCTGATGCTGTTTTATGGTGACTGCGTGTGGCGTACTCTCT        F,F,:F,:F,::FFFFF:F:FFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF        AS:i:-10        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:46 YS:i:-4YT:Z:CP  NH:i:1
A00917:718:HK73HDSX2:4:2477:2112:27414  163     lcl|Ctg0232     199320  60      52M     =       199488  -4294967078     GAGCCCTGTTAGATTGTCTGTAATTATCTTGTATGAAGGAGTGTTTTGGGGG    :FFFFF,FF,F:FF,FF,::,F,FFFF:,FFFF:,:::FFF::FFF:F,F,:    AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:41G10      YS:i:-10        YT:Z:CPNH:i:1

I'd calculate the TLEN here to be (199488+46-199320)=214, very close to 218!! This is what makes me think there is an integer overflow, but I can't understand why there would be one. I first spotted it because samtools wouldn't import the file to BAM.

Any idea what might be causing this?

The alignment was perform with:

          hisat2 \
            --threads 8 \
            -x /shared/sudlab1/General/mirror/genomes/hisat/hassv2 \
            -1 ./VarHass-young-3.fastq.1.gz \
            -2 ./VarHass-young-3.fastq.2.gz \
            --known-splicesite-infile geneset.dir/refcoding.junctions \
            > /fastdata/mb1ims/tmp/ctmpx_wes84zhisat/VarHass-young-3.hisat2.bam \
            --novel-splicesite-outfile hisat.dir/VarHass-young-3.hisat.bam_novel_junctions

The log was:

Warning: Unsupported file format
20003669 reads; of these:
  20003669 (100.00%) were paired; of these:
    2605429 (13.02%) aligned concordantly 0 times
    15835946 (79.17%) aligned concordantly exactly 1 time
    1562294 (7.81%) aligned concordantly >1 times
    ----
    2605429 pairs aligned concordantly 0 times; of these:
      224693 (8.62%) aligned discordantly 1 time
    ----
    2380736 pairs aligned 0 times concordantly or discordantly; of these:
      4761472 mates make up the pairs; of these:
        3293217 (69.16%) aligned 0 times
        1279304 (26.87%) aligned exactly 1 time
        188951 (3.97%) aligned >1 times
91.77% overall alignment rate

The junctions file looks like:

lcl|Ctg0001     1017579 1017103 +
lcl|Ctg0001     1043215 1043052 -
lcl|Ctg0001     1044521 1043880 -
lcl|Ctg0001     1044521 1044765 -
lcl|Ctg0001     1044825 1044765 -
lcl|Ctg0001     1044825 1047380 -
lcl|Ctg0001     1047762 1047380 -
lcl|Ctg0001     1047762 1048594 -
lcl|Ctg0001     1049537 1048594 -
lcl|Ctg0001     1049537 1049751 -
lcl|Ctg0001     1049820 1049751 -
lcl|Ctg0001     1050518 1043236 -
lcl|Ctg0001     105240  105103  +
lcl|Ctg0001     105240  107951  +
lcl|Ctg0001     1074378 1073542 +
lcl|Ctg0001     1075968 1075366 +
lcl|Ctg0001     108078  107951  +
lcl|Ctg0001     108078  109293  +
lcl|Ctg0001     1091812 1090774 +
lcl|Ctg0001     10923   10846   -
lcl|Ctg0001     10923   14265   -
lcl|Ctg0001     1093263 1092128 +
lcl|Ctg0001     109431  109293  +
lcl|Ctg0001     109431  110983  +
lcl|Ctg0001     1095838 1094875 -
lcl|Ctg0001     1096022 1095918 -
lcl|Ctg0001     1096022 1096117 -
lcl|Ctg0001     1096179 1096117 -

The version information is:

$ hisat2 --version
/shared/sudlab_test03/General/test_env/bin/hisat2-align-s version 2.2.1
64-bit
Built on fv-az198-998
Sat Mar 27 17:02:15 UTC 2021
Compiler: collect2: error: ld returned 1 exit status
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY -std=c++11
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

It was installed via bioconda.

parkchanhee commented 2 years ago

@IanSudbery Each line of the splice sites file consists of chromosome name, genomic position of the flanking base on the left side of an intron, genomic position of the flanking base on the right, and strand(+, -, .)

So the left field must be smaller than the right. If the junction site is on the reverse strand, the strand field can be set to '-'.

IanSudbery commented 2 years ago

Thanks for that spot! Its not just the stranding thats wrong, the whole file is a nonsense. Sorry for the bother, I should have spotted that.