DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
478 stars 119 forks source link

fix hisat2 performance: remove unused SpliceSiteDB from AlnSink #439

Closed yttria-aniseia closed 3 months ago

yttria-aniseia commented 3 months ago

improve perf scaling on large inputs (>1M reads) by not updating an unused sorted db for every align output

as far as i can tell, the SpliceSiteDB on AlnSink is written but never read from. a similar record keeping structure exists on AlnSinkWrap, which may be in use.

SpliceSiteDB is updated for every aligned read reported to SAM (an output which is always generated?). because SpliceSiteDB is backed by a red-black tree, every update is a log(n) tree traversal with a possible rearrangement... for large numbers of reads, maintaining the structure starts to show an obvious cost.

hisat2 version 2.2.1 from bioconda for subsets of SRR12815479 on VectorBase-68_AaegyptiLVP_AGWG reference, on 8 cores, i observed the following times:

1,250 reads finished in 3.59s (0.3kread/s) 25,000 reads finished in 4.173s (5.9kread/s) 250,000 reads finished in 12.968s (19.2kread/s) 1,000,000 reads finished in 44s (22.6kread/s) 3,000,000 reads finished in 237s (12.6kread/s) 12,000,000 reads ran for over 3480+s (<3.5kread/s) before i gave up.

after this patch, i observe: 25,000 reads in 3.26s (7.7kread/s) ( 1,000,000 reads in 28.74s (34.8kread/s) (1.5x speedup) 3,000,000 reads in 89.4s (33.5kread/s) (2.65x speedup) 12,000,000 reads in 313s (38.3kread/s) (>10x speedup)

yttria-aniseia commented 3 months ago

hm, no, this is wrong... hisat2 does use AlnSink's SpliceSiteDB, it's output to the novelSpliceSiteOutfile when provided. but the structure itself is maintained even without novelSpliceSiteOutfile and knownSpliceSiteInfile aren't provided, because the default behavior is to use earlier splice site information for later reads. this is the --no-temp-splicesite parameter again. this is still a performance bug, but the patch is invalid.