Closed yttria-aniseia closed 3 months ago
hm, no, this is wrong... hisat2 does use AlnSink's SpliceSiteDB, it's output to the novelSpliceSiteOutfile
when provided.
but the structure itself is maintained even without novelSpliceSiteOutfile
and knownSpliceSiteInfile
aren't provided, because the default behavior is to use earlier splice site information for later reads. this is the --no-temp-splicesite
parameter again.
this is still a performance bug, but the patch is invalid.
improve perf scaling on large inputs (>1M reads) by not updating an unused sorted db for every align output
as far as i can tell, the SpliceSiteDB ona similar record keeping structure exists onAlnSink
is written but never read from.AlnSinkWrap
, which may be in use.SpliceSiteDB is updated for every aligned read reported to SAM (an output which is always generated?). because SpliceSiteDB is backed by a red-black tree, every update is a log(n) tree traversal with a possible rearrangement... for large numbers of reads, maintaining the structure starts to show an obvious cost.
hisat2 version 2.2.1 from bioconda for subsets of SRR12815479 on VectorBase-68_AaegyptiLVP_AGWG reference, on 8 cores, i observed the following times:
1,250 reads finished in 3.59s (0.3kread/s) 25,000 reads finished in 4.173s (5.9kread/s) 250,000 reads finished in 12.968s (19.2kread/s) 1,000,000 reads finished in 44s (22.6kread/s) 3,000,000 reads finished in 237s (12.6kread/s) 12,000,000 reads ran for over 3480+s (<3.5kread/s) before i gave up.
after this patch, i observe: 25,000 reads in 3.26s (7.7kread/s) ( 1,000,000 reads in 28.74s (34.8kread/s) (1.5x speedup) 3,000,000 reads in 89.4s (33.5kread/s) (2.65x speedup) 12,000,000 reads in 313s (38.3kread/s) (>10x speedup)