NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Fix quadratic algorithm in merge_streams #314

Open lars-t-hansen opened 11 months ago

lars-t-hansen commented 11 months ago

For the longer load reports, more than 50% of the running time is spent in merge_streams, though perf is not terribly helpful at pinpointing more than that.

Adding a little debug printing we find:

ml1.hpc.uio.no: MERGING 222 STREAMS, OUTER ITER 6936
ml8.hpc.uio.no: MERGING 9746 STREAMS, OUTER ITER 8619
ml4.hpc.uio.no: MERGING 1476 STREAMS, OUTER ITER 7461
ml9.hpc.uio.no: MERGING 61 STREAMS, OUTER ITER 8631
ml7.hpc.uio.no: MERGING 661 STREAMS, OUTER ITER 8632
ml6.hpc.uio.no: MERGING 42846 STREAMS, OUTER ITER 8631
ml3.hpc.uio.no: MERGING 629 STREAMS, OUTER ITER 8631
ml2.hpc.uio.no: MERGING 24 STREAMS, OUTER ITER 8631

That's more streams than I expected, but there is one stream per job and this is data from a full month, so in hindsight the numbers are unsurprising.

The merging is a nested loop operating on the set of sorted-by-ascending-time streams. The outer loop first finds the lowest timestamp among the timestamps at the heads of the streams, and then it collects all records across all the streams with roughly that time and removes them. It repeats until all streams are empty. The number of iterations of the outer loop will correspond roughly to the number of time steps in the window of data we're looking at: 30 days at 5-minute intervals yields exactly 8640 time steps.
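A minimal sketch of that nested-loop structure, assuming a simplified `Sample` record with just a timestamp (the real merge_streams selection logic is more involved; names here are illustrative, not the actual Jobanalyzer code):

```rust
// Illustrative sketch of the nested-loop merge described above.
#[derive(Clone, Debug, PartialEq)]
struct Sample {
    time: u64, // timestamp, e.g. seconds since epoch
}

// Merge sorted-by-ascending-time streams.  Each outer iteration finds the
// minimum head timestamp, then sweeps all streams collecting records at that
// time.  Both inner scans touch every stream, so the work is O(t * s) for
// t time steps and s streams.
fn merge_streams(streams: &[Vec<Sample>]) -> Vec<Vec<Sample>> {
    let mut indices = vec![0usize; streams.len()];
    let mut merged = Vec::new();
    loop {
        // First inner loop: lowest timestamp among the stream heads.
        let min_time = streams
            .iter()
            .zip(&indices)
            .filter(|(s, &i)| i < s.len())
            .map(|(s, &i)| s[i].time)
            .min();
        let min_time = match min_time {
            Some(t) => t,
            None => break, // all streams exhausted
        };
        // Second inner loop: collect all records with that time and advance
        // the per-stream indices past them.
        let mut bucket = Vec::new();
        for (s, i) in streams.iter().zip(indices.iter_mut()) {
            while *i < s.len() && s[*i].time == min_time {
                bucket.push(s[*i].clone());
                *i += 1;
            }
        }
        merged.push(bucket);
    }
    merged
}
```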

The data structure is a vector of vectors for the streams, and a vector of ints for indices into those streams.

The first loop is therefore a linear search across the heads of all those vectors, but it could probably be replaced by some kind of priority queue - we're only looking for a single minimum, yet we currently examine every stream head.
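A sketch of the priority-queue idea, assuming streams are reduced to plain timestamp vectors for illustration: keep (head timestamp, stream index) pairs in a std `BinaryHeap` wrapped in `Reverse` to get min-heap behavior, so the lowest head time is found in O(log s) rather than O(s). The function name is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Illustrative: return stream indices ordered by their leading (head)
// timestamp, using a min-heap instead of a linear scan.  In a real merge the
// heap would be re-fed as stream heads advance.
fn streams_by_head_time(streams: &[Vec<u64>]) -> Vec<usize> {
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = streams
        .iter()
        .enumerate()
        .filter(|(_, s)| !s.is_empty())
        .map(|(i, s)| Reverse((s[0], i)))
        .collect();
    let mut order = Vec::new();
    while let Some(Reverse((_, i))) = heap.pop() {
        order.push(i);
    }
    order
}
```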

The second loop is trickier because the selection logic is far from straightforward, but again, if the streams were sorted with lowest leading timestamp first we could probably pick off the streams at the head and ignore the rest, and not loop over everything. That said, sample times will be highly correlated and in fact the second loop may pick up records from many or indeed all of the streams.

lars-t-hansen commented 11 months ago

While there are micro-optimizations that apply in the inner loops of merge_streams, the essential insight is probably that there is usually a smallish number of jobs - really O(1) - that are active at any one time, and the trick to performance is to manage the set of sample streams such that we only look at the ones that are running: by looking at all streams at every time step we make the algorithm O(t^2) where t is the number of time steps. If we can concentrate on the live streams we can use the naive algorithms. The set of streams can probably be divided into three: the ones that are in the past, the ones that are running, and the ones that are in the future. Streams can be moved between sets (this is awkward because of Rust's ownership rules and because these data structures are just Vec&lt;Box&gt;, not Vec&lt;Rc&gt;) and the scans will only process the set of running streams.
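One way around the ownership awkwardness would be to partition stream *indices* rather than the streams themselves; a hedged sketch of that idea, with illustrative names and a simplified notion of "running" (first unconsumed timestamp at or before the current time):

```rust
// Illustrative three-way partition of streams into past / running / future.
// We classify by index so the stream data itself never moves, sidestepping
// the Vec<Box>-vs-Vec<Rc> ownership problem mentioned above.
fn partition_streams(
    streams: &[Vec<u64>], // each inner Vec: ascending timestamps
    consumed: &[usize],   // records already consumed, per stream
    now: u64,             // current time step of the merge
) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let (mut past, mut running, mut future) = (Vec::new(), Vec::new(), Vec::new());
    for (i, s) in streams.iter().enumerate() {
        if consumed[i] >= s.len() {
            past.push(i); // fully consumed: entirely in the past
        } else if s[consumed[i]] <= now {
            running.push(i); // has an unconsumed record at or before `now`
        } else {
            future.push(i); // first unconsumed record is later than `now`
        }
    }
    (past, running, future)
}
```

The scans in the merge would then iterate only over the `running` set, moving indices between sets as time advances.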

All attempts at implementing this efficiently enough to beat the highly optimized naive loops have so far met with failure. I think in some cases the resulting code is so complicated that it confuses Rust's bounds check optimization, but that is really TBD.

I'll test the optimizations for the naive loops, which make a fair amount of difference on their own, and then land those and leave this bug open.

lars-t-hansen commented 11 months ago

I merged the micro-optimizations. They give us about a 25% improvement on 30 days of data and 10% on 90 days (the big-Oh strikes back); the new output was bitwise identical to the old and there didn't seem to be any reason to hold this back.

lars-t-hansen commented 11 months ago

I think we'll back-burner what remains here until we have more data from Fox, Saga et al. We want to know what a typical input size is. It could well be that the number of streams on the ML systems is very high because these are systems without a batch queue.

lars-t-hansen commented 1 month ago

Finally some credible data from Fox. Running a 90-day report remotely with the sonalyzed cache enabled takes about 3 minutes, quite a long time and much longer than for the ML systems (about 20s for the same report, iirc). For Betzy this will not do. We could hack around it for Betzy by partitioning in various ways - entire nodes are allocated to jobs, so if a node is involved with one job at one time it is not involved with the others - but it would be better to fix the algorithm somehow.