Hi Greg, thanks for reaching out.
I'm not quite at feature parity yet but I think it's approaching being of some use!
Implemented:
Experimental and/or untested:
> A really efficient implementation could potentially achieve ~ 2x speedup if the build could be made one time only (1/3 of 75% = 25% total) and the fat associated with the calling R code be removed (25% of total)
This implementation does check both of those boxes. My strategy has been to construct an "arena" which stores one query tree per neuron (along with their tangents and so on). As each tree is built once and never mutated, it can make use of the R* tree's bulk loading for improved query performance, and the layout of that tree is optimised for its point cloud. When a point cloud is added to the arena, the library returns an index to it, and any subsequent calls just use that index to minimise the amount of data being passed around. Because this lives entirely within Rust, you don't have to cross the language boundary at any point in the midst of an NxM query. Even if you do make multiple trips across it (as I do sometimes for progress-checking purposes), you have to pass in very little data for each request.
The RAM cost of storing all of the trees could feasibly become prohibitive with this tactic, but I haven't noticed much of an issue so far on my laptop - if I get round to compiling this to WebAssembly for use in the frontend it may become more pressing.
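In rough outline it looks something like the sketch below - the names are illustrative rather than the actual nblast-rs API:

```rust
// Illustrative sketch of the arena pattern -- names are made up, not the
// actual nblast-rs API.
use rstar::{primitives::GeomWithData, RTree};

type Point = [f64; 3];
// Each tree entry carries its point index so the matching tangent can be found.
type IndexedPoint = GeomWithData<Point, usize>;

struct Neuron {
    tree: RTree<IndexedPoint>, // bulk-loaded once, never mutated
    tangents: Vec<[f64; 3]>,   // one unit tangent per point
}

#[derive(Default)]
struct Arena {
    neurons: Vec<Neuron>,
}

impl Arena {
    /// Build the R* tree once (bulk load) and hand back an index; later calls
    /// (e.g. query one neuron against another by index) only need to pass
    /// small integers across the language boundary.
    fn add_points(&mut self, points: Vec<Point>, tangents: Vec<[f64; 3]>) -> usize {
        let entries = points
            .into_iter()
            .enumerate()
            .map(|(i, p)| IndexedPoint::new(p, i))
            .collect();
        self.neurons.push(Neuron {
            tree: RTree::bulk_load(entries),
            tangents,
        });
        self.neurons.len() - 1
    }
}
```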
> I did not figure out how to exploit the situation of having indices for both query and target neurons
This isn't something I've looked at - as you have to iterate through every point in the query anyway, I don't have much of an intuition for how a spatial index over the query would be helpful.
Anyway, some numbers. I was using the kcs20 data set for all-by-all queries, and ChaMARCM-F000586_seg002 to FruMARCM-F000085_seg001 for single queries. On my laptop with 16 cores, 32GB RAM, benchmarking the rust directly:
```
test bench_all_to_all_parallel ... bench: 16,462,084 ns/iter (+/- 831,225)
test bench_all_to_all_serial ... bench: 152,795,251 ns/iter (+/- 11,454,591)
test bench_arena_construction ... bench: 926,280 ns/iter (+/- 68,726)
test bench_arena_query ... bench: 585,702 ns/iter (+/- 35,637)
test bench_arena_query_geom ... bench: 1,075,499 ns/iter (+/- 67,160)
test bench_arena_query_norm ... bench: 576,801 ns/iter (+/- 51,104)
test bench_arena_query_norm_geom ... bench: 1,061,380 ns/iter (+/- 53,128)
test bench_query ... bench: 571,131 ns/iter (+/- 31,717)
test bench_rstarpt_construction ... bench: 967,449 ns/iter (+/- 39,562)
test bench_rstarpt_construction_with_tangents ... bench: 110,743 ns/iter (+/- 8,500)
```
"geom" stands for geometric-mean forward/backward normalisation, and "norm" for self-hit normalisation. Counterintuitively, "rstarpt_construction_with_tangents" means the tangents are passed in already calculated, whereas "rstarpt_construction" calculates them internally (hence the former being much faster).
In real-world usage, I ran an all-by-all from python against the 2500ish larval brain neurons, resampled to 1um (mean 481 points). The resampling, R* construction and tangent calculation were pretty quick, even in serial. The all-by-all query took 6 minutes when parallelised over 16 cores, using under 1GB of RAM in total. RAM usage for the trees should be linear in N, and compute time (plus RAM usage for the results) quadratic.
I haven't looked into any alternative spatial tree implementations yet, which is certainly a possibility going forward. It's currently using 64-bit floats internally but I can test against a lower bit depth fairly easily (a bit more work to make it generic).
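Making it generic would mostly mean parameterising the coordinate type, along the lines of this sketch (using the num-traits crate; hypothetical, not the current code, which is hard-wired to f64):

```rust
// Sketch only: making the core generic over the coordinate type.
use num_traits::Float;

fn squared_dist<T: Float>(a: &[T; 3], b: &[T; 3]) -> T {
    let dx = a[0] - b[0];
    let dy = a[1] - b[1];
    let dz = a[2] - b[2];
    dx * dx + dy * dy + dz * dz
}

// e.g. squared_dist::<f32>(&[0.0, 0.0, 0.0], &[1.0, 2.0, 2.0]) == 9.0
```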
I'd be happy to give it a go against one of your big neuron lists if you could give me the resampled dotprops objects (parquet format, maybe?), smat, and reference raw scores to check against. I can also put together an ipython notebook or similar with an example (until I get some actual python documentation up).
Thanks, @clbarnes! That kcs20 dataset is probably too small to be much use as a performance test, but for reference the serial implementation on my machine is ~ 190 ms, or about 25% slower than your bench_all_to_all_serial.
If I understand correctly, bench_all_to_all_parallel uses the arena, but I'm not sure which of the preprocessing steps are included. I normally precalculate the tangent vectors but nothing else (kd trees are not so easy to persist). For kcs20 on my machine the run time is only down to ~ 90ms – I think there's quite a bit of overhead to fork whereas I assume you are using threads, but that should quickly become unimportant as N goes up.
Here's some benchmark data that is probably a bit more representative – 250 largish neurons (order 1K nodes). It takes about 16s on my laptop (6 cores). fib250.csv.zip. If it takes less than ~ 5s for you then I think you have already achieved more optimisation than I would expect you to find easily.
smat is the one distributed with nat.nblast, so I guess you already have it.
bench_all_to_all* is just the query - the arena is pre-populated, so the trees and tangents are already calculated. I'll give it a go on the larger data set when I have the chance, thanks!
Using 6 threads, fib250 all-by-all is taking between 16 and 25 seconds (mainly 18-19s), over 90% of which is spent in the spatial query library. Any gains from the improved tree layout, caching the spatial tree, lower-level parallelisation, and avoiding bounces between high and low level languages are more than wiped out by a slower spatial query.
A batch of optimizations giving another ~1.4x speedup beyond that is in Stoeoef/rstar#35.
> - the potential large gain would be if we could spatially index both the query and target points and use that to speed up the all by all point query problem. Some time back, I experimented a bit with postgresql and an R* library (I forget which now) and the results were not encouraging, but these were really focussed on indexing all points in the dataset simultaneously, whereas we need to handle group membership i.e. each neuron. Furthermore I did not figure out how to exploit the situation of having indices for both query and target neurons. I am curious if you have managed this. I looked into the literature a while back, but it was actually quite thin and I couldn't really find implementations.
I have a very messy version doing something like this. It uses the separate R* trees we already built for the query and target neurons, and does a DFS through the query R tree; at each query node it prunes every target-tree candidate whose minimum bounding-box distance to the query node's box exceeds the pruning distance, i.e. the minimum over target bounding boxes of the maximum, over the query box's corners, of the usual R-tree nearest-neighbour min-max pruning metric. Leaf query nodes then only need to start their search from their parent node's surviving target candidates. This gives another small speedup of about 1.3x on random R* trees (it will take some additional work to try it in nblast itself); a rough sketch of the pruning step is below the numbers:
```
all to all tree lookup time: [381.54 us 384.63 us 388.60 us]
all to all point lookup time: [490.87 us 494.61 us 499.56 us]
```
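Roughly, in terms of plain AABBs and made-up helpers (the PR itself works on rstar's internal node types, and uses the tighter MINMAXDIST metric rather than the looser max-corner bound used here):

```rust
// Rough sketch of the pruning step -- not the PR code.
#[derive(Clone, Copy)]
struct Aabb {
    min: [f64; 3],
    max: [f64; 3],
}

fn dist(a: &[f64; 3], b: &[f64; 3]) -> f64 {
    (0..3).map(|d| (a[d] - b[d]).powi(2)).sum::<f64>().sqrt()
}

impl Aabb {
    fn corners(&self) -> [[f64; 3]; 8] {
        let mut out = [[0.0; 3]; 8];
        for (i, c) in out.iter_mut().enumerate() {
            for d in 0..3 {
                c[d] = if (i >> d) & 1 == 0 { self.min[d] } else { self.max[d] };
            }
        }
        out
    }

    /// Smallest possible distance between any point of `self` and any point of `other`.
    fn min_dist(&self, other: &Aabb) -> f64 {
        let mut sq = 0.0;
        for d in 0..3 {
            let gap = (other.min[d] - self.max[d])
                .max(self.min[d] - other.max[d])
                .max(0.0);
            sq += gap * gap;
        }
        sq.sqrt()
    }

    /// Upper bound on the distance from point `p` to its nearest point in `self`
    /// (here just the farthest-corner distance; MINMAXDIST would be tighter).
    fn upper_bound_dist(&self, p: &[f64; 3]) -> f64 {
        self.corners().iter().map(|c| dist(c, p)).fold(0.0, f64::max)
    }
}

/// Drop every target box that cannot contain the nearest neighbour of any
/// point in `query`: some candidate is guaranteed to have a point within
/// `prune_dist` of every query point, so boxes further away than that are out.
fn prune(query: &Aabb, candidates: &[Aabb]) -> Vec<Aabb> {
    let prune_dist = candidates
        .iter()
        .map(|t| {
            query
                .corners()
                .iter()
                .map(|c| t.upper_bound_dist(c))
                .fold(0.0, f64::max)
        })
        .fold(f64::INFINITY, f64::min);
    candidates
        .iter()
        .copied()
        .filter(|t| t.min_dist(query) <= prune_dist)
        .collect()
}
```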
I know the description is vague, but does that sound like what you had in mind here @jefferis?
The algorithm I described is PR'd as Stoeoef/rstar#41.
Responding to @jefferis's comment at rstar here:
> Dear Andrew, I realised I dropped the ball on responding to an earlier query on this.
No worries, I don't think this is urgent for anyone. I just wanted to PR it before it went stale in my brain.
> However I have to say that I am rather disappointed about the final speed-up. I had honestly expected something in the range of an order of magnitude given consistent locality for many points.
I am likewise underwhelmed, but the benchmark numbers I've given so far are just for random trees in rstar and not yet the neuron data from nblast-rs.
There are still incremental improvements to the algorithm possible, such as smarter choice of which subtrees to expand for potential pruning (i.e. should it be those with maximum minimum distance to the query node so their outliers can be discarded earlier, or those providing the minimum pruning distance so that the pruning distance can be tighter), but those would be small changes in performance.
This was mostly a one-to-two-day distraction, so I've not spent time closely investigating the algorithm's behavior.
> Is it possible in the end that the density of points in neuron data is just a bit low?
My intuition is that the density and distribution properties of neuron data will perform better under these optimizations than the random point cloud data from the benchmarks, since the tree structure is better segregated and thus more can be pruned (and more otherwise redundant computation avoided). However, I doubt that will lead to order of magnitude changes and is likely to be another small linear scaling factor change.
> Do you know where the algorithm is spending time right now?
Once I got rid of allocation overhead in a previous PR, the last dtrace profiling showed it spending the vast majority of time in distance calculations, as one would expect of an efficient implementation. The best way to improve that would be SIMD optimization, which is not yet as stabilized in Rust as in C++, so the C++ lib underlying your R implementation may be making up for a less optimal algorithm with more optimal linalg. The way rstar is written is not necessarily ideal for taking advantage of existing autovectorization in Rust. Taking full advantage of sustained SIMD would also require batching the node queries rather than doing one-by-one iteration as is currently being done. That is to say, until this becomes a hard bottleneck it's likely not worth the engineering required.
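For illustration, the kind of batched kernel that autovectorizes well looks something like this (a toy sketch, not rstar code):

```rust
// Toy sketch: a batched distance kernel over a flat coordinate array, the
// sort of loop LLVM autovectorizes, as opposed to the pointer-chasing,
// one-node-at-a-time traversal an R-tree query does internally.
fn squared_dists_to(query: [f64; 3], targets_xyz: &[f64], out: &mut [f64]) {
    assert_eq!(targets_xyz.len(), out.len() * 3);
    for (o, t) in out.iter_mut().zip(targets_xyz.chunks_exact(3)) {
        let dx = t[0] - query[0];
        let dy = t[1] - query[1];
        let dz = t[2] - query[2];
        *o = dx * dx + dy * dy + dz * dz;
    }
}
```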
Another overarching limit to our current performance could be that our trees aren't actually that big (they could fit in L2); there are just a lot of them. Hence improvements like this, which change some O factors for individual trees, don't have a huge practical effect for us, because the constant factors are still outside those loops (and thus aren't reduced), and the overhead of tree construction etc. becomes a larger and larger proportion of the remaining wall time.
Also to note if this did become a bottleneck, there's lots of interesting lit on GPU algorithms for point cloud kNN.
At this point I would try to benchmark actual performance with a large number of parallel threads/processes for jobs taking >= several minutes. I don't get great parallel speedup right now (between 4 and 12 threads are useful, from laptop to largeish server) because the standard parallel implementation backend is based on fork, and somehow some objects in memory don't seem to get shared between processes. You can then quite quickly get into issues with swapping etc.
In an (NxN) all by all NBLAST I would be very surprised if tree construction ends up being a major fraction of compute time once N>a few hundred. However I have long been meaning to optimise memory issues by changing the pattern in which I work for matrices with thousands of neurons. This would be to do blocks of neurons rather than whole columns or rows.
> At this point I would try to benchmark actual performance with a large number of parallel threads/processes for jobs taking >= several minutes.
@clbarnes I assume you'll run this on fib250 once TargetNeuron can accommodate whole-neuron queries, but do you have a larger benchmark too? I have my 1018 to L1 dataset but that has many issues that make it not a great tuning target.
> This would be to do blocks of neurons rather than whole columns or rows.
Good point, a typical cache-oblivious recursive blocking should be easy here.
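Something along these lines, as a sketch (with `score` standing in for a single pairwise NBLAST comparison):

```rust
// Sketch of cache-oblivious recursive blocking over the score matrix: split
// the longer axis until a block of query x target pairs is small enough that
// both sets of trees stay in cache, then evaluate that block densely.
// `block` is the base-case edge length and must be >= 1.
fn blocked_all_by_all<F: FnMut(usize, usize)>(
    queries: std::ops::Range<usize>,
    targets: std::ops::Range<usize>,
    block: usize,
    score: &mut F,
) {
    let (nq, nt) = (queries.len(), targets.len());
    if nq <= block && nt <= block {
        for q in queries {
            for t in targets.clone() {
                score(q, t);
            }
        }
    } else if nq >= nt {
        let mid = queries.start + nq / 2;
        blocked_all_by_all(queries.start..mid, targets.clone(), block, score);
        blocked_all_by_all(mid..queries.end, targets, block, score);
    } else {
        let mid = targets.start + nt / 2;
        blocked_all_by_all(queries.clone(), targets.start..mid, block, score);
        blocked_all_by_all(queries, mid..targets.end, block, score);
    }
}

// e.g. blocked_all_by_all(0..n_neurons, 0..n_neurons, 64, &mut |q, t| { /* score pair */ });
```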
Notes from our brief discussion last week to keep this issue in sync: (rough bench with both query and target trees of the same size)
Worth noting that there is now a Rust implementation of nabo (the ANN library used by the reference NBLAST implementation), plus a fork of that supporting different boundary conditions, as well as a different kNN implementation which claims to beat it (although the gains may be through parallelisation, which may not help very much in our case). Worth investigating.
This might be interesting too: https://github.com/cavemanloverboy/FNNTW
Written in Rust and pretty new. Benchmarks look good but I haven't tried it myself.
Should have been clearer, that was the project I linked to under "different kNN implementation" above!
My bad - should have checked myself. Funny that we came across it independently though 😄
No significant speedups using the rust nabo implementation, unfortunately. There's probably some low-hanging fruit for optimisations in that implementation as it does quite a lot of extra array copies to cope with different lifetime restrictions in the two implementations. Fnntw is even more awkward lifetime-wise but also puts a lot of work into optimising build time with very minor gains in query time, so may not be much use to us anyway.
On the plus side, the JavaScript bindings to the WebAssembly build are working!
Good to know! The kiddo v2 seems interesting too; annoying that rstar, kiddo v2, and bosque haven't ended up in a table together yet...
Hi @clbarnes, @schlegelp and @NikDrummond reminded me about this effort today. This is not really an issue, but I figure the discussion may as well live with the repo. I was curious about where you had got with optimisation and whether your progress is likely to be of interest for us.
The main use case that causes us trouble at the moment is when we are dealing with N>10K neurons with median 500 nodes each in "dotprops" format. My analysis in this case is that, with the number of comparisons being N^2, the costs of the preprocessing steps (making the dotprops objects etc., which are ~ N) are largely irrelevant (and can often be cached).
Therefore the area for optimisation should be the actual comparison step. The R implementation typically spends 75+% of its time inside the nabor::knn function, which wraps an efficient C++ nearest neighbour library. There are a few possible ways that things could be improved. A really efficient implementation could potentially achieve ~ 2x speedup if the build could be made one time only (1/3 of 75% = 25% of total) and the fat associated with the calling R code could be removed (the remaining ~25% of total); cutting both of those 25% chunks leaves roughly half the runtime, hence ~2x. There could be an additional speed gain if the points could be converted to 32 bit or even 16 bit ints.