Hi Greg, thanks for reaching out.
I'm not quite at feature parity yet but I think it's approaching being of some use!
Implemented:
Experimental and/or untested:
> A really efficient implementation could potentially achieve ~ 2x speedup if the build could be made one time only (1/3 of 75% = 25% total) and the fat associated with the calling R code be removed (25% of total)
This implementation does check both of those boxes. My strategy has been to construct an "arena" which stores one query tree per neuron (along with their tangents and so on). As each tree is built once and never mutated, it can make use of the R* tree's bulk loading for improved query performance, and the layout of that tree is optimised for its point cloud. When a point cloud is added to the arena, the library returns an index to it, and any subsequent calls just use that index to minimise the amount of data being passed around. Because this lives entirely within Rust, you don't have to cross the language boundary at any point in the midst of an NxM query. Even if you do make multiple trips across it (as I do sometimes for progress-checking purposes), you have to pass in very little data for each request.
The RAM cost of storing all of the trees could feasibly become prohibitive with this tactic, but I haven't noticed much of an issue so far on my laptop - if I get round to compiling this to WebAssembly for use in the frontend it may become more pressing.
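In rough outline it looks something like the sketch below - the names are illustrative rather than the actual nblast-rs API:

```rust
// Illustrative sketch of the arena pattern -- names are made up, not the
// actual nblast-rs API.
use rstar::{primitives::GeomWithData, RTree};

type Point = [f64; 3];
// Each tree entry carries its point index so the matching tangent can be found.
type IndexedPoint = GeomWithData<Point, usize>;

struct Neuron {
    tree: RTree<IndexedPoint>, // bulk-loaded once, never mutated
    tangents: Vec<[f64; 3]>,   // one unit tangent per point
}

#[derive(Default)]
struct Arena {
    neurons: Vec<Neuron>,
}

impl Arena {
    /// Build the R* tree once (bulk load) and hand back an index; later calls
    /// (e.g. query one neuron against another by index) only need to pass
    /// small integers across the language boundary.
    fn add_points(&mut self, points: Vec<Point>, tangents: Vec<[f64; 3]>) -> usize {
        let entries = points
            .into_iter()
            .enumerate()
            .map(|(i, p)| IndexedPoint::new(p, i))
            .collect();
        self.neurons.push(Neuron {
            tree: RTree::bulk_load(entries),
            tangents,
        });
        self.neurons.len() - 1
    }
}
```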
> I did not figure out how to exploit the situation of having indices for both query and target neurons
This isn't something I've looked at - as you have to iterate through every point in the query anyway, I don't have much of an intuition for how a spatial index over the query would be helpful.
Anyway, some numbers. I was using the kcs20 data set for all-by-all queries, and ChaMARCM-F000586_seg002 to FruMARCM-F000085_seg001 for single queries. On my laptop with 16 cores, 32GB RAM, benchmarking the rust directly:
```
test bench_all_to_all_parallel ... bench: 16,462,084 ns/iter (+/- 831,225)
test bench_all_to_all_serial ... bench: 152,795,251 ns/iter (+/- 11,454,591)
test bench_arena_construction ... bench: 926,280 ns/iter (+/- 68,726)
test bench_arena_query ... bench: 585,702 ns/iter (+/- 35,637)
test bench_arena_query_geom ... bench: 1,075,499 ns/iter (+/- 67,160)
test bench_arena_query_norm ... bench: 576,801 ns/iter (+/- 51,104)
test bench_arena_query_norm_geom ... bench: 1,061,380 ns/iter (+/- 53,128)
test bench_query ... bench: 571,131 ns/iter (+/- 31,717)
test bench_rstarpt_construction ... bench: 967,449 ns/iter (+/- 39,562)
test bench_rstarpt_construction_with_tangents ... bench: 110,743 ns/iter (+/- 8,500)
```
"geom" stands for geometric-mean forward/backward normalisation, and "norm" for self-hit normalisation. Counterintuitively, "rstarpt_construction_with_tangents" means the tangents are passed in already calculated, whereas "rstarpt_construction" calculates them internally (hence the former being much faster).
In real-world usage, I ran an all-by-all from python against the 2500ish larval brain neurons, resampled to 1um (mean 481 points). The resampling, R* construction and tangent calculation were pretty quick, even in serial. The all-by-all query took 6 minutes when parallelised over 16 cores, using under 1GB of RAM in total. RAM usage for the trees should be linear in N, and compute time (plus RAM usage for the results) quadratic.
I haven't looked into any alternative spatial tree implementations yet, which is certainly a possibility going forward. It's currently using 64-bit floats internally but I can test against a lower bit depth fairly easily (a bit more work to make it generic).
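Making it generic would mostly mean parameterising the coordinate type, along the lines of this sketch (using the num-traits crate; hypothetical, not the current code, which is hard-wired to f64):

```rust
// Sketch only: making the core generic over the coordinate type.
use num_traits::Float;

fn squared_dist<T: Float>(a: &[T; 3], b: &[T; 3]) -> T {
    let dx = a[0] - b[0];
    let dy = a[1] - b[1];
    let dz = a[2] - b[2];
    dx * dx + dy * dy + dz * dz
}

// e.g. squared_dist::<f32>(&[0.0, 0.0, 0.0], &[1.0, 2.0, 2.0]) == 9.0
```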
I'd be happy to give it a go against one of your big neuron lists if you could give me the resampled dotprops objects (parquet format, maybe?), smat, and reference raw scores to check against. I can also put together an ipython notebook or similar with an example (until I get some actual python documentation up).
Thanks, @clbarnes! That kcs20 dataset is probably too small to be much use as a performance test, but for reference the serial implementation on my machine is ~ 190 ms, or about 25% slower than your bench_all_to_all_serial.
If I understand correctly, bench_all_to_all_parallel uses the arena, but I'm not sure which of the preprocessing steps are included. I normally precalculate the tangent vectors but nothing else (kd trees are not so easy to persist). For kcs20 on my machine the run time is only down to ~ 90ms – I think there's quite a bit of overhead to fork whereas I assume you are using threads, but that should quickly become unimportant as N goes up.
Here's some benchmark data that is probably a bit more representative – 250 largish neurons (order 1K nodes). It takes about 16s on my laptop (6 cores). fib250.csv.zip. If it takes less than ~ 5s for you then I think you have already achieved more optimisation than I would expect you to find easily.
smat is the one distributed with nat.nblast, so I guess you already have it.
bench_all_to_all* is just the query - the arena is pre-populated, so the trees and tangents are already calculated. I'll give it a go on the larger data set when I have the chance, thanks!
Using 6 threads, fib250 all-by-all is taking between 16 and 25 seconds (mainly 18-19s), over 90% of which is spent in the spatial query library. Any gains from the improved tree layout, caching the spatial tree, lower-level parallelisation, and avoiding bounces between high and low level languages are more than wiped out by a slower spatial query.
A batch of optimizations giving another ~1.4x speedup beyond that is in Stoeoef/rstar#35.
> - the potential large gain would be if we could spatially index both the query and target points and use that to speed up the all by all point query problem. Some time back, I experimented a bit with postgresql and an R* library (I forget which now) and the results were not encouraging, but these were really focussed on indexing all points in the dataset simultaneously, whereas we need to handle group membership i.e. each neuron. Furthermore I did not figure out how to exploit the situation of having indices for both query and target neurons. I am curious if you have managed this. I looked into the literature a while back, but it was actually quite thin and I couldn't really find implementations.
I have a very messy version doing something like this. It uses the separate R* trees we already built for the query and target neurons, and does a DFS through the query R tree; at each query node it prunes every target-tree candidate whose minimum bounding-box distance to the query node's box exceeds the pruning distance, i.e. the minimum over target bounding boxes of the maximum, over the query box's corners, of the usual R-tree nearest-neighbour min-max pruning metric. Leaf query nodes then only need to start their search from their parent node's surviving target candidates. This gives another small speedup of about 1.3x on random R* trees (it will take some additional work to try it in nblast itself); a rough sketch of the pruning step is below the numbers:
```
all to all tree lookup time: [381.54 us 384.63 us 388.60 us]
all to all point lookup time: [490.87 us 494.61 us 499.56 us]
```
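Roughly, in terms of plain AABBs and made-up helpers (the PR itself works on rstar's internal node types, and uses the tighter MINMAXDIST metric rather than the looser max-corner bound used here):

```rust
// Rough sketch of the pruning step -- not the PR code.
#[derive(Clone, Copy)]
struct Aabb {
    min: [f64; 3],
    max: [f64; 3],
}

fn dist(a: &[f64; 3], b: &[f64; 3]) -> f64 {
    (0..3).map(|d| (a[d] - b[d]).powi(2)).sum::<f64>().sqrt()
}

impl Aabb {
    fn corners(&self) -> [[f64; 3]; 8] {
        let mut out = [[0.0; 3]; 8];
        for (i, c) in out.iter_mut().enumerate() {
            for d in 0..3 {
                c[d] = if (i >> d) & 1 == 0 { self.min[d] } else { self.max[d] };
            }
        }
        out
    }

    /// Smallest possible distance between any point of `self` and any point of `other`.
    fn min_dist(&self, other: &Aabb) -> f64 {
        let mut sq = 0.0;
        for d in 0..3 {
            let gap = (other.min[d] - self.max[d])
                .max(self.min[d] - other.max[d])
                .max(0.0);
            sq += gap * gap;
        }
        sq.sqrt()
    }

    /// Upper bound on the distance from point `p` to its nearest point in `self`
    /// (here just the farthest-corner distance; MINMAXDIST would be tighter).
    fn upper_bound_dist(&self, p: &[f64; 3]) -> f64 {
        self.corners().iter().map(|c| dist(c, p)).fold(0.0, f64::max)
    }
}

/// Drop every target box that cannot contain the nearest neighbour of any
/// point in `query`: some candidate is guaranteed to have a point within
/// `prune_dist` of every query point, so boxes further away than that are out.
fn prune(query: &Aabb, candidates: &[Aabb]) -> Vec<Aabb> {
    let prune_dist = candidates
        .iter()
        .map(|t| {
            query
                .corners()
                .iter()
                .map(|c| t.upper_bound_dist(c))
                .fold(0.0, f64::max)
        })
        .fold(f64::INFINITY, f64::min);
    candidates
        .iter()
        .copied()
        .filter(|t| t.min_dist(query) <= prune_dist)
        .collect()
}
```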
I know the description is vague, but does that sound like what you had in mind here @jefferis?
The algorithm I described is PR'd as Stoeoef/rstar#41.
Responding to @jefferis's comment at rstar here:
> Dear Andrew, I realised I dropped the ball on responding to an earlier query on this.
No worries, I don't think this is urgent for anyone. I just wanted to PR it before it went stale in my brain.
> However I have to say that I am rather disappointed about the final speed-up. I had honestly expected something in the range of an order of magnitude given consistent locality for many points.
I am likewise underwhelmed, but the benchmark numbers I've given so far are just for random trees in rstar and not yet the neuron data from nblast-rs.
There are still incremental improvements to the algorithm possible, such as smarter choice of which subtrees to expand for potential pruning (i.e. should it be those with maximum minimum distance to the query node so their outliers can be discarded earlier, or those providing the minimum pruning distance so that the pruning distance can be tighter), but those would be small changes in performance.
This was mostly a one-to-two-day distraction, so I've not spent time closely investigating the algorithm's behavior.
> Is it possible in the end that the density of points in neuron data is just a bit low?
My intuition is that the density and distribution properties of neuron data will perform better under these optimizations than the random point cloud data from the benchmarks, since the tree structure is better segregated and thus more can be pruned (and more otherwise redundant computation avoided). However, I doubt that will lead to order of magnitude changes and is likely to be another small linear scaling factor change.
> Do you know where the algorithm is spending time right now?
Once I got rid of allocation overhead in a previous PR, the last dtrace profiling showed it spending the vast majority of time in distance calculations, as one would expect of an efficient implementation. The best way to improve that would be SIMD optimization, which is not yet as stabilized in Rust as in C++, so the C++ lib underlying your R implementation may be making up for a less optimal algorithm with more optimal linalg. The way rstar is written is not necessarily ideal for taking advantage of existing autovectorization in Rust. Taking full advantage of sustained SIMD would also require batching the node queries rather than doing one-by-one iteration as is currently being done. That is to say, until this becomes a hard bottleneck it's likely not worth the engineering required.
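For illustration, the kind of batched kernel that autovectorizes well looks something like this (a toy sketch, not rstar code):

```rust
// Toy sketch: a batched distance kernel over a flat coordinate array, the
// sort of loop LLVM autovectorizes, as opposed to the pointer-chasing,
// one-node-at-a-time traversal an R-tree query does internally.
fn squared_dists_to(query: [f64; 3], targets_xyz: &[f64], out: &mut [f64]) {
    assert_eq!(targets_xyz.len(), out.len() * 3);
    for (o, t) in out.iter_mut().zip(targets_xyz.chunks_exact(3)) {
        let dx = t[0] - query[0];
        let dy = t[1] - query[1];
        let dz = t[2] - query[2];
        *o = dx * dx + dy * dy + dz * dz;
    }
}
```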
Another overarching limit to our current performance could be that our trees aren't actually that big (they could fit in L2); there are just a lot of them. Hence improvements like this, which change some O factors for individual trees, don't have a huge practical effect for us, because the constant factors are still outside those loops (and thus aren't reduced), and the overhead of tree construction etc. becomes a larger and larger proportion of the remaining wall time.
Also to note if this did become a bottleneck, there's lots of interesting lit on GPU algorithms for point cloud kNN.
At this point I would try to benchmark actual performance with a large number of parallel threads/processes for jobs taking >= several minutes. I don't get great parallel speedup right now (between 4 and 12 threads are useful, from laptop to largeish server) because the standard parallel implementation backend is based on fork, and somehow some objects in memory don't seem to get shared between processes. You can then quite quickly get into issues with swapping etc.
In an (NxN) all by all NBLAST I would be very surprised if tree construction ends up being a major fraction of compute time once N>a few hundred. However I have long been meaning to optimise memory issues by changing the pattern in which I work for matrices with thousands of neurons. This would be to do blocks of neurons rather than whole columns or rows.
> At this point I would try to benchmark actual performance with a large number of parallel threads/processes for jobs taking >= several minutes.
@clbarnes I assume you'll run this on fib250 once TargetNeuron can accommodate whole-neuron queries, but do you have a larger benchmark too? I have my 1018 to L1 dataset but that has many issues that make it not a great tuning target.
> This would be to do blocks of neurons rather than whole columns or rows.
Good point, a typical cache-oblivious recursive blocking should be easy here.
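Something along these lines, as a sketch (with `score` standing in for a single pairwise NBLAST comparison):

```rust
// Sketch of cache-oblivious recursive blocking over the score matrix: split
// the longer axis until a block of query x target pairs is small enough that
// both sets of trees stay in cache, then evaluate that block densely.
// `block` is the base-case edge length and must be >= 1.
fn blocked_all_by_all<F: FnMut(usize, usize)>(
    queries: std::ops::Range<usize>,
    targets: std::ops::Range<usize>,
    block: usize,
    score: &mut F,
) {
    let (nq, nt) = (queries.len(), targets.len());
    if nq <= block && nt <= block {
        for q in queries {
            for t in targets.clone() {
                score(q, t);
            }
        }
    } else if nq >= nt {
        let mid = queries.start + nq / 2;
        blocked_all_by_all(queries.start..mid, targets.clone(), block, score);
        blocked_all_by_all(mid..queries.end, targets, block, score);
    } else {
        let mid = targets.start + nt / 2;
        blocked_all_by_all(queries.clone(), targets.start..mid, block, score);
        blocked_all_by_all(queries, mid..targets.end, block, score);
    }
}

// e.g. blocked_all_by_all(0..n_neurons, 0..n_neurons, 64, &mut |q, t| { /* score pair */ });
```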
Notes from our brief discussion last week to keep this issue in sync: (rough bench with both query and target trees of the same size)
Worth noting that there is now a Rust implementation of nabo (the ANN library used by the reference NBLAST implementation), plus a fork of that supporting different boundary conditions, as well as a different kNN implementation which claims to beat it (although the gains may be through parallelisation, which may not help very much in our case). Worth investigating.
This might be interesting too: https://github.com/cavemanloverboy/FNNTW
Written in Rust and pretty new. Benchmarks look good but I haven't tried it myself.
Should have been clearer, that was the project I linked to under "different kNN implementation" above!
My bad - should have checked myself. Funny that we came across it independently though 😄
No significant speedups using the rust nabo implementation, unfortunately. There's probably some low-hanging fruit for optimisations in that implementation as it does quite a lot of extra array copies to cope with different lifetime restrictions in the two implementations. Fnntw is even more awkward lifetime-wise but also puts a lot of work into optimising build time with very minor gains in query time, so may not be much use to us anyway.
On the plus side, the JavaScript bindings to the WebAssembly build are working!
Good to know! The kiddo v2 seems interesting too; annoying that rstar, kiddo v2, and bosque haven't ended up in a table together yet...
Hi @clbarnes, @schlegelp and @NikDrummond reminded me about this effort today. This is not really an issue, but I figure the discussion may as well live with the repo. I was curious about where you had got with optimisation and whether your progress is likely to be of interest for us.
The main use case that causes us trouble at the moment is when we are dealing with N>10K neurons with median 500 nodes each in "dotprops" format. My analysis in this case is that, with the number of comparisons being N^2, the costs of the preprocessing steps (making the dotprops objects etc., which are ~ N) are largely irrelevant (and can often be cached).
Therefore the area for optimisation should be the actual comparison step. The R implementation typically spends 75+% of its time inside the nabor::knn function, which wraps an efficient C++ nearest neighbour library. There are a few possible ways that things could be improved. A really efficient implementation could potentially achieve ~ 2x speedup if the build could be made one time only (1/3 of 75% = 25% of total) and the fat associated with the calling R code could be removed (the remaining ~25% of total); cutting both of those 25% chunks leaves roughly half the runtime, hence ~2x. There could be an additional speed gain if the points could be converted to 32 bit or even 16 bit ints.