KlugerLab / FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Barnes-Hut by default for small N #89

Closed dkobak closed 4 years ago

dkobak commented 4 years ago

For small-ish sample sizes, Barnes-Hut is noticeably faster than the FFT-based interpolation. Given that FIt-SNE implements both, why not default to BH whenever N is small enough? This could be implemented in the wrappers, e.g. in Python by making nbody_algo='auto' the default.

@pavlin-policar What do you think about it? openTSNE also implements multicore BH (right?), so it could similarly use negative_gradient_method='auto' by default.

Not sure what a good cutoff would be, but maybe George knows.
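
Concretely, the dispatch would just be a size check, wherever it ends up living. Something like this (all names, the sentinel, and the integer encoding are purely illustrative, not the actual API; the cutoff is whatever we settle on):

int resolve_nbody_algo(int requested_algo, int N, int cutoff) {
    const int AUTO = 0, BARNES_HUT = 1, FFT_INTERPOLATION = 2;  // illustrative constants
    if (requested_algo != AUTO) return requested_algo;          // an explicit choice always wins
    return (N < cutoff) ? BARNES_HUT : FFT_INTERPOLATION;       // small N: BH, otherwise FFT
}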

pavlin-policar commented 4 years ago

Yes, this speed difference was the main reason I implemented BH as well. BH is near-instantaneous on small data sets, while FIt-SNE does take a bit of time. Because we were building this into Orange, which is an interactive tool, instant is much better than fast-ish. In Orange, I set the cutoff point to 10k samples; I tested this on a couple of data sets and was in no way systematic about it. Anyhow, Orange doesn't handle larger data sets very well yet, so in practice, BH is always used there.

I like the negative_gradient_method='auto' idea. I did notice at the time, though, that the cutoff for switching to exact nearest neighbor search from sklearn should be different from the cutoff for the gradient method.

dkobak commented 4 years ago

10000 sounds like a reasonable cutoff value.

I did notice at the time, though, that the cutoff for switching to exact nearest neighbor search from sklearn should be different from the cutoff for the gradient method.

Is exact nearest neighbor search noticeably faster than annoy/pynndescent for small N? I wasn't aware of that. Here in FIt-SNE we have VP trees for exact and annoy for approximate, so if VP trees are faster than annoy for small N, we could also have knn_algo='auto' by default and let the wrapper choose. But I have no idea if this makes any practical difference.

pavlin-policar commented 4 years ago

Is exact nearest neighbor search noticeably faster than annoy/pynndescent for small N?

Probably not significantly, but, again, it was a case of instant vs non-instant. pynndescent needs to go through JIT compilation, and numba is pretty big to load up, so at least a second is spent on that. sklearn has compiled Cython modules that run practically instantly.

dkobak commented 4 years ago

So I briefly looked into this, using subsets of MNIST (all default params, including the updated learning rate / number of iterations, as discussed elsewhere):

Barnes-Hut and FFT take around the same time for N=4000-5000. For N=3000 it's 16s (FFT) vs 14s (BH). For N=2000, it's 16s vs 8s. For N=1000, it's 13s vs 4s.

Based on this, I suggest setting the cutoff at 3000: for N<3000, BH is used by default; for N>=3000, FFT.

Regarding the kNN search, VP trees are indeed faster than Annoy for N<10000, but the difference is always very small (e.g. for N=10000 it's 1.4 vs 1.5 s), so I don't think it's worth setting any rules here. I'd just keep using Annoy by default.

dkobak commented 4 years ago

Wow, I just found out that Barnes-Hut in openTSNE is massively faster than in FIt-SNE. I wonder why! For N=3000 in openTSNE I get 39s vs 8s for FFT vs BH (on my laptop now). Compare to 22s / 17s using FIt-SNE on the same machine. openTSNE is 2x slower with FFT, but 2x faster with BH!

This shows why the reasonable cutoff is more like N=10000 in openTSNE but more like 3000 in FIt-SNE.

Still very puzzling though. Why would BH be faster in openTSNE? It's parallelized in both, and I'm running it on all threads.

dkobak commented 4 years ago

Waaaait @linqiaozhi it looks like the BH is running on one thread only. Is this a bug?

dkobak commented 4 years ago

It's not parallelized!!! https://github.com/KlugerLab/FIt-SNE/blob/master/src/sptree.cpp#L361 -- why did I think it was parallelized? I think we should add a PARALLEL_FOR there; it should be easy to do.

dkobak commented 4 years ago

So I spent the evening trying to parallelize the Barnes-Hut but failed :-(

There are two loops (for attractive and for repulsive forces), and if I parallelize either of them and run with nthreads>1, the algorithm converges to a meaningless embedding. (And still isn't as fast as openTSNE.) I have a hunch why the repulsion does not work (sum_Q is accumulated across threads), but I have no idea why the attraction does not work.
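
If that hunch is right, the standard fix would be one partial accumulator per point, reduced serially after the parallel loop. A sketch (tree, theta, neg_f, no_dims and the computeNonEdgeForces signature follow the BH t-SNE code as I understand it; sum_Q_partial is new):

double* sum_Q_partial = (double*) calloc(N, sizeof(double));  // one slot per point
PARALLEL_FOR(nthreads, N, {
    // each iteration writes only to its own slot, so there is no race on the sum
    tree->computeNonEdgeForces(loop_i, theta, neg_f + loop_i * no_dims, sum_Q_partial + loop_i);
});
double sum_Q = 0.0;
for (int n = 0; n < N; n++) sum_Q += sum_Q_partial[n];  // serial reduction
free(sum_Q_partial);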

There is MulticoreTSNE, which parallelized the original BH t-SNE code, including the BH computations: https://github.com/DmitryUlyanov/Multicore-TSNE/blob/master/multicore_tsne/tsne.cpp#L230. It seems that required some more serious refactoring...

dkobak commented 4 years ago

I fixed the multithreaded loop for the repulsive forces; it works now. But the attractive loop still does not! It works as a normal for loop, and as a multithreaded loop with nthreads=1, but a multithreaded loop with nthreads>1 returns a meaningless embedding. I just cannot see what can be wrong with it!! Would very much appreciate another pair of eyes here; I am a bit stuck.

pavlin-policar commented 4 years ago

It would be strange if openTSNE's BH were faster than the C++ implementation. I had benchmarked this against MulticoreTSNE, which is the fastest BH implementation I know of, and openTSNE was a bit slower.

I would suggest just integrating the code from MulticoreTSNE, but that uses OpenMP for parallelization, and I guess @linqiaozhi decided to use pthreads since they're less problematic on Macs.

@dkobak You need to watch out for shared memory when doing parallelization. Just replacing the for with the parallel for won't work because in computeEdgeForces you've got shared memory all over the place. ind1, ind2, D, and pos_f are all being changed inside the loop, as well as the buff class member. You're going to need to have one instance of these per thread if you want to parallelize those.
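
Roughly, the pattern is this (the scratch array stands in for the shared buff member, assuming the embedding dimension is at most 3; otherwise you'd allocate one buffer per thread up front):

PARALLEL_FOR(nthreads, N, {
    unsigned int ind1 = loop_i * dimension;  // declared inside the body, so each iteration gets its own copy
    double scratch[3];                       // per-iteration replacement for the shared buff member
    // ... compute the pairwise differences into scratch[] and write only to pos_f[ind1 + d] ...
});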

dkobak commented 4 years ago

It would be strange if openTSNE's BH were faster than the C++ implementation. I had benchmarked this against MulticoreTSNE, which is the fastest BH implementation I know of, and openTSNE was a bit slower.

Actually, I tried it yesterday at some point, and openTSNE with n_jobs=1 was quite a bit faster than FIt-SNE, so I concluded that your implementation is faster even on one core. I could check this again. I did not try MulticoreTSNE.

I would suggest just integrating the code from MulticoreTSNE, but that uses OpenMP for parallelization, and I guess @linqiaozhi decided to use pthreads since they're less problematic on Macs.

As far as I can see, FIt-SNE uses OpenMP too (if available): https://github.com/KlugerLab/FIt-SNE/blob/master/src/parallel_for.h

@dkobak You need to watch out for shared memory when doing parallelization. Just replacing the for with the parallel for won't work because in computeEdgeForces you've got shared memory all over the place. ind1, ind2, D, and pos_f are all being changed inside the loop, as well as the buff class member. You're going to need to have one instance of these per thread if you want to parallelize those.

Thanks! Here is my current version of this function:

void SPTree::computeEdgeForces(unsigned int *row_P, unsigned int *col_P, double *val_P, int N, double *pos_f, unsigned int nthreads) {
    // Loop over all edges in the graph

    // THIS PARALLEL_FOR DOES NOT WORK! COMPILES AND RUNS BUT THE EMBEDDING IS MEANINGLESS WITH NTHREADS>1
    //PARALLEL_FOR(nthreads, N, {
    for(unsigned int loop_i=0; loop_i<N; loop_i++) {
        unsigned int ind1 = loop_i * dimension;

        for (unsigned int i = row_P[loop_i]; i < row_P[loop_i + 1]; i++) {
            // Compute pairwise distance and Q-value
            double D = 1.0;
            unsigned int ind2 = col_P[i] * dimension;
            for (unsigned int d = 0; d < dimension; d++) buff[d] = data[ind1 + d] - data[ind2 + d];
            for (unsigned int d = 0; d < dimension; d++) D += buff[d] * buff[d];
            D = val_P[i] / D;

            // Sum positive force
            for (unsigned int d = 0; d < dimension; d++) pos_f[ind1 + d] += D * buff[d];
        }
    }
    //});
}

I changed where some of these variables are defined, compared to the original BHtSNE version, but it still doesn't work. Waaaait, you are right -- it's the buff!!! Why on Earth is it defined as a class member?! I guess I need to allocate the buff within the PARALLEL_FOR!

dkobak commented 4 years ago

@pavlin-policar Thanks again. I got rid of the buff entirely and it works now!
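
For reference, the working version simply recomputes the per-dimension difference inline instead of staging it in the shared buff member. Roughly:

void SPTree::computeEdgeForces(unsigned int *row_P, unsigned int *col_P, double *val_P, int N, double *pos_f, unsigned int nthreads) {
    // Loop over all edges in the graph, one point per parallel iteration.
    // Each iteration writes only to its own slice of pos_f, so no locking is needed.
    PARALLEL_FOR(nthreads, N, {
        unsigned int ind1 = loop_i * dimension;
        for (unsigned int i = row_P[loop_i]; i < row_P[loop_i + 1]; i++) {
            // Compute pairwise distance and Q-value without any shared scratch buffer
            double D = 1.0;
            unsigned int ind2 = col_P[i] * dimension;
            for (unsigned int d = 0; d < dimension; d++) {
                double diff = data[ind1 + d] - data[ind2 + d];
                D += diff * diff;
            }
            D = val_P[i] / D;

            // Sum positive force, recomputing the difference instead of reusing buff
            for (unsigned int d = 0; d < dimension; d++)
                pos_f[ind1 + d] += D * (data[ind1 + d] - data[ind2 + d]);
        }
    });
}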

All my time estimates above need to be updated. On my current machine, I get:

N=5000  with nthreads=-1 -- 22s for FFT  -- 12s for BH
N=8000  with nthreads=-1 -- 25s for FFT  -- 24s for BH
N=9000  with nthreads=-1 -- 28s for FFT  -- 29s for BH
N=10000 with nthreads=-1 -- 30s for FFT  -- 36s for BH

(Multithreading works, because N=10000 with nthreads=1 takes 76s with BH.)

So I guess N=8000 could be a reasonable cutoff? I will push the multithreading code into the pending PR and update the cutoff heuristic to 8000.

At the same time, openTSNE is still a lot faster for BH! Here it is on the same data:

N=5000  with n_jobs=-1 -- 27s for FFT  -- 7s for BH
N=8000  with n_jobs=-1 -- 33s for FFT  -- 12s for BH
N=9000  with n_jobs=-1 -- 35s for FFT  -- 14s for BH
N=10000 with n_jobs=-1 -- 37s for FFT  -- 15s for BH

I have no idea how it can be so fast for BH!

pavlin-policar commented 4 years ago

I got rid of the buff entirely and it works now!

Glad to hear it! I was quite surprised to find buff there; it seemed like a weird thing to make a class member.

At the same time, openTSNE is still a lot faster for BH!

That's quite surprising, actually. Have you had a chance to check where the cutoff for openTSNE might be?

dkobak commented 4 years ago

No. I can try it later tonight from a different machine (my laptop).

Looking at the MulticoreTSNE code, it seems to be doing exactly what FIt-SNE with my multithreading is now doing, so I am not sure how it can be faster than openTSNE!

linqiaozhi commented 4 years ago

Hey @dkobak, thanks for working to multithread BH in the FIt-SNE implementation. And @pavlin-policar thanks for pointing out the shared variables.

No. I can try it later tonight from a different machine (my laptop).

Looking at the MulticoreTSNE code, it seems to be doing exactly what FIt-SNE with my multithreading is now doing, so I am not sure how it can be faster than openTSNE!

I would caution against spending too much time on it, though. In my experience, when you are trying to shave off seconds (in this case, only a 5-second difference between this implementation and Pavlin's for N=5,000), you end up making changes that are hardware/compiler dependent. I have spent many hours trying to squeeze seconds out of multithreaded code, only to realize that when I tried it on a different platform, I had made it worse (although admittedly, I am not an expert at writing parallelized code). And no one will use the BH implementation for N > 5000 anyway.

For example, one problem I found when "multithreading to gain seconds" is that when you run on a machine with, say, 32 cores, the overhead from spawning so many extra threads can make it much slower than if it were single-threaded. So we will need to benchmark this on machines with different numbers of cores to make sure that is not happening.

The problem is especially severe for the specific code you are optimizing, because the total time it runs is very short. You have to "rev up" the threads on every iteration of t-SNE and use them for a very short amount of time. This is in contrast to the nearest neighbor code, which runs once and takes a long time--so you can afford the overhead.

dkobak commented 4 years ago

Hey @linqiaozhi, I agree in principle; however, pure Python code in openTSNE should run slower than C++ code, not faster. And the difference is not that small: for N=10000 it was 15s vs 36s according to my comment above. This is surely puzzling!

@pavlin-policar I just ran openTSNE with BH vs FFT on my laptop for higher Ns and the tipping point is astonishingly around N=40000! Am I doing something wrong? I really did not expect it to be so high.

[Screenshot: openTSNE runtimes for BH vs FFT as a function of N]

pavlin-policar commented 4 years ago

@dkobak I would note that many parts of openTSNE are not written in "pure" Python, but in Cython, which generates C files that are then compiled. This is what many Python libraries like scikit-learn do as well. It enables releasing the GIL and running things in parallel, as well as disabling a lot of the type-checking that Python's dynamic typing otherwise requires.

However, I do agree that this code should still always be slower than pure C++, because even with some compiled pieces, the code constantly has to return to the Python interpreter, and the C code that Cython produces has a LOT of overhead that a hand-written C++ implementation would not have.

Another point that might explain why my BH implementation is a bit faster is that, compared to my FIt-SNE implementation, it has to return to the Python interpreter a lot less. The FFT implementation, for example, has to return to Python-land to call numpy's FFT routines. This would explain why BH is faster and the FFT relatively slower. Even so, an optimized C++ implementation should always beat my Cython implementation.

I just ran openTSNE with BH vs FFT on my laptop for higher Ns and the tipping point is astonishingly around N=40000! Am I doing something wrong? I really did not expect it to be so high.

The FFT implementation is often slowed down a bit when using lr=N/12, because the embedding grows in size faster than with lr=200, resulting in more interpolation points; iteration times vary accordingly. The BH implementation, on the other hand, is just O(N log N), so the iteration time is basically constant throughout. It might also be that the BH gradient requires a single, highly parallelizable step, while the FFT implementation is less parallelizable. I did not expect BH to be that good, though.

The problem is especially severe for the specific code you are optimizing, because the total time it runs is very short. You have to "rev up" the threads on every iteration of t-SNE and use them for a very short amount of time. This is in contrast to the nearest neighbor code, which runs once and takes a long time--so you can afford the overhead.

The FFT implementation has more routines like this, while BH has just a single parallelizable step. It would be good to compare these on a single core as well. I suspect the cutoff point would be somewhere much lower.

dkobak commented 4 years ago

It would be good to compare these on a single core as well. I suspect the cutoff point would be somewhere much lower.

Just tried it out, and yes -- on a single core the FFT is already a bit faster for N=10000:

[Screenshot: single-core openTSNE runtimes for BH vs FFT at N=10000]

So for N=10000 the parallelization speed-up is very strong for BH but rather small for FFT.

I also tried running BH on one core in FIt-SNE (1 min 46 sec), and it's still slower than BH in openTSNE, but only by a little. The difference is much larger on eight cores.

dkobak commented 4 years ago

@linqiaozhi What is important for us, though, is to decide on the cutoff for defaulting to BH vs FFT. I have now put 8000 into the PR, but re-running it on my laptop, 8000 looks too high.

N=8000: FFT 8 cores -- 28s, FFT 1 core -- 34s, BH 8 cores -- 36s, BH 1 core -- 68s.

N=5000: FFT 8 cores -- 17s, FFT 1 core -- 30s, BH 8 cores -- 17s, BH 1 core -- 32s.

N=4000: FFT 8 cores -- 13s, FFT 1 core -- 21s, BH 8 cores -- 11s, BH 1 core -- 21s.

Hmm, basically on my laptop I don't get any noticeable speed improvement using BH. What do you think? Should we always use FFT and never default to BH? Looks like which one is faster depends on the hardware...

dkobak commented 4 years ago

@linqiaozhi This was a long discussion, so I just wanted to draw your attention to my last comment:

Hmm, basically on my laptop I don't get any noticeable speed improvement using BH. What do you think? Should we always use FFT and never default to BH? Looks like which one is faster depends on the hardware...

I am inclined to take this out of my pending PR.

linqiaozhi commented 4 years ago

@dkobak Sorry for the delay.

I am inclined to take this out of my pending PR.

I would agree. Given that this implementation is called "FIt-SNE", I think most people would be confused if it actually ran Barnes-Hut under the hood. If there were a significant difference in runtimes for small N, it might make sense, but since the difference is not that big... I agree we should stick with always running FIt-SNE by default. The few seconds of improvement on some machines is not worth the confusion, I don't think.

Thanks for testing it so thoroughly, man.

dkobak commented 4 years ago

OK. I took it out of the PR. Am closing this issue.