KlugerLab / FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Default learning rate #88

Closed dkobak closed 4 years ago

dkobak commented 4 years ago

Embedding a large dataset (say n = 1 million) with FIt-SNE using default parameters will yield a horrible result. Now that we understand this pretty well, and now that there are published papers suggesting a very easy fix, why don't we change the default learning rate? What's the benefit of keeping it set to 200, as in the original t-SNE implementation?

My suggestion: if n >= 10000 and the learning rate is not explicitly set, the wrapper sets it to n/12. The cutoff could be smaller than 10000, but in my experience smaller datasets work fine with learning rate 200, and 10000 is a nice round number.
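In wrapper pseudocode, this would amount to something like the following (a hypothetical helper, not the actual fast_tsne API):

def resolve_learning_rate(n_samples, learning_rate=None):
    # Hypothetical helper illustrating the proposed default;
    # actual parameter names in the wrappers may differ.
    if learning_rate is not None:
        return learning_rate       # an explicitly set value always wins
    if n_samples >= 10000:
        return n_samples / 12      # heuristic from Belkina et al.
    return 200                     # original t-SNE default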

@pavlin-policar I suggest adopting the same convention in openTSNE. What do you guys think?

linqiaozhi commented 4 years ago

This is a good idea, and I think we should do it. Since it changes the default behavior, I think this update should be a new "release" (i.e. v1.2.0), so that if papers say they used v1.1 it will be clear they used the old default.

We should think about whether there are any other lessons from your Art of t-SNE paper that should be incorporated into this new version. Should we include a warning about using random initialization as well? That is, if someone is embedding a dataset with >100,000 points (or some other threshold) but is using the default random initialization, perhaps we could output a warning suggesting PCA initialization?

dkobak commented 4 years ago

It's a good idea but to be honest, I would make PCA initialization the default behaviour for any sample size. Especially given our comment to Becht et al. (It is default in openTSNE, by the way).

Given that one only needs two leading PCs for the initialization (and given that FIt-SNE only supports dense matrices, so one does not need to bother with properly supporting sparse ones), I would subtract the mean and use scipy.sparse.linalg.svds to extract two components. And something similar for the Matlab/R wrappers.
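For the Python wrapper, a minimal sketch of what I have in mind (assuming a dense input X and rescaling so that PC1 has the usual small std of 0.0001):

import numpy as np
from scipy.sparse.linalg import svds

def pca_initialization(X):
    # Center the data and take the two leading singular triplets via ARPACK
    Xc = X - np.mean(X, axis=0)
    U, S, Vt = svds(Xc, k=2)
    # svds returns singular values in ascending order, so reorder them
    order = np.argsort(S)[::-1]
    Y = U[:, order] * S[order]
    # Rescale so that PC1 has standard deviation 0.0001
    return Y / np.std(Y[:, 0]) * 0.0001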

pavlin-policar commented 4 years ago

I think that's a good idea. Smart defaults are always nice. However, I'll probably change the API so that, in addition to a numeric learning rate, it accepts learning_rate="auto", and that is what will trigger this change. It would be strange if the learning rate changed automatically even though the parameter value was the default 200.

dkobak commented 4 years ago

Yes, sure, any given numeric value for the learning rate should be used as is. But what's the benefit of the "auto" option compared to learning_rate=None by default (and setting it to n/12 when it's None)?

There is one minor caveat here: such a high learning rate can lead to a slowdown in the interpolation. E.g. for MNIST, FIt-SNE with learning_rate=1000 converges quite a bit faster than with learning_rate=70000/12, because the latter learning rate makes the embedding expand more within the 1000 iterations, and this slows down the repulsive-force computation by quite a lot. Visually nothing really changes. (Unlike with learning_rate=200, which for MNIST is not enough.)

I still think that n/12 is the most reasonable default choice, though. For large datasets (n ~ 1 million), 1000 is clearly not enough, as demonstrated in our Nat Comms paper.

pavlin-policar commented 4 years ago

None is fine as well, but I like "auto". This is also the convention in scikit-learn for most things e.g. svd_solver in PCA. And if some other strategy ever comes up, it's easy to add support for another string.

dkobak commented 4 years ago

@linqiaozhi So what do you think about making PCA init default via wrappers?

Regarding the learning rate, I quickly measured the runtime on full MNIST (70000x50 input data).

%time Z = fast_tsne(X50, seed=42)
%time Z = fast_tsne(X50, seed=42, learning_rate=1000)
%time Z = fast_tsne(X50, seed=42, learning_rate=X.shape[0]/12)

The results (on my laptop):

Wall time: 59.6 s
Wall time: 1min 32s
Wall time: 3min 14s

The slowdown going from lr=1000 to lr=70000/12 is quite unfortunate (given that the embedding quality does not change much) and makes FIt-SNE lose to UMAP on this particular benchmark:

%time Z = umap.UMAP().fit_transform(X50)

runs in Wall time: 1min 8s.

Of course, t-SNE with exaggeration ~4 produces an embedding much more similar to UMAP, and

%time Z = fast_tsne(X50, seed=42, learning_rate=X.shape[0]/12, late_exag_coeff=4, start_late_exag_iter=250)

runs in Wall time: 57.7 s because the embedding does not expand that much (despite high learning rate). But this isn't really t-SNE anymore.

I still think that lr = n/12 is the most reasonable default we can use (there is literature to cite for this choice, namely Belkina et al. plus my paper), but I wanted to voice this concern. One could use some other factor, e.g. lr = n/70, to make it exactly 1000 on MNIST, but that is hard to motivate.

linqiaozhi commented 4 years ago

@dkobak, thanks for running the benchmark, that is really good to know. I expect that actually, the embedding converges much quicker for this small dataset, and hence you don't need max_iters=1000 when the learning rate is increased. Unfortunately, we don't have good stopping criteria though.

So for MNIST, increasing the learning rate is not getting us anything, but makes us 3x slower, which is not great. It seems that we really would want to increase the learning rate for datasets that are much larger. What's the smallest dataset where you have noticed that increasing the learning rate makes a difference? Maybe the threshold should be something like 250,000?

I like the idea of making PCA initialization default. The problem is that R does not have a good randomized PCA package. rpca is what I have used in the past, but it's not parallelized, and really slow. I have a fast version, but it's not very portable. Maybe the ARPACK implementations are better; I can play with those a bit.

Given that one only needs two leading PCs for the initialization (and given that FIt-SNE only supports dense matrices, so one does not need to bother with properly supporting sparse ones), I would subtract the mean and use scipy.sparse.linalg.svds to extract two components. And something similar for the Matlab/R wrappers.

Since we only support full matrices, did you mean sklearn.decomposition.PCA and not scipy.sparse.linalg.svds?

dkobak commented 4 years ago

What's the smallest dataset where you have noticed that increasing the learning rate makes a difference? Maybe the threshold should be something like 250,000?

Well, it does actually make a lot of difference for MNIST. Look at the section "Changing the random seed used for initialization" in https://github.com/KlugerLab/FIt-SNE/blob/master/examples/test.ipynb. With the default learning rate (200), some digits are split into two parts. With learning rate set to 1000 this does not happen.

Since we only support full matrices, did you mean sklearn.decomposition.PCA and not scipy.sparse.linalg.svds?

I did mean scipy.sparse.linalg.svds. It uses ARPACK and can handle sparse matrices, but it works fine with dense matrices too. It's the same as sklearn.decomposition.PCA with svd_solver='arpack' (as per https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), so one can use that too.
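To illustrate the equivalence, a toy check (not part of any wrapper):

import numpy as np
from scipy.sparse.linalg import svds
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(10000, 50)
Xc = X - X.mean(axis=0)

U, S, Vt = svds(Xc, k=2)    # ARPACK on the centered (dense) matrix
Y_svds = U * S              # projection onto the two leading PCs
Y_pca = PCA(n_components=2, svd_solver='arpack').fit_transform(X)

# Up to column order and per-column sign flips the two projections agree:
# np.allclose(np.abs(Y_svds[:, ::-1]), np.abs(Y_pca))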

I don't think one needs a randomized solver to extract 2 leading PCs, unless perhaps the dimensionality is huge. But if the dimensionality is huge, then Annoy will take forever anyway. Usually I reduce my data with PCA down to ~50 dimensions and then call FIt-SNE. I can check, but I think even for n = 1 million and p = 50, ARPACK should be fast enough.

I think ARPACK will cover most use cases, and anybody who needs something more sophisticated can provide a custom initialization. Maybe the wrappers can complain if the variance of the custom init is >1 or something... (Or even scale it down to 0.0001 by default? Using an extra wrapper parameter "scale_initialization=True" set by default to True?)

dkobak commented 4 years ago

I expect that actually, the embedding converges much quicker for this small dataset, and hence you don't need max_iters=1000 when the learning rate is increased. Unfortunately, we don't have good stopping criteria though.

Hmm, maybe actually it's a good time to introduce some.

Here is what the convergence looks like: [figure: learning_rates]

With lr=70000/12 we reach the end loss of lr=1000 at iteration ~600. This is very fast; the slowdown happens afterwards. So your intuition was exactly correct. It's just not very clear to me what the stopping criterion should be... KL is not really reaching its plateau even with 1000 iterations.

dkobak commented 4 years ago

[figure: convergence]

dkobak commented 4 years ago

Can one perhaps stop the iterations whenever the embedding becomes larger than some threshold? E.g. if max(abs(Y))>60 or something like that (but not more than 1000 iterations). Of course only do that if max_iter='auto', with any numeric value of max_iter overriding this stopping criterion.
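As a rough sketch (the threshold values are just the ones floated here, nothing established):

import numpy as np

def should_stop(Y, iteration, max_iter='auto', span_threshold=60, hard_cap=1000):
    # Hypothetical rule: a numeric max_iter overrides the heuristic entirely
    if max_iter != 'auto':
        return iteration >= max_iter
    # Otherwise stop once the embedding span exceeds the threshold,
    # or at hard_cap iterations at the latest
    return iteration >= hard_cap or np.max(np.abs(Y)) > span_threshold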

I've never thought about it before. Does it make any sense?

I briefly checked, and my non-exaggerated embeddings of the Cao et al. dataset (n = 2 million) have max(abs(Y)) of around 110, which is very similar to MNIST here. I don't think I ever looked at how much they change during the last few hundred iterations; I can look into it. Do any of you have practical experience with this?

linqiaozhi commented 4 years ago

With the default learning rate (200), some digits are split into two parts. With learning rate set to 1000 this does not happen.

@dkobak That's a great example -- I did not realize it could have that effect even for smaller datasets.

Can one perhaps stop the iterations whenever the embedding becomes larger than some threshold?

It's an interesting thought. I would have intuitively expected that the size of the embedding is dataset dependent. Or at least, dependent on the number of points. My concern is that for large datasets, there might be some very fine detail which is still "being worked out" later on in the optimization.

The example you just posted though is striking. The two embeddings look exactly the same. The fine white patterns inside each cluster look the same. Even the points between the clusters look the same. If you were to normalize them to be the same size, they would be essentially identical.

Here's another idea that occurs to me, based on that observation: at every 50 iterations, we check to see if the "normalized embedding" has changed from 50 iterations ago. That is, we normalize the embedding to be of norm 1, and then compute its L2 distance from the normalized embedding that we stored from 50 iterations ago. It would be interesting to plot that distance (we could check it by just outputting the embedding at every 50 iterations). I imagine it will plateau, and after it plateaus, we have "converged." It's a way to check if the embedding is changing (instead of just expanding).
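In code, the check would be roughly this (a sketch; each snapshot normalized to unit norm):

import numpy as np

def normalized_embedding_change(Y_prev, Y_curr):
    # L2 distance between the current embedding and the one stored
    # 50 iterations ago, after normalizing each to norm 1
    A = Y_prev / np.linalg.norm(Y_prev)
    B = Y_curr / np.linalg.norm(Y_curr)
    return np.linalg.norm(A - B)

We would evaluate this every 50 iterations and declare convergence once the value plateaus.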

Does that make sense?

dkobak commented 4 years ago

It's a great idea. I really like it. Luckily it's very easy to implement in openTSNE thanks to the wonderful callback functionality. I did it just now, and found out two weird things along the way.

Here are the KLs with FIt-SNE vs openTSNE:

[figure: learning_rates]

Weird thing number one is that KL behaves differently during the early exaggeration phase and decreases much faster in openTSNE. @pavlin-policar do you have any idea why this could be?

Weird thing number two is that openTSNE is slower than FIt-SNE for the default learning rate but actually becomes faster for lr=n/12. This is because it does not slow down as much as FIt-SNE in the last iterations: with FIt-SNE the last 50 iterations take 30 seconds, while in openTSNE they take only 14 seconds. The max(Y) after 1000 iterations is 102 in openTSNE and 114 in FIt-SNE; FIt-SNE passes 102 around iteration 850, where 50 iterations also take around 14 seconds. It looks like FIt-SNE expands faster and hence becomes slower. Why could this be?

Anyway, here is what the root-mean-squared delta and root-max-squared delta look like after normalizing by max(Y):

[figure: deltas]

I don't really see a plateau, but I guess one could choose some cutoff here?..

dkobak commented 4 years ago

A simple alternative would be simply to stop at iter=500 by default :-) Or 750. I have a feeling that 1000 iterations are never needed. But I should probably look at Cao et al. data with n=2mln and plot all these things...

pavlin-policar commented 4 years ago

Very interesting discussion!

Can one perhaps stop the iterations whenever the embedding becomes larger than some threshold?

I also always thought that the span of the embedding is largely arbitrary. I would expect it to be larger for larger datasets.

I made a couple of animations a long time ago to see exactly what t-SNE was doing, and I found that after the first couple hundred iterations the major clusters were formed. Then in the later iterations there weren't that many obvious changes, but some of the larger clusters sort of drifted along the edge of the embedding. I don't have any of them handy right now since I changed machines a while ago, but I think I even have a notebook showing how to make such an animation. I had noticed that the min_grad_norm stopping criterion that I took from scikit-learn was never actually triggered, and I wanted to know why. The thing is, if there's a large cluster slowly drifting around the embedding, the gradients aren't going to be tiny and the L2 norms are going to be different. I never really went into it very much, but it was interesting at the time.

Looking at how the neighbors of the embedding points are changing would be a better stopping criterion IMO, but that's expensive to calculate and I'm not sure it's possible to extract this information from the intermediate results we get during optimization. Intuitively, if some number of neighbors is stable in the embedding, it's probably a safe bet that the optimization has finished.
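Something like the following, which is a rough sketch using scikit-learn and is clearly too expensive to run at every iteration on large data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_stability(Y_prev, Y_curr, k=15):
    # Fraction of each point's k nearest embedding neighbors that is
    # preserved between two snapshots of the optimization
    nn_prev = NearestNeighbors(n_neighbors=k).fit(Y_prev).kneighbors(return_distance=False)
    nn_curr = NearestNeighbors(n_neighbors=k).fit(Y_curr).kneighbors(return_distance=False)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_prev, nn_curr)]
    return float(np.mean(overlap))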

Weird thing number one is that KL behaves differently during the early exaggeration phase and decreases much faster in openTSNE.

So this is strange indeed, I can think of two things that are different, but I would be surprised if they had such a large effect. FIt-SNE does some rounding up for the number of interpolation boxes because, apparently, FFTW handles those better. I do no such thing. I don't really see how that would make the KL divergence decrease slower though. The other thing is that my optimizer always adds a min_gain momentum term, while FIt-SNE's optimizer only adds it if the gradient is too small. It would be surprising if this had such a major effect.

I've actually been wanting to play with the optimizers for a while. The choice of optimizer here is totally arbitrary and it's something that I've only ever seen t-SNE use. Maybe trying something like rmsprop or adam might result in faster convergence.

The FIt-SNE lr=200 curve seems very strange to me during the EE phase: the KL doesn't improve for 100 iterations?

dkobak commented 4 years ago

The FIt-SNE lr=200 curve seems very strange to me during the EE phase: the KL doesn't improve for 100 iterations?

This is exactly what was investigated at length in the Belkina et al. paper: https://www.nature.com/articles/s41467-019-13055-y. This is what always happens for large n and a small learning rate, and it is the main reason why one should increase the learning rate. See e.g. their Fig. 4a.

It would be surprising if this had such a major effect.

A wild guess: can it be that during the early exaggeration phase openTSNE still has factor of 4 in the gradient equation? If you remember, it was removed from openTSNE for consistency with FIt-SNE/BHt-SNE/etc, but what if it still persists during the EE phase?

I've actually been wanting to play with the optimizers for a while. The choice of optimizer here is totally arbitrary and it's something that I've only ever seen t-SNE use. Maybe trying something like rmsprop or adam might result in faster convergence.

James Melville (@jlmelville) played with this for quite some time, but AFAIK only for exact t-SNE: https://jlmelville.github.io/smallvis/sgd.html, https://jlmelville.github.io/smallvis/sgd.html#conclusions, https://jlmelville.github.io/smallvis/cg.html, https://jlmelville.github.io/smallvis/specd.html, https://jlmelville.github.io/smallvis/opt.html. But that's an aside in this discussion :-)

dkobak commented 4 years ago

I don't really see a plateau ...

Well, that's because I chose log scale on the y axis :) Here is without log scale:

[figure: deltas-nolog]

pavlin-policar commented 4 years ago

This is exactly what was investigated at length in the Belkina et al. paper.

Oh yeah, I completely forgot about that!

Can it be that during the early exaggeration phase openTSNE still has factor of 4 in the gradient equation?

No, this was entirely removed. The EE phase uses the exact same code as the regular phase, so even if it wasn't properly removed, you'd see effects during the regular phase as well.

James Melville played with it for quite some time...

I've seen this before, but these are all very small datasets. I doubt even the learning rate trick would have any noticeable effect here, even though we know it's useful.

dkobak commented 4 years ago

No, this was entirely removed. The EE phase uses the exact same code as the regular phase, so even if it wasn't properly removed, you'd see effects during the regular phase as well.

I see. It's just that I noticed that openTSNE with lr=200 decreases the KL during the EE phase almost as fast as (but a little slower than) FIt-SNE with lr=1000, and I thought the 200*4=800 explanation would fit perfectly...

pavlin-policar commented 4 years ago

One more possible reason: in my code I apply the KL correction when reporting the error so that the KL divergence is reported as though the P matrix was not exaggerated (here). Does FIt-SNE do this as well? It would make sense since the two losses are different from one another only during the EE phase.
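For reference, the algebra behind the correction (treating the exaggerated affinities as alpha*p_ij, without renormalization): the raw value computed during EE is

sum_ij alpha*p_ij * log(alpha*p_ij / q_ij) = alpha*KL(P||Q) + alpha*log(alpha),

since sum_ij p_ij = 1, so the unexaggerated divergence can be recovered as KL(P||Q) = raw/alpha - log(alpha).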

dkobak commented 4 years ago

Yes, FIt-SNE does the same (following your example). Otherwise the values would not be even close.

To me it really looks like openTSNE and FIt-SNE are starting with the same KL in the beginning but then KL decreases much faster in openTSNE -- but only during the EE phase! If it's not the learning rate then it must be something related...

Perhaps the exaggeration itself is implemented differently and there is a scaling factor there?

dkobak commented 4 years ago

@pavlin-policar I figured it out. The random initialization has std 0.0001. When I use PCA initialization, I scale it to have std of PC1 equal to 0.0001. I did not use PCA init in the experiments above, but of course openTSNE does it by default and it turns out that you scale it by 0.01: https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/initialization.py#L61. You do the same for random: https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/initialization.py#L29. It might have been std/var confusion (?).

I tried it in FIt-SNE and this makes the difference go away! Wow. Somehow with your init it converges faster during the EE. Who would have thought.

However, the canonical implementation did use 0.0001: https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp#L150.

It seems that at least for MNIST your choice works a little better, but with the learning rate set to n/12 it does not matter too much and I have no idea how it works for other datasets. Do you want to stick to your 0.01 choice or do you want to change it to the "canonical" 0.0001?

pavlin-policar commented 4 years ago

Wow @dkobak, that's amazing. Great job tracking this down! Yes, this was certainly due to std/var confusion. Well now, that's interesting. It would be good to determine if that's just a fluke on MNIST, or if sd 0.0001 is needlessly small. Even if the high learning rate evens it out on MNIST, we should try this for larger datasets.

dkobak commented 4 years ago

Wow @dkobak, that's amazing. Great job tracking this down! Yes, this was certainly due to std/var confusion. Well now, that's interesting. It would be good to determine if that's just a fluke on MNIST, or if sd 0.0001 is needlessly small. Even if the high learning rate evens it out on MNIST, we should try this for larger datasets.

I agree it's a very curious effect and is worth investigating. However, for the time being my suggestion would be for you to change it to 0.0001 to follow the defaults in other implementations and perhaps to add initialization_std=0.0001 parameter (even though I know you don't like adding more parameters :) so that we could easily experiment with it.

In the meantime, I started looking at the Cao dataset, but unfortunately it's too large for openTSNE on my 32 GB RAM machine, and I was too lazy to run it on the server yet. As an aside, it'd be really great to add Annoy support to openTSNE...

Anyway, I will run it and report more details later, but for now here's a comparison of 1000 iterations with 500 iterations. Notice how the whole embedding is not larger than what we get with MNIST, even though the sample size is enormous.
[figure: convergence-cao]
And with some alpha:
[figure: convergence-cao-alpha]

I will compute the "deltas" as above for MNIST and report back.

linqiaozhi commented 4 years ago

Wow @dkobak, that's amazing. Great job tracking this down!

Agreed, that's a great catch.

Back to the stopping question though:

Anyway, here is how the root-mean-squared-delta and root-max-squared-delta look like after normalizing by max(Y):

Well, that's because I chose log scale on the y axis :) Here is without log scale:

Thanks for trying it out! When using the linear scale, it does look like there is an inflection point. Perhaps something like this really could work--I guess it depends on whether or not something similar is observed with the Cao data you just ran.

Notice how the whole embedding is not larger than what we get with MNIST even though the sample size is enormous.

That really is remarkable. It makes your idea of stopping when the embedding exceeds a certain threshold seem like it might work. That being said, I have been worried about the "fine detail" that might be lost if we stop prematurely for really large datasets. For example, in your second pair of embeddings (with alpha), you can see some small differences between 500 and 1000 iterations in the "textures" of each island. They really are very small -- and perhaps not worth the extra time -- but they are perceptible, unlike in the MNIST example.

dkobak commented 4 years ago

By "alpha" I just meant some transparency; it's the same embeddings as above. But you are right: I also think that after 1000 iterations it looks a bit more converged than after 500. Some small specks got sharper. But of course it's possible that running it for another 1000 iterations would make it even more converged... Which is why some meaningful stopping criteria would be great to have.

dkobak commented 4 years ago

I have partial results for the Cao dataset (until iteration 700). First of all, based on visual inspection, the embedding after iter=700 is very very similar to iter=1000, and to my eye looks "as converged" as the iter=1000 (i.e. noticeably different from iter=500 shown above). The span max(Y) is just over 80.

Here are the root-mean-square delta and max(abs(delta)) for Cao and for MNIST (the deltas are computed after normalizing the embedding at each step):
[figure: deltas-cao-mnist]

Based on these observations, I think that

- stopping after 750 iters instead of 1000,
- stopping whenever max(Y) exceeds 80, or at 1000 iterations at the latest, or
- stopping whenever the mean delta drops below 0.005 or 0.004 or 0.0042 (and possibly the max delta below 0.1), or at 1000 iterations at the latest

could all be reasonable choices. The first might feel a bit like cheating (even though it isn't). The second is good because it is so simple. The third is arguably the most principled.

It would be good to look at some other datasets though. I can provide a complete openTSNE snippet for running this on any data X. Would either of you want to throw in some of your favorite datasets?

Here are the spans, by the way: [figure: spans-cao-mnist]

pavlin-policar commented 4 years ago

Based on these observations, I think that stopping after 750 iters instead of 1000 ...

It seems totally reasonable to change the default number of iterations to 750 instead of 1000. I don't think there's any rationale for using 1000 other than "it's a nice number and things seem converged by then".

stopping whenever max(Y) exceeds 80, or at 1000 iterations at the latest

I don't like this. This dependence seems completely arbitrary and one out-of-place point could mess everything up (not that that's ever happened).

stopping whenever the mean delta drops below 0.005 or 0.004 or 0.0042 (and possibly the max delta below 0.1), or at 1000 iterations at the latest

I like this a lot more. But maybe looking at the max delta instead of the mean would be better since if we have a large data set and a small cluster still being formed in the late stages of optimization, the mean will hide that. Or maybe mean(top 5% largest deltas) since the max alone could be very sensitive.

I can provide a complete openTSNE snippet for running this on any data X.

That would be great.

It's interesting that the spans all seem to be in the same range. This isn't what I would have expected. One thought about the difference in initialization from yesterday: the initialization with std 0.0001 expanded faster than the one with 0.01. Could the reason be that the repulsive forces are so strong when the points are so close together that they feed into the momentum terms enough to push the points much further out, even in the later stages of optimization? Setting the momentum terms to 0 might give some indication of that.

dkobak commented 4 years ago

One thought about the difference in initialization from yesterday: the initialization with std 0.0001 expanded faster than the one with 0.01. Could the reason be that the repulsive forces are so strong when the points are so close together that they feed into the momentum terms enough to push the points much further out, even in the later stages of optimization? Setting the momentum terms to 0 might give some indication of that.

Interesting idea. Another difference was that convergence during EE (especially with low learning rates) was faster with 0.01 init. Maybe repulsion with 0.0001 was so strong that it prevented the convergence... Maybe other values of init std (like 1?) could be an even better choice. But this should be investigated! I would love to see some experiments, if you do any! Until then, I'd still suggest to revert to the traditional default :)

It seems totally reasonable to change the default number of iterations to 750 instead of 1000. I don't think there's any rationale for using 1000 other than "it's a nice number and things seem converged by then".

I agree in principle, but I think in practice it might look like we are trying to get a speed increase by some arbitrary hacks. For that reason I'd prefer to use George's idea with having some threshold on deltas. This could be used as a stopping criterion for any kNN embedding method, not only t-SNE, but e.g. UMAP as well.

Anyway - please find below the code snippet and the output for the MNIST dataset. The code sets the learning rate to n/12, runs the optimization for 1000 iterations, and plots various things. I now plot the mean, the 99th percentile, and the max of the deltas. I agree that the max might be too sensitive.

import numpy as np
import pylab as plt
import seaborn as sns
sns.set_style('ticks')
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger

def benchmark_convergence(X, max_iter=1000):
    # Store the embedding and the KL divergence every 50 iterations
    Zs = []
    KLs = []
    def mycallback(iteration, error, embedding):
        Zs.append(embedding.copy())
        KLs.append(error)
    # n_iter counts only the iterations after the 250 early exaggeration ones
    Z = TSNE(callbacks=[ErrorLogger(), mycallback], n_jobs=-1,
             learning_rate=X.shape[0]/12, n_iter=max_iter-250).fit(X)

    iters = np.arange(50, max_iter+1, 50)
    spans = [np.max(np.abs(Z)) for Z in Zs]

    # Per-point displacement between consecutive snapshots,
    # after normalizing each snapshot by its span max(abs(Y))
    deltas = np.zeros((len(Zs)-1, 3))
    for i in range(len(Zs)-1):
        A = Zs[i]   / np.max(np.abs(Zs[i]))
        B = Zs[i+1] / np.max(np.abs(Zs[i+1]))
        D = np.sqrt(np.sum((A-B)**2, axis=1))
        deltas[i,0] = np.mean(D)
        deltas[i,1] = np.percentile(D, 99)
        deltas[i,2] = np.max(D)

    plt.figure(figsize=(8,6))
    plt.subplot(221)
    plt.plot(iters, KLs, 'k.-')
    plt.ylabel('KL')
    plt.xlabel('Gradient descent iteration')

    plt.subplot(222)
    plt.plot(iters, spans, 'k.-')
    plt.ylabel('max(abs(Y))')
    plt.xlabel('Gradient descent iteration')

    plt.subplot(223)
    plt.plot(iters[1:], deltas, '.-')
    plt.legend(['Mean', '99 percentile', 'Max'])
    plt.ylim([0, .4])
    plt.ylabel('Delta after normalizing by max(Y)')
    plt.xlabel('Gradient descent iteration')

    # Same plot, zoomed in on the y axis
    plt.subplot(224)
    plt.plot(iters[1:], deltas, '.-')
    plt.legend(['Mean', '99 percentile', 'Max'])
    plt.ylabel('Delta after normalizing by max(Y)')
    plt.xlabel('Gradient descent iteration')
    plt.ylim([0, .02])

    sns.despine()
    plt.tight_layout()
    return Zs
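For reference, the figure below comes from running this on the PCA-reduced MNIST data used earlier, i.e. something like Zs = benchmark_convergence(X50), where X50 is the 70000x50 matrix.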

[figure: mnist-convergence]

Looking forward to seeing some other datasets. Ideally something much larger and possibly also something much smaller than MNIST.

dkobak commented 4 years ago

Here is the same thing for the Tasic dataset that I heavily use in the Nat Comms paper (N=23k).

[figure: tasic-convergence]

The mean delta drops below 0.005 around iter=800. The 99 percentile delta drops below 0.01 around iter=1000.

I find it curious how the upper-right plot shows that in terms of the actual Y positions, the embedding is nowhere near convergence at iter=1000.

dkobak commented 4 years ago

It seems totally reasonable to change the default number of iterations to 750 instead of 1000. I don't think there's any rationale for using 1000 other than "it's a nice number and things seem converged by then".

I haven't done any more experiments on this, but thinking it over, I slowly gravitate towards simply replacing 1000 iterations with 750 iterations by default. All the examples above (Tasic, MNIST, Cao et al. with 2 million points) look very much converged after 750 iterations. My feeling is that 500 iterations would be too few; 1000 are arguably too many and lead to a strong slowdown for MNIST and similar datasets. So 750 looks like a good compromise. An advantage is that it is super simple and does not require any specific convergence thresholds.

Two additional considerations: (1) the stopping criterion in the Belkina et al. paper is that the KL divergence decreases by less than 0.05% from one iteration to the next; if I understand their Figure S4 correctly, this resulted in ~450 iterations for their data. (2) UMAP's default n_epochs parameter is 500 for small datasets and 200 for large datasets (as per https://umap-learn.readthedocs.io/en/latest/api.html), so it's not like they are using 1000 iterations. People naturally want to compare runtimes between UMAP and FIt-SNE, but there is no reason to stick to 1000 to make it fair.
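For concreteness, criterion (1) would amount to something like this (a sketch; 0.0005 is just the 0.05% figure quoted above):

def kl_converged(kl_prev, kl_curr, rel_tol=0.0005):
    # Belkina et al.-style rule: stop once the KL divergence decreases
    # by less than 0.05% between successive evaluations
    return (kl_prev - kl_curr) / kl_prev < rel_tol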

What do you think?

pavlin-policar commented 4 years ago

I actually like the embedding-delta convergence criterion quite a bit. It makes intuitive sense: when the points stop moving enough that we don't visually notice the change, we're done. Maybe another max_iter="auto"-style parameter would be useful, since we're already introducing a lot of "autos" :) The 0.05% change in KL is less intuitive to me.

I think that if you wanted a truly fair comparison in runtime between the two methods, you'd need to apply the same stopping criteria to both methods. I suspect that 500 and 200 iterations in UMAP were determined in much the same way as 1000 for t-SNE, so on no real basis other than it seemed to work well.

dkobak commented 4 years ago

I agree. But what else would you want to see / try out, in order to determine an appropriate default convergence criterion?

One easy option is to reduce the number of default iterations down to 750 because it only takes a few minutes to implement and I don't see any disadvantages with that anymore, and then take some more time to explore the convergence criteria.

dkobak commented 4 years ago

@pavlin-policar George and I just had a quick discussion about this, and we think it makes sense to go the easy way and change the default number of iterations to 750. We like the "embedding delta" convergence criterion in principle, but feel that it would need some further research to establish what a good metric and a good cutoff would be (across many datasets). This might be an interesting project in itself, but for now it's better to go with the safe option and simply change the default number of iterations.

So the suggestion for the default values would be: learning rate = max(200, n/12), n_iter = 250 for early exaggeration + 500 after it's turned off (750 in total).
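In wrapper terms, and using hypothetical "auto" sentinels as discussed above, that would be something like:

def resolve_defaults(n_samples, learning_rate='auto', n_iter='auto'):
    # Sketch of the agreed defaults; actual keyword names may differ per wrapper
    if learning_rate == 'auto':
        learning_rate = max(200, n_samples / 12)
    if n_iter == 'auto':
        n_iter = 750   # 250 early-exaggeration iterations + 500 regular ones
    return learning_rate, n_iter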

pavlin-policar commented 4 years ago

Yes, I totally agree. I'll update the default number of iterations so we'll stay consistent. For the learning rate, I'll add an "auto" option for max(200, n/12) and set it as the default, as previously discussed.

We like the "embedding delta" convergence criterion in principle, but feel that it would need some further research to establish what a good metric and a good cutoff would be (across many datasets).

I've got a small collection of single-cell datasets that I've used, so I could run this to check if needed. I've still got your code snippet that I've been meaning to run on these for a while now.

dkobak commented 4 years ago

Sounds good.

Do you want to set the std of random/PCA init to 0.0001 as in other implementations or keep it at 0.01?

I've got a small collection of single-cell datasets that I've used, so I could run this to check if needed. I've still got your code snippet that I've been meaning to run on these for a while now.

That would be interesting, but my code snippet should be supplemented with actually plotting the embeddings at different iterations, to be able to say when the embedding "looks" converged enough.

pavlin-policar commented 4 years ago

Do you want to set the std of random/PCA init to 0.0001 as in other implementations or keep it at 0.01?

Yes, I'll make a release implementing this as well when I return home from a conference.

That would be interesting, but my code snippet should be supplemented with actually plotting the embeddings at different iterations, to be able to say when the embedding "looks" converged enough.

I like this criterion because it's very intuitive and, I think, correlates closely with what we would say visually. When the changes in point positions are small, things have essentially stopped moving, which is also when I would visually say it has converged. But yeah, looking at the results is easy enough.

linqiaozhi commented 4 years ago

The issue originally raised here has been resolved in #93, but I think the discussion could very well turn into a smarter stopping rule for t-SNE. I will close this now, but we should keep it in mind as something to research further.

If you think the adaptive stopping criterion should stay open as an issue, perhaps we could open a new one? This issue's name is "Default learning rate", which makes it hard to find.