KlugerLab / FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Changing the defaults in the Python wrapper #93

Closed dkobak closed 4 years ago

dkobak commented 4 years ago
  1. Learning rate is set to max(200, N/early_exag_coeff) by default.
  2. Iteration number is set to 750 by default (250+500).
  3. Initialization is set to PCA (via ARPACK) by default.
  4. N-body algorithm is set to FFT for N>=8000 and to BH for N<8000 by default. UPDATE: I TOOK THIS OUT!
  5. Late exaggeration start is set to the early exagg end by default (if late exagg coeff is provided).
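
For reference, here is a minimal sketch of how defaults 1-3 above could be resolved on the Python side (function and argument names are illustrative only, not the exact wrapper code):

```python
def resolve_defaults(N, learning_rate="auto", max_iter="auto",
                     initialization="auto", early_exag_coeff=12.0):
    # 1. learning rate: max(200, N / early exaggeration coefficient)
    if learning_rate == "auto":
        learning_rate = max(200, N / early_exag_coeff)
    # 2. iteration number: 250 early-exaggeration iterations + 500 regular ones
    if max_iter == "auto":
        max_iter = 750
    # 3. initialization: PCA (computed via ARPACK) unless the user passes an array
    if initialization == "auto":
        initialization = "pca"
    return learning_rate, max_iter, initialization

resolve_defaults(70000)  # -> (5833.33..., 750, 'pca')
```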

I updated the example notebook too.

This fixes issues #88 #89 #90.

UPDATE: also implements multithreaded Barnes-Hut!

linqiaozhi commented 4 years ago

For the learning rate documentation:

Default 'auto'; it sets learning rate to N/12 where N is the sample size, or to 200 if N/12 < 200.

But the code reads:

    if learning_rate == 'auto':
        learning_rate = np.max((200, X.shape[0]/early_exag_coeff))

By default, early_exag_coeff is indeed 12. But if it's chosen to be a different number, do we want the automatic learning rate to change as well? I suppose if you have higher exaggeration, then you are probably making bigger steps, so it probably does actually make sense to decrease the learning rate.

Just wanted to confirm with you. If so, we should probably change the documentation to reflect this. I don't feel strongly either way.

dkobak commented 4 years ago

By default, early_exag_coeff is indeed 12. But if it's chosen to be a different number, do we want the automatic learning rate to change as well?

Yes, I think we do. This heuristic of learning_rate = N / early_exag_coeff is from Belkina et al., and they based it on your work with Stefan, where you have a bound learning_rate * early_exag_coeff < N for the algorithm not to diverge.

Now that I spent some time experimenting with large exaggeration, I can also confirm that using large early exaggeration and not decreasing the learning rate from N/12 will easily lead to divergence.

linqiaozhi commented 4 years ago

Okay great, then we should update that in the documentation. I can make that change.

linqiaozhi commented 4 years ago

I did come across an issue with this choice of learning rate. As a test dataset, I made two well-separated 1,000-dimensional Gaussian balls with 10,000 points in each.

require(MASS);
N <- 2E4;
d <- 1000;
input_data <- rbind(mvrnorm(n = N/2, rep(0, d), diag(d)), mvrnorm(n = N/2, rep(100, d), diag(d)))

I noticed that with the new learning rate setting, the exaggeration phase was terribly slow, so I stopped the optimization at 200 iterations and looked at the embedding: image

The learning rate was too high, so it was overshooting like crazy and making the embedding large, which accordingly slows down the interpolation scheme.

When I set the learning rate to 200 (previous default), after the same number of iterations, it looks as expected:

image

When I set the exaggeration factor to be 1 (i.e. no early exaggeration), it also looks as it should.

Here's the interesting part though. When I do the same experiment in only 10 dimensions (i.e. d=10 in the snippet above), this problem also disappears. That is, the overshooting that I am observing with learning rate of N/exaggeration_factor during the exaggeration phase does not occur when the original space is only 10 dimensions.

So, it does look like we are running the risk of overshooting with this new heuristic. I don't fully understand this connection with the input dimensionality, but I wonder if it has to do with ANNOY maybe being less accurate in high dimensions. I don't typically run t-SNE in these high dimensions (as I typically do PCA first), so I don't have much experience with it.

dkobak commented 4 years ago

Hmm. I can confirm this observation. I don't have an explanation right now. I've never seen this before.

I also noticed that the final embeddings with d=10 and d=1000 (with learning_rate=200) look very different from each other: d=1000 turns out very well separated, but in the d=10 case the Gaussians almost touch. I don't know why this would be, because I think there should be zero between-cluster kNN connections in either case.

dkobak commented 4 years ago

I also noticed that the final embeddings with d=10 and d=1000 (with learning_rate=200) look very different from each other: d=1000 turns out very well separated, but in the d=10 case the Gaussians almost touch. I don't know why this would be, because I think there should be zero between-cluster kNN connections in either case.

I am really stupefied by this. I used VP trees to remove any possible influence of Annoy, used random init to remove any possible influence of PCA, used K=15 and sigma=1e+6 to remove any possible influence of perplexity calibration, and used learning_rate=200 to make sure everything converges well. Still I get this:

Screenshot from 2020-03-22 12-45-36

dkobak commented 4 years ago

I must be missing something very obvious...

tmp

linqiaozhi commented 4 years ago

I can't put my finger on what it is exactly, but I expect it to be related to the distribution of the p_ijs. In high dimensions, all points are "almost equidistant" from each other, so the distribution of the affinities will be different in high vs. low dimensions. But I don't have an explanation yet for how that results in this effect (which looks almost like an exaggeration of the affinities).

linqiaozhi commented 4 years ago

Then again, your experiment with sigma being so large means that the actual value of the non-zero p_ijs should be essentially constant. So it has to be a difference in "which neighbors" each point is getting.

dkobak commented 4 years ago

Then again, your experiment with sigma being so large means that the actual value of the non-zero p_ijs should be essentially constant. So it has to be a difference in "which neighbors" each point is getting.

Exactly... but HOW?

dkobak commented 4 years ago

So it has to be a difference in "which neighbors" each point is getting.

I think something funny happens during the symmetrization step. The affinity matrix for d=2 has up to 26 non-zero elements in a row. (I use k=15, so it's more than 15 due to symmetrization). But for d=100, the max number of non-zero elements in one row is 407. I have no idea if/how this can make a big difference. The means across rows are 17 and 27 respectively, so not hugely different.
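
For what it's worth, here is the kind of check I'm running (a sketch assuming a SciPy sparse affinity matrix P after symmetrization; the helper name is made up):

```python
import numpy as np
import scipy.sparse as sp

def row_nnz_stats(P):
    """Mean and max number of non-zero entries per row of the symmetrized affinity matrix."""
    P = sp.csr_matrix(P)
    nnz_per_row = np.diff(P.indptr)   # number of non-zeros in each row
    return nnz_per_row.mean(), nnz_per_row.max()

# Toy example: random sparse matrix with ~15 non-zeros per row, symmetrized like the kNN graph.
A = sp.random(20000, 20000, density=15 / 20000, format="csr")
P = (A + A.T) / 2
print(row_nnz_stats(P))
```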

dkobak commented 4 years ago

I quickly checked what happens with openTSNE and I could not reproduce the divergence problem.

import numpy as np
from openTSNE import TSNE

n = 20000
d = 1000
X = np.random.randn(n, d)
X[int(n/2):, :] += 100   # shift the second half by 100 to get two well-separated Gaussians
Z = TSNE(verbose=1, n_jobs=-1).fit(X)

This ran just fine and produced the expected embedding.

Here is the output:

--------------------------------------------------------------------------------
TSNE(callbacks=None, callbacks_every_iters=50, early_exaggeration=12,
     early_exaggeration_iter=250, exaggeration=None, final_momentum=0.8,
     initial_momentum=0.5, initialization='pca', ints_in_interval=1,
     learning_rate='auto', max_grad_norm=None, metric='euclidean',
     metric_params=None, min_num_intervals=50, n_components=2,
     n_interpolation_points=3, n_iter=500, n_jobs=-1,
     negative_gradient_method='fft', neighbors=None, perplexity=30,
     random_state=None, theta=0.5, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 20.60 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.76 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.76 seconds
===> Running optimization with exaggeration=12, lr=1666.6666666666667 for 250 iterations...
Iteration   50, KL divergence 5.0988, 50 iterations in 1.4479 sec
Iteration  100, KL divergence 4.7109, 50 iterations in 4.3490 sec
Iteration  150, KL divergence 4.6123, 50 iterations in 3.7709 sec
Iteration  200, KL divergence 4.5207, 50 iterations in 1.6480 sec
Iteration  250, KL divergence 4.5206, 50 iterations in 1.8959 sec
   --> Time elapsed: 13.11 seconds
===> Running optimization with exaggeration=1, lr=1666.6666666666667 for 500 iterations...
Iteration   50, KL divergence 4.2664, 50 iterations in 1.9638 sec
Iteration  100, KL divergence 4.2130, 50 iterations in 1.4563 sec
Iteration  150, KL divergence 4.1957, 50 iterations in 1.4923 sec
Iteration  200, KL divergence 4.1883, 50 iterations in 1.8221 sec
Iteration  250, KL divergence 4.1840, 50 iterations in 1.8193 sec
Iteration  300, KL divergence 4.1813, 50 iterations in 1.8282 sec
Iteration  350, KL divergence 4.1795, 50 iterations in 1.8237 sec
Iteration  400, KL divergence 4.1782, 50 iterations in 1.8112 sec
Iteration  450, KL divergence 4.1772, 50 iterations in 1.8842 sec
Iteration  500, KL divergence 4.1765, 50 iterations in 1.7964 sec
   --> Time elapsed: 17.70 seconds

On exactly the same data Z = fast_tsne(X) slows down after ~150 iterations because of the runaway points. I am using updated openTSNE and updated FIt-SNE (my local branch) so the learning rate is n/12 and all other parameters should be the same.

I have no idea why this difference would arise!

Update: I do see large gradients in openTSNE too, e.g. looking at the embedding after 250 iterations. (With d=10 it does not happen.) It's just that it does not become just as large as in FIt-SNE.

dkobak commented 4 years ago

Some additional insight: the points that are shooting off are exactly the points that have a very large number of non-zero elements in the affinity matrix (i.e. feel a large number of attractive forces). Here is the result from openTSNE after only 10 iterations (d=1000) with default params. One would think that the number of attractive forces per point is around 90, but actually the max is 2700 (!!) and there are 700 points with >500 attractive forces. Here those 700 are colored in blue; all >19000 other ("normal") points are colored in red:

tmp2

linqiaozhi commented 4 years ago

the points that are shooting off are exactly the points that have a very large number of non-zero elements in the affinity matrix (i.e. feel a large number of attractive forces).

Very interesting. That is the key observation. Their attractive force is so large that exaggerating it causes them to overshoot.

Let's consider the graph before symmetrization; this is a directed graph. Each point connects to its k nearest neighbors, i.e. there are k outgoing edges from each point. In high dimensions, we are observing that the distribution of the number of incoming connections is far from uniform. That is, there are some points that have a huge number of incoming edges.

There are obvious examples of point configurations where this is possible. For example, 5 points can be constructed in a "cross" structure: if you form a k=1 nearest neighbor graph, the point in the center will have 4 incoming edges and the others will have at most 1.

It's just not immediately obvious to me why this happens in high dimensions.

I'll think more about it tomorrow. I think we are running up against one of those interesting and non-intuitive facts about high dimensional space. It also might have to do with the fact that the data is Gaussian, and particularly, how we are scaling the variance as we increase the dimensions.
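
In the meantime, here is a small experiment along these lines (a sketch, not part of FIt-SNE; it uses scikit-learn's exact kNN): the in-degree distribution of the directed kNN graph develops a heavy tail, i.e. hubs, as the dimensionality grows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 15
for d in (10, 1000):
    X = np.random.randn(2000, d)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    # drop the self-neighbor column and count incoming edges per point
    in_degree = np.bincount(idx[:, 1:].ravel(), minlength=len(X))
    print(f"d={d}: max in-degree = {in_degree.max()}, "
          f"points with in-degree > 3k = {(in_degree > 3 * k).sum()}")
```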

dkobak commented 4 years ago

Yes. I agree that this is very interesting!

It seems that this is not so much about the n/12 heuristic though. And one does not need to have two Gaussians actually. Here are simply 1000 points from a 1000-dimensional standard Gaussian. In this case the learning rate is 200. Still, look what happens during the early exaggeration phase:

tmp

It never goes much outside the [-100, 100] range, so the FFT works reasonably fast. Still, clearly there is exactly the same phenomenon going on.

dkobak commented 4 years ago

I will also come back to this tomorrow. Just wanted to add that we should separate two questions: (1) why does this all happen and what do we learn from it; and (2) how should it affect this PR and the next release. Whereas I am very interested to keep investigating Q1, it'd be good to have some resolution to Q2 so that the PR/release could go ahead.

Given that the same thing (perhaps less drastic) can happen for smaller sample sizes with learning rate = 200, I am wondering if there is some hack that could mitigate this problem. E.g. clip the gradient norms (openTSNE allows it via the max_grad_norm parameter, which is switched off by default). Or do something with the momentum or the delta-bar-delta gains. Ideally this should not affect "normal" use cases like MNIST or RNAseq data after PCA, but would prevent these wild oscillations of kNN graph "hubs" in high dimensions.

Update: I tried it out quickly, and max_grad_norm=0.01 removes large oscillations in my 1000 points in 1000 dimensions example, but does not seem to affect the MNIST embedding with default settings (after PCA down to 50d, like I always do).
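
Concretely, that quick test was just openTSNE's existing max_grad_norm argument (a usage sketch; the clipping value itself is what's up for discussion):

```python
import numpy as np
from openTSNE import TSNE

X = np.random.randn(1000, 1000)   # 1000 points from a 1000-dimensional standard Gaussian
Z = TSNE(max_grad_norm=0.01, n_jobs=-1, verbose=1).fit(X)
```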

dkobak commented 4 years ago

Ran some further analysis this morning. Here is the max grad norm (max over all N points) during the MNIST embedding with default params. It never grows above 0.0001 and most of the time is more like 1e-5 or 1e-6.
tmp4

Regarding Q1 above, I checked that the same happens with exact t-SNE (so it's not due to any approximations; the dense NxN perplexity-calibrated affinity matrix displays the same behaviour). Second, I quickly ran UMAP on the same N=20000 and d=2, d=10, d=100 toy example as above, and observed qualitatively the same differences between the final embeddings.

tmp3

My conclusion from all this is that we discovered a very real dimensionality effect that affects all kNN embeddings. And as far as this PR is concerned, we should try to stabilize the gradient descent so that points do not overshoot like crazy.

@pavlin-policar might have some experience with max_grad_norm.

pavlin-policar commented 4 years ago

As George pointed out, the different levels of separation are due to the high dimensionality of the data. As the dimensionality of the data increases, the distances between points sampled from some fixed distribution on a unit ball all become about the same. In the picture below, I plot the distribution of pairwise distances between points sampled from two Gaussians in different dimensions. This is apparent from the ever spikier peaks in the distances. As the dimensionality increases, the separation between the two peaks also increases, so the distances are also higher. This is an inherent problem of using Euclidean metrics, which is why I generally go for something like cosine distance instead (but I generally do PCA first, which preserves Euclidean distances but not cosine distances, and I don't know the effects of that).

Code:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

fig, ax = plt.subplots(ncols=4, figsize=(16, 3))
for idx, dims in enumerate([2, 10, 100, 1000]):
    x1 = np.random.normal(+2, 1, size=(1000, dims))
    x2 = np.random.normal(-2, 1, size=(1000, dims))
    xs = np.vstack((x1, x2))
    dists = squareform(pdist(xs, metric="cosine")).ravel()
    ax[idx].hist(dists, bins=100, color="skyblue")
    ax[idx].set_title(f"{xs.shape[1]} dims")
    ax[idx].get_yaxis().set_ticks([])
```

image

I calculated the connected components on the exact KNNG: there is 1 component for 2 dims and 2 components for the other three. No surprises there. It's not clear how the dimensionality would affect the affinity matrix. The distribution of p_{ij}s seems unaffected (top row below), while the hubs get bigger. Plotting the neighbor counts for 10, 100, and 1000 dims on the log scale reveals that they actually follow a power-law distribution.

Code:

```python
import numpy as np
import matplotlib.pyplot as plt
from openTSNE import affinity

fig, ax = plt.subplots(ncols=4, nrows=2, figsize=(16, 5))
for idx, dims in enumerate([2, 10, 100, 1000]):
    x1 = np.random.normal(+2, 1, size=(1000, dims))
    x2 = np.random.normal(-2, 1, size=(1000, dims))
    xs = np.vstack((x1, x2))
    aff = affinity.PerplexityBasedNN(xs, method="exact")
    ax[0, idx].hist(aff.P.data, bins=50, color="skyblue")
    ax[1, idx].hist(np.ravel(np.sum(aff.P > 0, axis=0)), bins=50, color="skyblue")
    ax[0, idx].set_title(f"{xs.shape[1]} dims: pij")
    ax[1, idx].set_title(f"{xs.shape[1]} dims: #neighbors")
    ax[0, idx].get_yaxis().set_ticks([])
    ax[1, idx].get_yaxis().set_ticks([])
plt.tight_layout()
```

image

I don't know how all this translates into gradients, but it should explain the clean separation that @dkobak was seeing.

Regarding some points offshooting: it would be a good idea to figure out which (attractive or repulsive) forces are causing this. When I was implementing the transform functionality of t-SNE, this was a big problem driven by the attractive forces, which were huge, causing points to offshoot. There, I solved it with gradient clipping or reducing the learning rate -- both fixed the problem.

This was also a problem when not scaling the initialization properly, but I don't remember which of the forces were problematic there. When it's not properly initialized, things don't converge. So my intuition is that having the scaled initialization somehow perfectly balances the attractive and repulsive forces, and I would guess the momentum compensates for that as the embedding is scaled up. But I haven't tested this.

pavlin-policar commented 4 years ago

I would also note that I do gradient clipping on a per-point basis. This prevents individual points from offshooting, which is very problematic in the interpolation scheme.

dkobak commented 4 years ago

@pavlin-policar I posted a comment almost simultaneously with yours (some seconds before); take a look, in case you didn't see it.

As George pointed out, the different levels of separation are due to the high dimensionality of the data. As the dimensionality of the data increases, the distances between points sampled from some fixed distribution on a unit ball all become about the same.

This is true but to be honest I think it's unrelated (or at least not directly related) to what we are seeing here. In each dimensionality from d=2 to d=1000 the separation between Gaussians is so wide that there are no kNN edges between clusters. So as far as tSNE is concerned the clusters are infinitely far away, independent of d.

The distribution of p_{ij}s seems unaffected (top row below), while the hubs get bigger. Plotting 10, 100, and 1000 on the log scale reveals that they actually follow a power-law distribution.

Yeah, this seems to be the key insight.

Regarding some points offshooting: it would be a good idea to figure out which (attractive or repulsive) forces are causing this.

Everything indicates it's attractive forces. (1) Only happens during exaggeration. (2) Only happens for points that have many nonzero elements in the affinity matrix ("hubs").

When I was doing the transform functionality of t-SNE, this was a huge problem driven by the attractive forces, which were huge, causing points to offshoot. There, I solved it with gradient clipping or reducing learning rate -- both fixed the problem.

Exactly, which is why I am thinking about gradient clipping by default. The question is, what would be a good default value?

I would also note that I do gradient clipping on a per-point basis. This prevents individual points from offshooting, which is very problematic in the interpolation scheme.

You mean when doing transform? Because openTSNE does not use clipping by default for tSNE itself, I think, right?

pavlin-policar commented 4 years ago

This is true but to be honest I think it's unrelated (or at least not directly related) to what we are seeing here. In each dimensionality from d=2 to d=1000 the separation between Gaussians is so wide that there are no kNN edges between clusters. So as far as tSNE is concerned the clusters are infinitely far away, independent of d.

Well, yes, you're right. But it would explain why the clusters seem more compact. If all the points are about the same distance from each other, then which neighbors are chosen is kind of arbitrary. I would guess the KNNG is much denser in 1k dims than in 10 dims. And the p_ijs are probably more uniformly distributed for each neighbor. So, more compact. Then the repulsive forces are more concentrated and therefore stronger. Also, the level of separation is partly a visual effect. Notice that in your picture below, the span of the embedding is largest in 2d, from -50 to 50, while it only spans from -10 to 10 in the 100d embedding.

You mean when doing transform? Because openTSNE does not use clipping by default for tSNE itself, I think, right?

Right. Up until now, I have never seen points offshooting when running regular t-SNE, so it seemed unnecessary. I did encounter this once when setting the learning rate really high, but I don't think most people touch the learning rate, other than us, apparently.

dkobak commented 4 years ago

If all the points are about the same distance from each other, then which neighbors are chosen is kind of arbitrary. I would guess the KNNG is much denser in 1k dims than in 10 dims.

Right! Yes. I like this way of putting it.

Up until now, I have never seen points offshooting when running regular t-SNE, so it seemed unnecessary. I did encounter this once when setting the learning rate really high, but I don't think most people touch the learning rate, other than us, apparently.

Right. Well, as I showed in the animation above, some points do offshoot with the default learning rate in d=1000. The embedding still works fine in that case, but it does indicate the problem. And with learning rate = n/12 it would make sense to use some gradient clipping.

I guess the main question is, what would be a reasonable value?

dkobak commented 4 years ago

I guess the main question is, what would be a reasonable value?

So based on my experiments reported above I would cautiously suggest max_grad_norm=0.01 by default. It should be easy to implement in FIt-SNE. But we need to test it somehow.

dkobak commented 4 years ago

So what do you guys think? I feel this issue is now holding up releases of both openTSNE and FIt-SNE, so let's try to address it somehow.

If you think gradient clipping is a promising way forward, I can try to plot max gradient norms per iteration for some other datasets like e.g. Tasic2018 or 10x 1.3 mln. @pavlin-policar Have you ever looked at such plots for other data?

linqiaozhi commented 4 years ago

In terms of moving forward with this pull request, the most conservative course of action would be to hold off on changing the default max_iter and learning rate until we can better investigate this phenomenon and potential solutions. But given that it only seems to occur in high dimensions (higher than most t-SNE applications), and gradient clipping seems to be an existing and safe solution, I think it makes sense to go with that.

I am actually not too familiar with gradient clipping. Just to be clear, when you say "max_grad_norm=0.01", is that per point? Does it mean that the Euclidean norm of the "update" at each iteration (i.e. after multiplying the gradient by the learning rate) would be limited to a magnitude of 0.01? I.e., if it were any larger, we would rescale it to 0.01.

If you think gradient clipping is a promising way forward, I can try to plot max gradient norms per iteration for some other datasets like e.g. Tasic2018 or 10x 1.3 mln.

I think the gradient clipping should be set up so that it only "kicks in" for these pathologic examples where points are bouncing around like crazy. So, yes, it would be really helpful to see the max gradient norms (particularly during exaggeration) for some real datasets. If you have time to do it, it would give us a feel for an appropriate choice. Hopefully we'll see that it's very very small, so something like 0.01 won't kick in for any real dataset.

dkobak commented 4 years ago

In terms of moving forward with this pull request, the most conservative course of action would be to hold off on changing the default max_iter and learning rate until we can better investigate this phenomenon and potential solutions. But given that it only seems to occur in high dimensions (higher than most t-SNE applications), and gradient clipping seems to be an existing and safe solution, I think it makes sense to go with that.

That's exactly what I wanted to discuss, but yes I agree with this.

I am actually not too familiar with gradient clipping. Just to be clear, when you say "max_grad_norm=0.01" is that per point? Does it mean that the euclidean norm of the "update" at each iteration (i.e. after multiplying the gradient by the learning rate) would be limited to magnitude of 0.01? i.e. if it was any larger, we would rescale it to 0.01.

Yes, per point, but I was referring to the gradient itself, before multiplying by the learning rate or gains. OK, I will make some more plots and post here.

linqiaozhi commented 4 years ago

Yes, per point, but I was referring to the gradient itself, before multiplying by the learning rate.

Shouldn't the limit be on the entire step, including the learning rate and the exaggeration? Otherwise too large of a learning rate (like in the examples above) could still lead to pathologic jumping, right?

I understand that the fundamental problem in the examples above is that the gradient is too large (because the attractive force is too strong), but it only manifested because we increased the learning rate. That's why I wonder if our clipping should take the learning rate into account. Is that not typically done in other applications?

dkobak commented 4 years ago

No idea. I saw that in @pavlin-policar's implementation max_grad_norm clipping is applied to the gradient itself. I guess one could alternatively apply it to the whole update step, but a lot of other things influence it (learning rate, momentum, delta-bar-delta gains). I don't really have an intuition about which is better (but certainly the reasonable cutoff values would be different).

@pavlin-policar What do you think?

pavlin-policar commented 4 years ago

If you think gradient clipping is a promising way forward, I can try to plot max gradient norms per iteration for some other datasets like e.g. Tasic2018 or 10x 1.3 mln. @pavlin-policar Have you ever looked at such plots for other data?

I haven't looked at these plots, but it should be fairly straightforward to do. Let's find a conservative value that never kicks in for any standard learning rate. If you could run this on a couple of data sets that would be ideal. It would also be good to benchmark this against having it turned off. The time difference probably won't be major, but if it is, I may even prefer to turn it off by default since this is the first time we've ever run across this.

I am actually not too familiar with gradient clipping. Just to be clear, when you say "max_grad_norm=0.01" is that per point?

Yes, this is per-point. The code is also very simple: we just rescale the gradients to the given norm where needed.

norm = np.linalg.norm(gradient, axis=1)   # per-point gradient norm; gradient is an (N, 2) array
coeff = max_grad_norm / (norm + 1e-6)     # rescaling factor (epsilon avoids division by zero)
mask = coeff < 1                          # only points whose gradient norm exceeds max_grad_norm
gradient[mask] *= coeff[mask, None]       # rescale those gradients down to (approximately) max_grad_norm

I understand that the fundamental problem in the examples above is that the gradient is too large (because the attractive force is too strong), but it only manifested because we increased the learning rate. That's why I wonder if our clipping should take the learning rate into account. Is that not typically done in other applications?

When adding this, I looked at what PyTorch does, since gradient clipping is quite common in recurrent nets. It rescales the gradient before applying the learning rate or momentum. I guess this is standard. It makes sense to do it before applying momentum, because then the momentum could compensate if the gradient clipping value was too aggressive, I suppose.
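
For reference, this is roughly what that PyTorch pattern looks like (a generic toy example, nothing to do with t-SNE itself):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# clip the raw gradients first; the optimizer applies learning rate and momentum afterwards
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```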

dkobak commented 4 years ago

@pavlin-policar: A few small things to clarify in what you wrote:

Let's find a conservative value that never kicks in for any standard learning rate.

If I'm looking at the gradient norm, then it does not depend on the learning rate, right?

If you could run this on a couple of data sets that would be ideal. It would also be good to benchmark this against having it turned off. The time difference probably won't be major, but if it is, I may even prefer to turn it off by default since this is the first time we've ever run across this.

Not sure what time difference you mean. With and without gradient clipping? But if the idea is to find a cutoff that is never triggered in "standard" situations, then there will be no difference at all, right?

(BTW: It could make sense to make gradient a class member in openTSNE, so that there could be a callback returning gradients and gains... For now I am doing a quick hack to grab the max gradient value, but that would be a nicer setup.)

pavlin-policar commented 4 years ago

If I'm looking at the gradient norm, then it does not depend on the learning rate, right?

Yes.

But if the idea is to find a cutoff that is never triggered in "standard" situations, then there will be no difference at all, right?

Not exactly, we're still adding extra code that needs to execute in every iteration. The first three lines of the code snippet I posted above will still always have to be evaluated on all the points, so that will definitely affect runtime. The question is, is this time difference noticeable?

It could make sense to make gradient a class member in openTSNE, so that there could be a callback returning gradients and gains... For now I am doing a quick hack to grab the max gradient value, but that would be a nicer setup.

Do you mean something like what pytorch does with variables? It's an interesting idea, but I don't think it fits into the framework. For stuff like this, I think modifying the code in some hacky way is fine. I can't think of a single reason (besides this) when the gradient would be accessible to users.

dkobak commented 4 years ago

I'm running these experiments now. To save time, I won't be making pretty figures. I am looking at the FIt-SNE terminal output and copying here the max grad norm value. All experiments use learning_rate=n/12 and PCA init as in this PR.

MNIST (PCA down to 50): max grad norm 0.00014 at iter=36 -- this corresponds to the figure I posted above
MNIST (no PCA, so dim=784): max grad norm 0.00016 at iter=30
Tasic (PCA down to 50): max grad norm 0.00067 at iter=36
10x 1.3 mln (PCA down to 50): max grad norm 0.00004 at iter=60

Based on this I saw that it's larger with smaller sample sizes, so I tried random subsets of MNIST:

N=10k: max grad norm 0.00045 at iter=36
N=5k: 0.0012 at iter=41
N=3k: 0.0018 at iter=43
N=1k: 0.00001 at iter=4

Conclusion: I did not see a max grad norm above 0.002, and most of the time the grad norm is much lower than that. These max values only occur during several iterations, and then the norm drops again by an order of magnitude or so.


I tried the same with N=1000 standard Gaussian points in d=1000 dimensions. I am getting gradients as high as 0.15, and they don't go down but just stay at this value. So here clipping at 0.01 works well.

A similar thing happens in George's example with N=20000 in two Gaussians. However, here the max gradients are around 0.01, so clipping them to 0.01 actually does not alleviate the problem in this case... If I clip them to 0.001 then it works fine and points don't overshoot too much; 0.005 also works, but worse.

With N=40000 points, max gradients are around 0.004. Clipping them to 0.001 works fine.


OK, this is a bit complicated. It seems George inadvertently found an optimization loophole. For large N and low latent dimensionality, we need a high learning rate because otherwise the convergence is very slow. That's why we scale the learning rate as N/12. But for large N and high latent dimensionality, this backfires because of the kNN "hubs" (and the FFT approximation slowing down with embedding size). I don't see a good principled way out of it right now. I think we can clip the gradients at 0.001 and everything should work fine in all of the experiments reported so far in this thread, but it might be that N=1mln with D=1000 (with Euclidean dist) would still run into the same problem. Not sure who would use such a dataset though.

linqiaozhi commented 4 years ago

Thanks for doing the experiments, that's really helpful.

Given that this problem is so dependent on the learning rate, it makes sense to me that we should be clipping based on the magnitude after the learning rate is applied. What we need is pretty clear: we need to limit how far a point can move at each iteration, because some points are shooting off and slowing down the interpolation scheme. If we limit it to 1, for example, wouldn't that more or less solve the problem? Based on your experiment, it seems it would be easier to find a reasonable threshold that does not depend on N by clipping after the learning rate is applied (since the learning rate depends on N).

Another option (either in addition to, or in lieu of, the clipping) is to check every 50 iterations whether this overshooting is happening (e.g. whether points are moving by more than 1) and just give the user a warning that the learning rate is probably too high and that they might consider the more conservative learning rate of 200.

dkobak commented 4 years ago

If we limit it to 1, for example, wouldn't that more or less solve the problem?

Okay, so I looked at the steps. They actually don't decrease much during the course of optimization, presumably because the gains compensate for the decreasing gradients. On MNIST, I get a max step of around 1.1 around iter=30 (when the gradient is maximal), then it drops to around 0.5 and stays there. On a 5k subset of MNIST, the max step is around 0.7. On the n=1.3mln dataset, the max step size is around 3.5 in the early iterations, then drops to around 1.5.

With N=1000 in d=1000, I get step sizes in the 20--50 range. With N=20000 in d=1000, I get step sizes in the same range.

OK George, I like this idea. I implemented the clipping of the step size with threshold 5, and it resolves all problems with d=1000. Should I add it to this PR? If so, should we expose this threshold value (max_step_norm) in the wrappers? I'd say we should, in the spirit of having all parameters controllable via the wrappers.
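
In pseudo-Python, the per-point clipping of the step looks roughly like this (a sketch of the idea only; the actual change lives in the C++ gradient-descent loop):

```python
import numpy as np

def clip_step(step, max_step_norm=5.0):
    """step: (N, 2) array of per-point updates, i.e. after learning rate and gains are applied."""
    norms = np.linalg.norm(step, axis=1)
    too_big = norms > max_step_norm
    step[too_big] *= (max_step_norm / norms[too_big])[:, None]
    return step
```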

linqiaozhi commented 4 years ago

Should I add it to this PR? If so, should we expose this threshold value (max_step_norm) in the wrappers? I'd say we should, in the spirit of having all parameters controllable via the wrappers.

Sounds great to me. I will then continue making these changes in the R and MATLAB implementations.

Once this is settled, we should talk more about this phenomenon. I wonder if it is an additional argument for reducing dimensionality using PCA prior to doing t-SNE or UMAP embeddings.

dkobak commented 4 years ago

OK, I pushed it into the PR. Note that the data.dat format changed because of the additional parameter. Everything seems to work, but it's late here, so we should test a bit more.

linqiaozhi commented 4 years ago

One more thing: irlba is initialized with a random vector, so as of right now the resulting embedding when initialization='pca' is not actually reproducible. I was going to modify it so that if the user provides a seed, then that same seed is also used immediately prior to the call to PCA.

EDIT: by irlba I mean ARPACK, in Python.

dkobak commented 4 years ago

Good point. Let's add the seed to the PCA() constructor:

pca = PCA(n_components=map_dims, svd_solver=solver, random_state=seed if seed!=-1 else None)
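
For context, the surrounding initialization code would then look something like this (a sketch; the rescaling of the PCA initialization to a small standard deviation is my assumption about the wrapper, not necessarily its exact code):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_init(X, map_dims=2, seed=42, solver="arpack"):
    pca = PCA(n_components=map_dims, svd_solver=solver,
              random_state=seed if seed != -1 else None)
    init = pca.fit_transform(X)
    return init / np.std(init[:, 0]) * 0.0001   # scale PC1 to std 0.0001 (assumed convention)
```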
dkobak commented 4 years ago

sorry, had a typo in the code above, edited now.

linqiaozhi commented 4 years ago

Okay, the MATLAB and R wrappers are updated. If there are no other changes, I'll change the version numbers in each of the wrappers to 1.2.0 and make a new release!

dkobak commented 4 years ago

The PCA argument in Python should be random_state, not random_seed (I wrote it wrong initially).

dkobak commented 4 years ago

Other than that, it looks good to me. I can look a bit closer later today.

@pavlin-policar What do you think about the max_step_norm=5 approach that we took?

dkobak commented 4 years ago

@linqiaozhi Stupid question: how does one add commits to somebody else's PR? You added two commits to this PR of mine; how does that work?

linqiaozhi commented 4 years ago

@linqiaozhi Stupid question: how does one add commits to somebody else's PR? You added two commits to this PR of mine; how does that work?

As far as I understand, the PR is from dkobak/FIt-SNE:master to KlugerLab/FIt-SNE:master. So, any changes to dkobak/FIt-SNE:master will be included in the PR.

So I cloned dkobak/FIt-SNE, made my changes, and then pushed to it. Because it is a branch of a repository I have push access to, by default I can push to it as well.

dkobak commented 4 years ago

So I cloned dkobak/FIt-SNE, made my changes, and then pushed to it. Because it is a branch of a repository I have push access to, by default I can push to it as well.

Oh! I didn't realize that. Actually, if I look at the settings of the dkobak/FIt-SNE repository, GitHub tells me that "0 collaborators have access to this repository. Only you can contribute to this repository." But I can see your commits in there.

Do you have push access to all forks of klugerlab/FIt-SNE?

linqiaozhi commented 4 years ago

No, only because you did a PR from it. There is a checkbox to enable/disable this when you make the PR. See here.

dkobak commented 4 years ago

Oh, that's smart of GitHub! :-)

I fixed the random_state, and also put the whole .py wrapper through the Black formatter. This will result in a large diff, but I thought it's worth it because it's considered good Python code style.

pavlin-policar commented 4 years ago

What do you think about the max_step_norm=5 approach that we took?

TBH, I don't like it very much. 5 seems very arbitrary, and this might not work for cases we haven't tested. A better way to set this would be to look at the distribution of step sizes and derive the threshold from that, something like 2*median_step. I'll add this option to openTSNE just to be consistent with FIt-SNE, but I'll disable it by default. So in case anybody needs it, it's there; but I don't think this is a common occurrence, so it shouldn't be a problem.
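
Something like this, perhaps -- the same per-point clipping as in the sketch above, except that the limit is derived from the current distribution of step norms rather than being a fixed constant (just a sketch; 2*median is itself arbitrary):

```python
import numpy as np

def adaptive_clip(step, factor=2.0):
    norms = np.linalg.norm(step, axis=1)
    limit = factor * np.median(norms)        # recomputed at every iteration
    too_big = norms > limit
    step[too_big] *= (limit / norms[too_big])[:, None]
    return step
```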

Something that worries me is that this feels very much like a band-aid to some underlying problem. It might be worth looking at how momentum acts here, maybe the momentum is too high? Or maybe the learning rate shouldn't just depend on N, but also on the number of dimensions. The list of parameters is already quite long, and slapping another one on feels like a dirty fix.

dkobak commented 4 years ago

@pavlin-policar

Something that worries me is that this feels very much like a band-aid to some underlying problem. It might be worth looking at how momentum acts here, maybe the momentum is too high? Or maybe the learning rate shouldn't just depend on N, but also on the number of dimensions. The list of parameters is already quite long, and slapping another one on feels like a dirty fix.

I agree with this. On the other hand, limiting the maximum allowed step size seems to be a pretty natural and quite intuitive restriction... One never wants points to move by a lot on each iteration during t-SNE optimization. Given that the final embedding size is usually on the scale of ~100 and that one usually uses 750-1000 steps, it feels fine to restrict the max step to something around ~1.

5 seems very arbitrary and this might not work for cases we haven't tested.

I agree. Based on my experiments above, I think 1 or 10 could work just as well. The good thing is that limiting the step size should not really break anything; the worst that can happen is that the convergence will slow down.

A better way to set this would be to look at the distribution of step sizes and figure out something from that, something like 2*median_step or something.

You mean doing this on each iteration? That's an interesting idea too.

Not exactly, we're still adding extra code that needs to execute in every iteration. The first three lines of the code snippet I posted above will still always have to be evaluated on all the points, so that will definitely affect runtime. The question is, is this time difference noticeable?

To me it felt negligible. We only compute row sums for an Nx2 array. It's very fast.

Do you mean something like what pytorch does with variables? It's an interesting idea, but I don't think it fits into the framework. For stuff like this, I think modifying the code in some hacky way is fine. I can't think of a single reason (besides this) when the gradient would be accessible to users.

Not sure what you meant here (not familiar with pytorch). I just meant that I wanted to write a custom callback that would give me gradient norm at each iteration and realized that I cannot do that because the gradient is only defined within one of the functions.

linqiaozhi commented 4 years ago

@pavlin-policar I agree that this fix feels arbitrary. But it seems quite safe (as Dmitry pointed out), and as far as we can tell, probably will only be applied in rare circumstances. More than happy to improve upon it as we learn more about this overshooting effect.

I think the increase in learning rate (that is causing this problem) is quite important to users. I believe people are getting poor results on their t-SNEs of huge datasets because the default 1000 iters with learning rate of 200 is simply inadequate.

So, I'd like not to hold back this release, but rather update it if we have a better solution later.