Nearest Neighbor edges are not consistent between runs

jstankowicz commented 3 years ago

Hello, Running pymde.preserve_neighbors does not always produce the same set of edges as stored in mde.edges. Here is a MWE:

import pymde
import torch
n_examples=15 # Note results become reproducible with n_examples=14
embedding_dim = 2
original_dim = 128
X = torch.rand((n_examples,original_dim))
collect = []
for i in range(5):
    mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)
    collect.append(mde.edges)
    if i > 0:
        print(f"Trial {i} matches trial {i-1}")
        print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
        print()

This prints:

Trial 1 matches trial 0
tensor(False)

Trial 2 matches trial 1
tensor(False)

Trial 3 matches trial 2
tensor(False)

Trial 4 matches trial 3
tensor(False)

Note per the comment above, if I switch to n_examples = 14, I get that all elements of collect are identical.

I haven't been able to fully trace the issue, but did note that the neighbors variable defined here is consistent on each trial, so my best guess is there is something strange happening in the Graph class here.

akshayka commented 3 years ago

Thanks for reporting, and for the great MWE! I'll take a look.

akshayka commented 3 years ago

@jstankowicz,

I took a look.

You can control the determinism via setting a couple random seeds. ~~But in this case I believe the computation should be deterministic because the neighbors are computed exactly ... so I'll look into it further.~~

By default pymde.preserve_neighbors randomly samples non-neighbors of each item, and adds these non-neighbors as edges to the graph being embedded (they are associated with repulsive distortion functions).

Right now PyMDE doesn't have an API for setting the random seed (I plan to add one, see issue #43). For now, you can get deterministic behavior by setting two random seeds (one for PyTorch, one for NumPy) before calling preserve_neighbors:

import pymde
import torch
import numpy as np

n_examples=15 # Note results become reproducible with n_examples=14
embedding_dim = 2
original_dim = 128
X = torch.rand((n_examples,original_dim))
collect = []
for i in range(5):
    torch.manual_seed(0)
    np.random.seed(0)
    mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)
    collect.append(mde.edges)
    if i > 0:
        print(f"Trial {i} matches trial {i-1}")
        print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
        print()

prints

Trial 1 matches trial 0
tensor(True)

Trial 2 matches trial 1
tensor(True)

Trial 3 matches trial 2
tensor(True)

Trial 4 matches trial 3
tensor(True)

Hope that helps!

akshayka commented 3 years ago

Alternatively, you can disable the inclusion of repulsive distortion functions, which will make the neighborhood computation deterministic.

Note the kwarg repulsive_penalty below. Setting it to None disables repulsion.

n_examples=15 # Note results become reproducible with n_examples=14              
embedding_dim = 2                                                                
original_dim = 128                                                               
X = torch.rand((n_examples,original_dim)).double()                               
collect = []                                                                     
for i in range(5):                                                               
    mde = pymde.preserve_neighbors(X, embedding_dim=embedding_dim,               
            repulsive_penalty=None)                                              
    collect.append(mde.edges)                                                    
    if i > 0:                                                                    
        print(f"Trial {i} matches trial {i-1}")                                  
        print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
        print()

jstankowicz commented 3 years ago

Thanks for taking a look at this! Like you guessed, I was looking into this for random-seed-setting purposes. Interestingly the random-seed-setting doesn't work for 36 examples:

import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
n_examples=36
embedding_dim = 2
original_dim = 128
collect = []
def set_seed(i):
    torch.manual_seed(i)
    np.random.seed(i)
X = torch.rand((n_examples,original_dim))
for i in range(5):
    set_seed(0)
    mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)#, repulsive_penalty=None)
    collect.append(mde.edges)
    if i > 0:
        print(f"Trial {i} matches trial {i-1}")
        print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
        print()

returns:

Trial 1 matches trial 0
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_13917/3102527318.py in <module>
     15     if i > 0:
     16         print(f"Trial {i} matches trial {i-1}")
---> 17         print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
     18         print()

RuntimeError: The size of tensor a (226) must match the size of tensor b (223) at non-singleton dimension 0

If I traced it through correctly, looks like that might result from swapping from sklearn (for 35 and fewer examples) to pynndescent (for more than 35 examples), and pynndescent can return differing numbers of nearest neighbors (although why the cutoff is 35 examples is unclear to me).

At any rate the pymde.preserve_neighbors(...,repulsive_penalty=None,...) option works for me:

import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
n_examples=36
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
for i in range(5):
    mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim, repulsive_penalty=None)
    collect.append(mde.edges)
    if i > 0:
        print(f"Trial {i} matches trial {i-1}")
        print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
        print()

returns

Trial 1 matches trial 0
tensor(True)

Trial 2 matches trial 1
tensor(True)

Trial 3 matches trial 2
tensor(True)

Trial 4 matches trial 3
tensor(True)

jstankowicz commented 3 years ago

Follow up: upon actually looking at the embeddings, setting repulsive_penalty=None doesn't give a very good representation of the randomly distributed data compared to the default case.

Default case:

import plotly.graph_objects as go
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
import plotly.graph_objects as go
n_examples=100
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)#, repulsive_penalty=None)
collect.append(mde.edges)
embedding = mde.embed().cpu().numpy()
go.Figure(go.Scattergl(x=embedding[:,0],y=embedding[:,1],mode="markers"))

Removing repulsive_penalty case:

import plotly.graph_objects as go
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
import plotly.graph_objects as go
n_examples=100
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim, repulsive_penalty=None)
collect.append(mde.edges)
embedding = mde.embed().cpu().numpy()
go.Figure(go.Scattergl(x=embedding[:,0],y=embedding[:,1],mode="markers"))

akshayka commented 3 years ago

@jstankowicz, thanks so much for the follow up, and sorry for the late reply! I was out on vacation.

I've created a fix (https://github.com/cvxgrp/pymde/pull/45) which I believe solves this issue.

The above PR adds a function, pymde.seed, which lets you set a global random seed. (It sets the NumPy and torch global seeds, and initializes a NumPy generator for later use by internal PyMDE code. This last bit was required to get your example working.)

To set the seed to 0, you'd just call

pymde.seed(0)

Does that API look good to you?

jstankowicz commented 3 years ago

No worries, thanks for diving into this! I pulled the random_seed branch, and confirmed that pymde.seed(0) indeed resolves all the issues above. Thanks!

akshayka commented 3 years ago

Great! Thanks again for raising the issue and providing helpful code snippets.

I'll merge the PR soon and cut a new release. Once that's done, I'll close this issue.

cvxgrp / pymde

Nearest Neighbor edges are not consistent between runs #42