Closed jstankowicz closed 3 years ago
Thanks for reporting, and for the great MWE! I'll take a look.
@jstankowicz,
I took a look.
You can control the determinism via setting a couple random seeds. But in this case I believe the computation should be deterministic because the neighbors are computed exactly ... so I'll look into it further.
By default pymde.preserve_neighbors
randomly samples non-neighbors of each item, and adds these non-neighbors as edges to the graph being embedded (they are associated with repulsive distortion functions).
Right now PyMDE doesn't have an API for setting the random seed (I plan to add one, see issue #43). For now, you can get deterministic behavior by setting two random seeds (one for PyTorch, one for NumPy) before calling preserve_neighbors
:
import pymde
import torch
import numpy as np
n_examples=15 # Note results become reproducible with n_examples=14
embedding_dim = 2
original_dim = 128
X = torch.rand((n_examples,original_dim))
collect = []
for i in range(5):
torch.manual_seed(0)
np.random.seed(0)
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)
collect.append(mde.edges)
if i > 0:
print(f"Trial {i} matches trial {i-1}")
print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
print()
prints
Trial 1 matches trial 0
tensor(True)
Trial 2 matches trial 1
tensor(True)
Trial 3 matches trial 2
tensor(True)
Trial 4 matches trial 3
tensor(True)
Hope that helps!
Alternatively, you can disable the inclusion of repulsive distortion functions, which will make the neighborhood computation deterministic.
Note the kwarg repulsive_penalty
below. Setting it to None
disables repulsion.
n_examples=15 # Note results become reproducible with n_examples=14
embedding_dim = 2
original_dim = 128
X = torch.rand((n_examples,original_dim)).double()
collect = []
for i in range(5):
mde = pymde.preserve_neighbors(X, embedding_dim=embedding_dim,
repulsive_penalty=None)
collect.append(mde.edges)
if i > 0:
print(f"Trial {i} matches trial {i-1}")
print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
print()
Thanks for taking a look at this! Like you guessed, I was looking into this for random-seed-setting purposes. Interestingly the random-seed-setting doesn't work for 36 examples:
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
n_examples=36
embedding_dim = 2
original_dim = 128
collect = []
def set_seed(i):
torch.manual_seed(i)
np.random.seed(i)
X = torch.rand((n_examples,original_dim))
for i in range(5):
set_seed(0)
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)#, repulsive_penalty=None)
collect.append(mde.edges)
if i > 0:
print(f"Trial {i} matches trial {i-1}")
print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
print()
returns:
Trial 1 matches trial 0
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_13917/3102527318.py in <module>
15 if i > 0:
16 print(f"Trial {i} matches trial {i-1}")
---> 17 print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
18 print()
RuntimeError: The size of tensor a (226) must match the size of tensor b (223) at non-singleton dimension 0
If I traced it through correctly, looks like that might result from swapping from sklearn
(for 35 and fewer examples) to pynndescent
(for more than 35 examples), and pynndescent
can return differing numbers of nearest neighbors (although why the cutoff is 35 examples is unclear to me).
At any rate the pymde.preserve_neighbors(...,repulsive_penalty=None,...)
option works for me:
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
n_examples=36
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
for i in range(5):
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim, repulsive_penalty=None)
collect.append(mde.edges)
if i > 0:
print(f"Trial {i} matches trial {i-1}")
print((torch.sort(collect[i],0).values == torch.sort(collect[i-1],0).values).all())
print()
returns
Trial 1 matches trial 0
tensor(True)
Trial 2 matches trial 1
tensor(True)
Trial 3 matches trial 2
tensor(True)
Trial 4 matches trial 3
tensor(True)
Follow up: upon actually looking at the embeddings, setting repulsive_penalty=None
doesn't give a very good representation of the randomly distributed data compared to the default case.
Default case:
import plotly.graph_objects as go
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
import plotly.graph_objects as go
n_examples=100
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim)#, repulsive_penalty=None)
collect.append(mde.edges)
embedding = mde.embed().cpu().numpy()
go.Figure(go.Scattergl(x=embedding[:,0],y=embedding[:,1],mode="markers"))
Removing repulsive_penalty
case:
import plotly.graph_objects as go
import pymde
import torch
import numpy as np
from graph_embeddings.tools import set_random_seed
import plotly.graph_objects as go
n_examples=100
embedding_dim = 2
original_dim = 128
collect = []
X = torch.rand((n_examples,original_dim))
mde = pymde.preserve_neighbors(X,embedding_dim=embedding_dim, repulsive_penalty=None)
collect.append(mde.edges)
embedding = mde.embed().cpu().numpy()
go.Figure(go.Scattergl(x=embedding[:,0],y=embedding[:,1],mode="markers"))
@jstankowicz, thanks so much for the follow up, and sorry for the late reply! I was out on vacation.
I've created a fix (https://github.com/cvxgrp/pymde/pull/45) which I believe solves this issue.
The above PR adds a function, pymde.seed
, which lets you set a global random seed. (It sets the NumPy and torch global seeds, and initializes a NumPy generator for later use by internal PyMDE code. This last bit was required to get your example working.)
To set the seed to 0, you'd just call
pymde.seed(0)
Does that API look good to you?
No worries, thanks for diving into this! I pulled the random_seed
branch, and confirmed that pymde.seed(0)
indeed resolves all the issues above. Thanks!
Great! Thanks again for raising the issue and providing helpful code snippets.
I'll merge the PR soon and cut a new release. Once that's done, I'll close this issue.
Hello, Running
pymde.preserve_neighbors
does not always produce the same set of edges as stored inmde.edges
. Here is a MWE:This prints:
Note per the comment above, if I switch to
n_examples = 14
, I get that all elements ofcollect
are identical.I haven't been able to fully trace the issue, but did note that the
neighbors
variable defined here is consistent on each trial, so my best guess is there is something strange happening in theGraph
class here.