cvxgrp / pymde

Minimum-distortion embedding with PyTorch
https://pymde.org
Apache License 2.0

Adding or removing one sample results in absolutely different embeddings #50

Open cowjen01 opened 2 years ago

cowjen01 commented 2 years ago

My code is the following:

pymde.seed(0)
mde = pymde.preserve_neighbors(
    matrix[:1001],  # vs. matrix[:1000]
    embedding_dim=2,
    init='random',
    device='cpu',
    constraint=pymde.Centered(),
    verbose=self.verbose
)
embeddings = mde.embed(verbose=self.verbose)
embeddings = embeddings.cpu().numpy()

When I use the first 1,000 samples from the input matrix, I get very different results than when I use one sample more (1,001).

Here is the log:

Feb 21 07:21:55 PM: Computing 5-nearest neighbors, with max_distance=None
Feb 21 07:21:55 PM: Exact nearest neighbors by brute force 
Feb 21 07:21:55 PM: Your dataset appears to contain duplicated items (rows); when embedding, you should typically have unique items.
Feb 21 07:21:55 PM: The following items have duplicates [261 262 264 385 394 490 521 542 547 592 715]
Feb 21 07:21:55 PM: Fitting a centered embedding into R^2, for a graph with 1001 items and 9562 edges.
Feb 21 07:21:55 PM: `embed` method parameters: eps=1.0e-05, max_iter=300, memory_size=10
Feb 21 07:21:55 PM: iteration 000 | distortion 0.773313 | residual norm 0.0166138 | step length 30.3 | percent change 1.09275
Feb 21 07:21:55 PM: iteration 030 | distortion 0.372009 | residual norm 0.00494183 | step length 1 | percent change 5.72445
Feb 21 07:21:55 PM: iteration 060 | distortion 0.305200 | residual norm 0.00271112 | step length 1 | percent change 3.55324
Feb 21 07:21:56 PM: iteration 090 | distortion 0.284056 | residual norm 0.00196794 | step length 1 | percent change 2.22588
Feb 21 07:21:56 PM: iteration 120 | distortion 0.277153 | residual norm 0.000870837 | step length 1 | percent change 0.436913
Feb 21 07:21:56 PM: iteration 150 | distortion 0.275639 | residual norm 0.00086974 | step length 1 | percent change 1.04672
Feb 21 07:21:56 PM: iteration 180 | distortion 0.272377 | residual norm 0.00140454 | step length 1 | percent change 1.2704
Feb 21 07:21:56 PM: iteration 210 | distortion 0.269552 | residual norm 0.000706442 | step length 1 | percent change 0.560233
Feb 21 07:21:56 PM: iteration 240 | distortion 0.267543 | residual norm 0.00103134 | step length 1 | percent change 0.558733
Feb 21 07:21:56 PM: iteration 270 | distortion 0.265752 | residual norm 0.000605354 | step length 1 | percent change 0.259163
Feb 21 07:21:56 PM: iteration 299 | distortion 0.265053 | residual norm 0.000348569 | step length 1 | percent change 0.0578442
Feb 21 07:21:56 PM: Finished fitting in 0.660 seconds and 300 iterations.
Feb 21 07:21:56 PM: average distortion 0.265 | residual norm 3.5e-04

And here the output embeddings:

[Screenshots of the two embeddings, taken 2022-02-21 at 19:26]

Is this expected behaviour? I thought adding one sample should not make that much of a difference.

Thank you for helping me out!

akshayka commented 2 years ago

It depends on how close the new sample is, on average, to the first 1,000 samples. If it's a nearest neighbor of many of the original samples, then the embedding may look a bit different.
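Also note that a centered embedding only pins down pairwise distances, so two runs can agree up to a rotation or reflection of the plane and still look different in a scatter plot. Before comparing runs visually, it can help to align one embedding onto the other. A minimal sketch with NumPy (the `align` helper and the toy data here are illustrative, not part of pymde's API):

```python
import numpy as np

def align(A, B):
    """Rotate/reflect embedding B onto A via an orthogonal Procrustes fit.

    Finds the orthogonal matrix Q minimizing ||A - B @ Q||_F, so that
    B @ Q is directly comparable to A point-by-point.
    """
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

# Toy check: a rotated copy of an embedding aligns back exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2))
theta = 1.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ R  # same embedding, rotated

print(np.allclose(align(A, B), A))  # True: the rotation is undone
```

If the two embeddings still disagree substantially after alignment (restricted to the 1,000 shared items), the difference is genuinely in the solution, not just the orientation.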

A few things you can try:

If you give me access to the data I can play around with your example when I have some free time.

Additionally, I see that the log contains the following line:

Feb 21 07:21:55 PM: Your dataset appears to contain duplicated items (rows); when embedding, you should typically have unique items.

Having duplicates is typically ill-advised (and can sometimes lead to unexpected behavior), since it doesn't really make sense in the context of the embedding problem. You don't need two representations of the same thing.
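If the duplicates aren't intentional, one option is to deduplicate the rows before calling `pymde.preserve_neighbors`, keeping an index map so per-item results can be broadcast back afterwards. A sketch with `np.unique` (the `matrix` here is made-up toy data standing in for your input):

```python
import numpy as np

# Hypothetical input: 6 rows, two of which are exact duplicates.
matrix = np.array([
    [0.0, 1.0],
    [2.0, 3.0],
    [0.0, 1.0],  # duplicate of row 0
    [4.0, 5.0],
    [2.0, 3.0],  # duplicate of row 1
    [6.0, 7.0],
])

# Keep one copy of each distinct row; `inverse` maps every original row
# to its position in `unique_rows`, so embedding coordinates computed on
# the deduplicated matrix can be expanded back to the full dataset.
unique_rows, inverse = np.unique(matrix, axis=0, return_inverse=True)

assert np.array_equal(unique_rows[inverse], matrix)  # dedupe is lossless
print(unique_rows.shape)  # (4, 2)
```

You would then embed `unique_rows` and, if needed, recover a per-original-row result as `embedding[inverse]`.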