YingfanWang / PaCMAP

PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
Apache License 2.0

`fit_transform` and `transform` on the same feature doesn't return the same value #48

Open duguyue100 opened 1 year ago

duguyue100 commented 1 year ago

Hi, thanks for developing PaCMAP, lovely work!

I found that using transform after using fit_transform on the same set of features yields different results.

I ran the following example:

import pacmap
import numpy as np

np.random.seed(0)

init = "pca"  # results can be reproduced also with "random"

reducer = pacmap.PaCMAP(
    n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, save_tree=True
)

features = np.random.randn(100, 30)

# Fit the reducer and embed the training data.
reduced_features = reducer.fit_transform(features, init=init)
print(reduced_features[:10])

# Project the same data again with transform().
transformed_features = reducer.transform(features)
print(transformed_features[:10])

And it returns:

[[ 0.7728913   3.785831  ]
 [-0.69379026  2.116452  ]
 [-1.7770871  -0.97542125]
 [ 2.5090704   1.8718773 ]
 [-0.06890291 -2.2959301 ]
 [ 1.9657456   1.1580495 ]
 [ 1.0486693  -1.4648851 ]
 [-1.4896832   1.7203271 ]
 [ 0.54106015  2.38868   ]
 [ 3.0175838  -1.9216222 ]]

[[-0.03516154  2.543376  ]
 [-0.467008    1.6641414 ]
 [-0.44973713 -1.535601  ]
 [ 1.0218439   1.5691875 ]
 [-0.30733356 -2.3227684 ]
 [ 0.8294033   1.0432268 ]
 [ 0.10503205 -0.8651409 ]
 [-0.63982046  0.59202313]
 [ 0.38573623  1.5135498 ]
 [ 2.0508025  -1.5033388 ]]

I would expect the same results because fit_transform should be the combination of fit and transform (regardless of the implementation details). This is what PCA in sklearn and UMAP do.
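For instance, sklearn's PCA behaves this way (a quick sanity check; the two outputs agree up to floating-point tolerance):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).standard_normal((100, 30))

pca = PCA(n_components=2)
Y_fit = pca.fit_transform(X)  # fit and embed the training data
Y_new = pca.transform(X)      # project the same data again
assert np.allclose(Y_fit, Y_new)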

Is this an intended feature? And if the answer is no, what should we do? One possible solution I found is:

reducer = reducer.fit(features, init=init)

# Now the following lines return the same values.
reduced_features = reducer.transform(features)
transformed_features = reducer.transform(features)

But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.

PS: this has nothing to do with the random seed; since I fixed the seed, I get the same result across runs.

hyhuang00 commented 1 year ago

Hi there! Thank you for using PaCMAP. The result is expected to be different, since in PaCMAP the transform() function treats the input as additional data points that are appended to the original data. In the current version, transform() tries to place each new input near its nearest neighbors' low-dimensional embeddings. As a result, there is no guarantee that the same point will always be placed at the same location. This design choice allows the points to be differentiated. However, as we said in the README, this feature is not finalized and we welcome any feedback on its design. Is there any reason you want two data points that have the same value to be placed at the same location?
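Conceptually, the placement is close to the following sketch (this is only an illustration using sklearn's NearestNeighbors, not the actual PaCMAP implementation, which also optimizes the positions afterwards):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def place_new_points(X_train, Y_train, X_new, n_neighbors=10):
    # Toy illustration: start each new point at the average of its nearest
    # neighbors' low-dimensional embeddings. Because the final positions are
    # then optimized, they need not coincide with the fit_transform output.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
    _, idx = nn.kneighbors(X_new)      # neighbors found in the original space
    return Y_train[idx].mean(axis=1)   # average of their low-dimensional embeddings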

duguyue100 commented 1 year ago

Thanks for your fast reply! I think conceptually it makes more sense for identical incoming points to be projected to the same place as the old points. Users who have used PCA or UMAP before (like me) would expect this behavior.

My specific case was writing a test for our software that checks whether fit_transform and transform produce the same results. Since that test fails, I disabled certain reproducibility behavior in our software for PaCMAP.
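Roughly, this is the kind of check I had in mind (a simplified sketch; it is exactly the expectation that currently fails):

import numpy as np
import pacmap

def test_transform_matches_fit_transform():
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 30))
    reducer = pacmap.PaCMAP(n_components=2, save_tree=True)
    Y_fit = reducer.fit_transform(X, init="pca")
    Y_new = reducer.transform(X)
    # Fails with the current transform() behavior.
    assert np.allclose(Y_fit, Y_new)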

Full disclosure, I haven't read the PaCMAP paper, and I'm not sure whether what I described here is doable. If it is not possible for PaCMAP to mirror sklearn's fit_transform and transform, then I think it makes sense to place a big, bold warning in both the README and the documentation.

MattWenham commented 1 year ago

Is there any reason you want two data points that have the same value to be placed at the same location?

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

hyhuang00 commented 1 year ago

Thanks for your fast reply! I think conceptually it makes more sense for identical incoming points to be projected to the same place as the old points. Users who have used PCA or UMAP before (like me) would expect this behavior.

My specific case was writing a test for our software that checks whether fit_transform and transform produce the same results. Since that test fails, I disabled certain reproducibility behavior in our software for PaCMAP.

Full disclosure, I haven't read the PaCMAP paper, and I'm not sure whether what I described here is doable. If it is not possible for PaCMAP to mirror sklearn's fit_transform and transform, then I think it makes sense to place a big, bold warning in both the README and the documentation.

Thank you for your suggestion! A warning has been added to the method, and we will think about ways to improve the transform method.

duguyue100 commented 1 year ago

@hyhuang00 Thanks for your effort, you can close this issue if you want.

hyhuang00 commented 1 year ago

Is there any reason you want two data points that have the same value to be placed at the same location?

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization. It helps the embedding avoid the so-called "crowding problem" during optimization, and sometimes it helps our users see that multiple points occupy the same area, forming a cluster. This might be less helpful when the embedding is used for other purposes. Perhaps we can add an option that allows different behavior.
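As a toy illustration (not our implementation): if many identical points were mapped to exactly the same coordinates, a scatter plot would show a single marker, whereas spreading them slightly keeps the cluster visible.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embedding = np.zeros((50, 2))                                       # 50 identical points collapse to one marker
spread = embedding + rng.normal(scale=0.05, size=embedding.shape)   # small offsets reveal the cluster

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(embedding[:, 0], embedding[:, 1])
axes[0].set_title("identical coordinates")
axes[1].scatter(spread[:, 0], spread[:, 1])
axes[1].set_title("slightly spread")
plt.show()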

MattWenham commented 1 year ago

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization.

Very true, but 'very similar values' and 'the same value' are two different use cases.

TCWO commented 1 year ago

Hi there, I am trying to fit a model with a smaller set and then apply the transform to a bigger set, but I encountered this error, which I assume is about generating the neighbors. Can you let me know how I can handle it?

AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_623958/2284593526.py in <module>
----> 1 data_all_dr, t_all_dr = DimRed2(data_sampl, data_norm, method = dr, dims=dims)

/tmp/ipykernel_623958/736146467.py in DimRed2(df1, df2, method, dims, pca)
     84
     85         # Now, use the fitted model to transform a larger dataset (X_large)
---> 86         dr = embedding.transform(X2, init='pca', save_pairs=False)
     87
     88         end = time.time()

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
    932                              self.apply_pca, self.verbose)
    933         # Sample pairs
--> 934         self.pair_XP = generate_extra_pair_basis(basis, X,
    935                                                  self.n_neighbors,
    936                                                  self.tree,

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
    397     npr, dimp = X.shape
    398
--> 399     assert (basis is not None or tree is not None), "If the annoyindex is not cached, the original dataset must be provided."
    400
    401     # Build the tree again if not cached

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

And here is my function; X is the smaller set and X2 the big dataset:

    elif method == 'PaCMAP':
        # Slightly different since we need to transform the dataframe to an
        # array as an input for the pacmap function
        start = time.time()
        X = data
        X = np.asarray(X)
        X = X.reshape(X.shape[0], -1)
        X2 = data2
        X2 = np.asarray(X2)
        X2 = X2.reshape(X2.shape[0], -1)
        # Setting n_neighbors to "None" leads to a default choice
        embedding = pacmap.PaCMAP(n_components=dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
        # Fit the data (the index of transformed data corresponds to the index of the original data)
        #embedding.fit(X, init="pca")
        #dr = embedding.transform(X2)

        # Fit and transform using a smaller dataset (X_small)
        embedding_small = embedding.fit_transform(X, init='pca', save_pairs=True)

        # Now, use the fitted model to transform a larger dataset (X_large)
        dr = embedding.transform(X2, init='pca', save_pairs=False)

        end = time.time()
        t = end - start
escheer commented 6 months ago

Hello, thank you for PACMAP, beautiful work.

I second this question. I am hitting:

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

when I call the transform method on a new dataset after it has already been fit on a previous one. It is desirable to be able to transform new data into an existing embedding space. Can you provide some guidance on this?

EDIT: this was because I had not specified save_tree=True. Might be good to spell that out a bit more clearly in the documentation! Thank you :)
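For anyone else hitting this, a minimal sketch of the pattern that ended up working for me (variable names are placeholders):

import numpy as np
import pacmap

X_small = np.random.randn(1000, 30)   # data used for fitting
X_large = np.random.randn(5000, 30)   # new data to project later

# save_tree=True keeps the AnnoyIndex around, so transform() can find
# neighbors for new data without being given the original dataset again.
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=None, save_tree=True)
Y_small = embedding.fit_transform(X_small, init="pca")

# Project the larger dataset into the existing embedding space.
Y_large = embedding.transform(X_large)

Alternatively, judging from the assertion in the traceback above, passing the original data as the basis argument to transform() looks like it would also avoid the error.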