Open duguyue100 opened 1 year ago
Hi there! Thank you for using PaCMAP. The difference is expected: in PaCMAP, the transform() function treats the input as additional data points that are added to the original data. In the current version, transform() will try to place each new input near its nearest neighbors' low-dimensional embeddings. As a result, there is no guarantee that identical points will always be placed at the same location. This design choice allows the points to be differentiated. However, as we said in the README, this feature is not finalized and we welcome any feedback on its design. Is there any reason you want two data points that have the same value to be placed at the same place?
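The placement rule described above can be sketched in plain NumPy. The averaging rule and k=2 below are illustrative assumptions, not PaCMAP's actual implementation; the point is only to show why a new point identical to a training point need not land exactly on that point's embedding.

```python
import numpy as np

def nn_placement(X_train, Y_train, X_new, k=2):
    """Place each new point at the mean embedding of its k nearest
    training neighbors (illustrative stand-in for PaCMAP's init)."""
    Y_new = np.empty((len(X_new), Y_train.shape[1]))
    for i, x in enumerate(X_new):
        d = np.linalg.norm(X_train - x, axis=1)   # distances in high-dim space
        nn = np.argsort(d)[:k]                    # indices of k nearest neighbors
        Y_new[i] = Y_train[nn].mean(axis=0)       # average their embeddings
    return Y_new

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y_train = np.array([[0.0], [2.0], [4.0]])  # pretend 1-D embedding
# A point identical to X_train[0] is initialized from its neighborhood,
# not pinned to Y_train[0], so it lands at 1.0 rather than 0.0 here.
print(nn_placement(X_train, Y_train, X_train[:1]))  # → [[1.]]
```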
Thanks for your fast reply! I think conceptually it makes more sense that identical incoming points should be projected to the same place as the old points. Users who used PCA or UMAP before (like me) would expect this behavior.
My specific case was to write a test in our software to check whether fit_transform and transform produce the same results. Since the outcome of this test is false, I disabled certain reproducibility behavior in our software for PaCMAP.
Full disclosure: I haven't read the PaCMAP paper, so I'm not sure whether what I described here is doable. If it is not possible for PaCMAP to mirror sklearn's fit_transform and transform contract, then I think it makes sense to place a big bold warning in both the README and the documentation.
Is there any reason you want two data points that have the same value to be placed at the same place?
Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?
Thank you for your suggestion! A warning has been added to the method, and we will think about ways to improve the transform method.
@hyhuang00 Thanks for your effort, you can close this issue if you want.
Is there any reason you want two data points that have the same value to be placed at the same place?
Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?
Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization. It helps the embedding avoid the so-called "crowding problem" during optimization, and it sometimes helps our users see that there are multiple points at the same location, forming a cluster. This may be less helpful when the embedding is used for other purposes. Perhaps we can add an option to allow different behavior.
Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?
Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization.
Very true, but 'very similar values' and 'the same value' are two different use cases.
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_623958/2284593526.py in <module>

/tmp/ipykernel_623958/736146467.py in DimRed2(df1, df2, method, dims, pca)
     84
     85     # Now, use the fitted model to transform a larger dataset (X_large)
---> 86     dr = embedding.transform(X2, init='pca', save_pairs=False)
     87
     88     end = time.time()

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
    932                                      self.apply_pca, self.verbose)
    933         # Sample pairs
--> 934         self.pair_XP = generate_extra_pair_basis(basis, X,
    935                                                  self.n_neighbors,
    936                                                  self.tree,

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
    397     npr, dimp = X.shape
    398
--> 399     assert (basis is not None or tree is not None), "If the annoyindex is not cached, the original dataset must be provided."
    400
    401     # Build the tree again if not cached

AssertionError: If the annoyindex is not cached, the original dataset must be provided.
And here is my function, where X is the smaller set and X2 the big dataset:

elif method == 'PaCMAP':
    start = time.time()
    X = data
    X = np.asarray(X)
    X = X.reshape(X.shape[0], -1)
    X2 = data2
    X2 = np.asarray(X2)
    X2 = X2.reshape(X2.shape[0], -1)
    # Setting n_neighbors to None leads to a default choice
    embedding = pacmap.PaCMAP(n_components=dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    # fit the data (the index of transformed data corresponds to the index of the original data)
    #embedding.fit(X, init="pca")
    #dr = embedding.transform(X2)
    # Fit and transform using a smaller dataset (X_small)
    embedding_small = embedding.fit_transform(X, init='pca', save_pairs=True)
    # Now, use the fitted model to transform a larger dataset (X_large)
    dr = embedding.transform(X2, init='pca', save_pairs=False)
    end = time.time()
    t = end - start
Hello, thank you for PACMAP, beautiful work.
I second this question. I am hitting:
AssertionError: If the annoyindex is not cached, the original dataset must be provided.
when I call the transform method on a new dataset after the model has already been fit on a previous one. It is desirable to be able to transform new data into an existing embedding space. Can you provide some guidance on this?
EDIT: this was due to the fact that I had not specified save_tree = True. Might be good to spell that out a bit more clearly in the documentation! Thank you :)
Hi, thanks for developing PaCMAP, lovely work!
I found that using transform after using fit_transform on the same set of features yields different results. I ran the following example:

And it returns:

I would expect the same results because fit_transform should be the combination of fit and transform (regardless of the implementation details). This is what PCA in sklearn and UMAP do.

Is this an intended feature? And if the answer is no, what should we do? One possible solution I found is:

But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.

PS: this has nothing to do with the random seed; since I fixed the random seed, I get the same result across runs.
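The sklearn contract referred to above can be checked directly. A minimal sketch using PCA only (where fit_transform is documented to match fit followed by transform); this is a generic illustration, not the original example from the report:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))

# fit_transform(X) on one estimator ...
Y1 = PCA(n_components=2).fit_transform(X)
# ... versus fit(X) followed by transform(X) on another
Y2 = PCA(n_components=2).fit(X).transform(X)

# For PCA these agree up to floating-point error; the report above
# shows that PaCMAP's transform() does not satisfy this property.
print(np.allclose(Y1, Y2))  # → True
```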