Hi there, thank you for your interest in PaCMAP. It's hard to know the real underlying cause of this error. I notice that this dataset seems to be too small: the number of dimensions is larger than the number of points you have, which may cause problems during the PCA. Can you try a larger dataset and see if you receive the same error?
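As a rough sketch of the kind of check being suggested here (the function name is made up, not part of PaCMAP's API):

import numpy as np

def has_more_points_than_dims(X):
    # PCA can produce at most min(n_points, n_dims) components, so datasets
    # with fewer points than dimensions are a warning sign for PCA-based init.
    n_points, n_dims = X.shape
    return n_points > n_dims

X = np.random.random((5, 50))
print(has_more_points_than_dims(X))  # False: 5 points, 50 dimensions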
@hyhuang00 your check helped me, thank you! At some point in my code, the number of points was less than the number of dimensions; appropriately handling these cases got rid of the issue.
A follow-up question, if you don't mind: is there no way to use PaCMAP for this case? Even when I set init='random', the crash in the # points < # dimensions case still occurs. And this is a common case: e.g., you may have 200 text documents, and you are representing the text with 512-dimensional sentence embeddings.
@abhishek-ghose Glad to hear it solved your case. Regarding your follow-up question: the problem probably does not arise from the PaCMAP code itself, but from the annoy module that is used to perform the nearest neighbor calculation. It would be great if you could provide the error you got when PaCMAP crashed, so I can look into this problem in more detail.
Thanks! The error is:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Funnily enough, testing it out a bit more, it looks like this only happens for some values. Here's some sample code:
import numpy as np
import pacmap
def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")
    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")
If I make these function calls:
pacmap_issue_72(num_points=100, num_dims=500)
pacmap_issue_72(num_points=5, num_dims=50)
This is the output I see:
Shape of original data: (100, 500)
Shape of proj. data: (100, 2)
Shape of original data: (5, 50)
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Do you think this might be happening when the number of instances is too small?
It seems like this issue might be platform-specific? I tested your code and it runs without the SIGSEGV problem. Here's an edited version of your code that I used to perform the test. Running your original version doesn't lead to any problems either.
import numpy as np
import sklearn
import numba
import pacmap
def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")
    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")

if __name__ == "__main__":
    print(f"Numpy: {np.__version__}")
    print(f"Numba: {numba.__version__}")
    print(f"Scikit-learn: {sklearn.__version__}")
    print(f"PaCMAP: {pacmap.__version__}")
    for i in range(20):
        num_pts = 100 - i * 5
        pacmap_issue_72(num_points=num_pts, num_dims=50)
For reference, I'm using Numpy 1.24.4, Numba 0.57.1, Scikit-learn 1.1.3, and pacmap 0.7.2. Experiments were performed on a MacBook.
Thanks for checking it. Using Python's faulthandler I was able to trace the issue to a numba call.
Current thread 0x00007f0610b680c0 (most recent call first):
File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 479 in generate_pair
File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 1029 in sample_pairs
File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 911 in fit
File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 952 in fit_transform
File "/media/aghose/DATA/sources/misc/pacmap_demo.py", line 218 in pacmap_issue_72
File "/media/aghose/DATA/sources/misc/pacmap_demo.py", line 251 in <module>
The stack points to line 479 in pacmap.py, but walking through the codebase I noticed that the code exits at line 114, in the call to sample_neighbors_pair():

for i in numba.prange(n):

I will try digging deeper.
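For anyone wanting to reproduce this kind of trace, enabling faulthandler takes two standard-library lines before the crashing call (nothing PaCMAP-specific):

import faulthandler

faulthandler.enable()  # on SIGSEGV, dump the Python-level traceback to stderr
# ...then run the crashing call, e.g. pacmap_issue_72(num_points=5, num_dims=50)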
@abhishek-ghose I noticed that you are installing pacmap in a non-default location. Are you using it in a cluster environment? If that's the case, it's possible that a discrepancy between the node where you compile the pacmap code and the node where you actually use it causes the error. For a quick fix of this problem, you can try manually setting every cache configuration to cache=False in your copy of the pacmap code (say, /media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py), and then retry this experiment. I will try to find a way to make this parameter tunable by users.
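For context, the cache flag in question lives in numba's JIT decorators; the decorator below only illustrates the pattern and is not a line copied from pacmap.py:

import numba

@numba.njit(parallel=True, nogil=True, cache=False)  # was cache=True
def jitted_sum(n):
    # With cache=False the function is recompiled on the current machine
    # instead of loading a cached binary compiled on a different node.
    total = 0
    for i in numba.prange(n):
        total += i
    return total

print(jitted_sum(10))  # 45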
Sorry for disappearing for a while. I realized my setup was leading to some other OpenMP-related errors, so I decided to perform a clean install, after which I have: numpy 1.26.4, numba 0.59.1, scikit-learn 1.2.2, pacmap 0.7.2. Alas, the bug survives :-/
This is a single machine; Python just happens to be in a non-standard location. I tried your suggestion (thanks!) but that didn't work. Since the general direction seemed to be to not rely on the cache, I also tried clearing temp files as mentioned in this SO thread. Unfortunately, the crash still happens.
What finally worked - and this is an extremely hacky solution - is to add some synthetic points, which are slightly perturbed versions of the original points (so as not to disturb the relative spacing), to bring the number of points up to 20; with that, I don't see the crash. Yes, this is a bad fix.
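A sketch of that workaround, with made-up names and a hand-picked noise scale; the synthetic rows can be dropped from the embedding afterwards:

import numpy as np

def pad_with_perturbed_points(X, min_points=20, noise_scale=1e-4, seed=0):
    # Duplicate randomly chosen rows with tiny Gaussian noise until there are
    # at least min_points rows, roughly preserving the relative spacing.
    rng = np.random.default_rng(seed)
    n_missing = min_points - X.shape[0]
    if n_missing <= 0:
        return X
    idx = rng.integers(0, X.shape[0], size=n_missing)
    synthetic = X[idx] + noise_scale * rng.standard_normal((n_missing, X.shape[1]))
    return np.vstack([X, synthetic])

X_padded = pad_with_perturbed_points(np.random.random((10, 50)))
print(X_padded.shape)  # (20, 50); the first 10 rows are the original points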
import numpy as np
import sklearn
import numba
import pacmap

def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")
    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")

if __name__ == "__main__":
    print(f"Numpy: {np.__version__}")
    print(f"Numba: {numba.__version__}")
    print(f"Scikit-learn: {sklearn.__version__}")
    print(f"PaCMAP: {pacmap.__version__}")
    for i in range(20):
        num_pts = 100 - i * 5
        pacmap_issue_72(num_points=num_pts, num_dims=50)
I ran this code in a Jupyter notebook with %env NUMBA_DISABLE_JIT=1 and it failed. I think the error is that n_neighbors > scaled_sort.shape[0] in sample_neighbors_pair. Additionally, np.empty() is used in sample_neighbors_pair instead of np.zeros(), which would cause another out-of-bounds error in pacmap_grad, as the random values will be used as array positions.
👉 My guess is that without disabling JIT, the illegal memory access won't cause a segmentation fault on some systems, but it will on others.
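A tiny illustration of the np.empty() hazard described above (this is not PaCMAP code):

import numpy as np

garbage = np.empty(5, dtype=np.int64)  # uninitialized: contents are arbitrary
data = np.arange(10)
# Indexing with garbage values, e.g. data[garbage], may raise IndexError in
# plain NumPy, but a compiled numba kernel performs no bounds check by default,
# so the same access can silently read or write out of bounds, hence SIGSEGV.
print(garbage)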
Numpy: 1.26.3 Numba: 0.60.0 Scikit-learn: 1.2.2 PaCMAP: 0.7.2
Shape of original data: (100, 50) Shape of proj. data: (100, 2)
Shape of original data: (95, 50) Shape of proj. data: (95, 2)
Shape of original data: (90, 50) Shape of proj. data: (90, 2)
Shape of original data: (85, 50) Shape of proj. data: (85, 2)
Shape of original data: (80, 50) Shape of proj. data: (80, 2)
Shape of original data: (75, 50) Shape of proj. data: (75, 2)
Shape of original data: (70, 50) Shape of proj. data: (70, 2)
Shape of original data: (65, 50) Shape of proj. data: (65, 2)
Shape of original data: (60, 50) Shape of proj. data: (60, 2)
Shape of original data: (55, 50) Shape of proj. data: (55, 2)
Shape of original data: (50, 50) Shape of proj. data: (50, 2)
Shape of original data: (45, 50) Shape of proj. data: (45, 2)
Shape of original data: (40, 50) Shape of proj. data: (40, 2)
Shape of original data: (35, 50) Shape of proj. data: (35, 2)
Shape of original data: (30, 50) Shape of proj. data: (30, 2)
Shape of original data: (25, 50) Shape of proj. data: (25, 2)
Shape of original data: (20, 50) Shape of proj. data: (20, 2)
Shape of original data: (15, 50) Shape of proj. data: (15, 2)
Shape of original data: (10, 50)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[4], line 22
20 for i in range(20):
21 num_pts = 100 - i * 5
---> 22 pacmap_issue_72(num_points=num_pts, num_dims=50)
Cell In[4], line 12, in pacmap_issue_72(num_points, num_dims)
9 print(f"\nShape of original data: {np.shape(X)}")
11 embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
---> 12 X_transformed = embedding.fit_transform(X, init="pca")
13 print(f"Shape of proj. data: {np.shape(X_transformed)}")
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:952, in PaCMAP.fit_transform(self, X, init, save_pairs)
934 def fit_transform(self, X, init=None, save_pairs=True):
935 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
936
937 Parameters
(...)
949 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
950 '''
--> 952 self.fit(X, init, save_pairs)
953 if self.intermediate:
954 return self.intermediate_states
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:911, in PaCMAP.fit(self, X, init, save_pairs)
894 print_verbose(
895 "PaCMAP(n_neighbors={}, n_MN={}, n_FP={}, distance={}, "
896 "lr={}, n_iters={}, apply_pca={}, opt_method='adam', "
(...)
908 ), self.verbose
909 )
910 # Sample pairs
--> 911 self.sample_pairs(X, self.save_tree)
912 self.num_instances = X.shape[0]
913 self.num_dimensions = X.shape[1]
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:1029, in PaCMAP.sample_pairs(self, X, save_tree)
1027 print_verbose("Finding pairs", self.verbose)
1028 if self.pair_neighbors is None:
-> 1029 self.pair_neighbors, self.pair_MN, self.pair_FP, self.tree = generate_pair(
1030 X, self.n_neighbors, self.n_MN, self.n_FP, self.distance, self.verbose
1031 )
1032 print_verbose("Pairs sampled successfully.", self.verbose)
1033 elif self.pair_MN is None and self.pair_FP is None:
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:479, in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose)
477 scaled_dist = scale_dist(knn_distances, sig, nbrs)
478 print_verbose("Found scaled dist", verbose)
--> 479 pair_neighbors = sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors)
480 if _RANDOM_STATE is None:
481 pair_MN = sample_MN_pair(X, n_MN, option)
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:118, in sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors)
116 for j in numba.prange(n_neighbors):
117 pair_neighbors[i*n_neighbors + j][0] = i
--> 118 pair_neighbors[i*n_neighbors + j][1] = nbrs[i][scaled_sort[j]]
119 return pair_neighbors
IndexError: index 9 is out of bounds for axis 0 with size 9
@zeitderforschung Thanks again for the very detailed report. I will push a hotfix today and will prepare release 0.7.3 to fix this problem.
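One plausible shape for such a fix, offered purely as a guess rather than the actual patch, is to clamp the neighbor count to what the dataset can support:

def clamp_n_neighbors(n_points, n_neighbors):
    # A dataset of n points has at most n - 1 possible neighbors per point.
    return min(n_neighbors, max(1, n_points - 1))

print(clamp_n_neighbors(10, 10))  # 9, matching the out-of-bounds index above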
Thank you very much for this great piece of software, which has been the "secret sauce" in many of our projects.
The problem should be solved in the latest release. Please reopen the issue if you encounter similar problems again.
I can confirm that the segfault doesn't occur anymore with release 0.7.3 - thank you!
Hi,
I am trying to use PaCMAP on a 68x249 dataset, but it is not working. Here is my code:
And here is the error I get:
Process finished with exit code -1073741819 (0xC0000005)
I installed PaCMAP via pip and am on Windows 11, using PyCharm. Do you have any ideas for what is happening? Thanks.