YingfanWang / PaCMAP

PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
Apache License 2.0

Crashing when fitting PaCMAP #72

Closed krishnaxamin closed 5 months ago

krishnaxamin commented 6 months ago

Hi,

I am trying to use PaCMAP on a 68x249 dataset (68 points, 249 dimensions), but it crashes. Here is my code:

import pacmap

linear_pacmap = pacmap.PaCMAP(n_neighbors=None, random_state=42)
linear_pacmap.fit(for_linear_pca_scaled, init='pca')

And here is the error I get: Process finished with exit code -1073741819 (0xC0000005)

I installed PaCMAP via pip and am on Windows 11, using PyCharm. Do you have any ideas for what is happening? Thanks.

hyhuang00 commented 6 months ago

Hi there, thank you for your interest in PaCMAP. It's hard to know the real underlying cause of this error. I notice that this dataset seems to be too small, and the number of dimensions is larger than the number of points, which may cause problems during the PCA. Can you try a larger dataset and see if you get the same error?
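As a quick illustration of that check, something like the following sketch can catch the case before fitting (the function name and threshold logic are mine, not part of PaCMAP's API):

import numpy as np

def check_shape_before_pacmap(X):
    # PCA initialization can misbehave when there are fewer points than dimensions
    n_samples, n_features = np.shape(X)
    if n_samples < n_features:
        raise ValueError(f"{n_samples} points < {n_features} dimensions; "
                         "consider reducing the dimensionality first")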

abhishek-ghose commented 6 months ago

@hyhuang00 your check helped me, thank you! At some point in my code, the number of points was less than the number of dimensions; handling these cases appropriately got rid of the issue.

A follow-up question, if you don't mind: is there no way to use PaCMAP in this case? Even when I set init='random', this crash (# points < # dimensions) occurs. But this is a common case: e.g., you may have 200 text documents represented with 512-dimensional sentence embeddings.

hyhuang00 commented 6 months ago

@abhishek-ghose Glad to hear it solved your case. Regarding your follow-up question: the problem probably does not arise from the PaCMAP code itself, but from the annoy module that is used to perform the nearest neighbor calculation. It would be great if you could provide the error you got when PaCMAP crashed, and I can look into this problem in more detail.
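In the meantime, one way to narrow it down on your side is to run annoy by itself on the same tiny input (a sketch using annoy's documented AnnoyIndex API); if annoy is the culprit, this should crash too:

import numpy as np
from annoy import AnnoyIndex

X = np.random.random((5, 50)).astype(np.float32)
index = AnnoyIndex(X.shape[1], 'euclidean')  # dimension and metric
for i, row in enumerate(X):
    index.add_item(i, row)
index.build(20)  # number of random-projection trees
print(index.get_nns_by_item(0, 4))  # 4 nearest neighbors of point 0 (includes itself)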

abhishek-ghose commented 6 months ago

Thanks! The error is: Process finished with exit code 139 (interrupted by signal 11:SIGSEGV)

Funnily enough, testing it a bit more, it looks like this only happens for some sizes. Here's some sample code:

import numpy as np
import pacmap

def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")

    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")

If I make these function calls:

pacmap_issue_72(num_points=100, num_dims=500)
pacmap_issue_72(num_points=5, num_dims=50)

This is the output I see:

Shape of original data: (100, 500)
Shape of proj. data: (100, 2)

Shape of original data: (5, 50)

Process finished with exit code 139 (interrupted by signal 11:SIGSEGV)

Do you think this might be happening when the number of instances is too small?

EDIT: Python syntax highlighting.

hyhuang00 commented 5 months ago

It seems like this issue might be platform-specific? I tested your code and it runs without the SIGSEGV problem. Here's an edited version of your code that I used to perform the test; running your original version doesn't lead to any problems either.

import numpy as np
import sklearn
import numba
import pacmap

def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")

    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")

if __name__ == "__main__":
    print(f"Numpy: {np.__version__}")
    print(f"Numba: {numba.__version__}")
    print(f"Scikit-learn: {sklearn.__version__}")
    print(f"PaCMAP: {pacmap.__version__}")
    for i in range(20):
        num_pts = 100 - i * 5
        pacmap_issue_72(num_points=num_pts, num_dims=50)

For reference, I'm using NumPy 1.24.4, Numba 0.57.1, scikit-learn 1.1.3, and pacmap 0.7.2. Experiments were performed on a MacBook.

abhishek-ghose commented 5 months ago

Thanks for checking it. Using Python's faulthandler I was able to trace the issue to a numba call.
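For anyone who wants to reproduce the trace, enabling it takes two lines from the standard library (a minimal sketch; pacmap_issue_72 is the function from my earlier comment):

import faulthandler

faulthandler.enable()  # dump the Python-level stack when the process receives SIGSEGV
pacmap_issue_72(num_points=5, num_dims=50)  # the crashing call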


Current thread 0x00007f0610b680c0 (most recent call first):
  File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 479 in generate_pair
  File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 1029 in sample_pairs
  File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 911 in fit
  File "/media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py", line 952 in fit_transform
  File "/media/aghose/DATA/sources/misc/pacmap_demo.py", line 218 in pacmap_issue_72
  File "/media/aghose/DATA/sources/misc/pacmap_demo.py", line 251 in <module>

The stack points to line 479 in pacmap.py, but walking through the codebase I noticed that the code exits at line 114 (in the call to sample_neighbors_pair()):

for i in numba.prange(n):

I will try digging deeper.

hyhuang00 commented 5 months ago

@abhishek-ghose I noticed that you are installing pacmap in a non-default location. Are you using it in a cluster environment? If so, it's possible that a discrepancy between the node where you compile the pacmap code and the node where you actually run it causes the error. For a quick fix, you can try manually setting all cache configurations to cache=False in your pacmap code (say, /media/aghose/DATA/anaconda39/lib/python3.9/site-packages/pacmap/pacmap.py), and then retry this experiment. I will try to find a way to make this parameter tunable by users.
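For concreteness, the suggested edit looks like this (a sketch with a dummy function; the actual decorated functions live in pacmap.py):

import numba

@numba.njit(cache=False)  # was cache=True: stop reusing the on-disk compiled cache
def dummy_kernel(n):
    total = 0.0
    for i in range(n):
        total += i
    return total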

abhishek-ghose commented 5 months ago

Sorry for disappearing for a while. I realized my setup was leading to some other OpenMP-related errors, so I decided to perform a clean install, after which I have:

  1. OS Ubuntu 24.04
  2. numpy 1.26.4
  3. numba 0.59.1
  4. scikit-learn 1.2.2
  5. pacmap 0.7.2

Alas, the bug survives :-/

This is a single machine; Python just happens to be in a non-standard location. I tried your suggestion (thanks!) but that didn't work. Since the general direction seemed to be to avoid relying on the cache, I also tried clearing temp files as mentioned in this SO thread. Unfortunately, the crash still happens.

What finally worked - and this is an extremely hacky solution - is to add some synthetic points, which are slightly perturbed versions of the original points (so as not to disturb the relative spacing), to bring the number of points up to 20; with that, I don't see the crash. Yes, this is a bad fix.
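For reference, here is a sketch of that hack (the helper name and noise scale are mine):

import numpy as np

def pad_with_jitter(X, min_points=20, noise_scale=1e-4, seed=0):
    # Append slightly perturbed copies of existing rows until X has min_points rows.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if n >= min_points:
        return X
    idx = rng.integers(0, n, size=min_points - n)  # rows to duplicate
    noise = rng.normal(0.0, noise_scale, size=(min_points - n, d))
    # The first n rows of the resulting embedding correspond to the real points.
    return np.vstack([X, X[idx] + noise])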

zeitderforschung commented 5 months ago

import numpy as np
import sklearn
import numba
import pacmap

def pacmap_issue_72(num_points, num_dims):
    proj_dims = 2
    X = np.random.random((num_points, num_dims))
    print(f"\nShape of original data: {np.shape(X)}")

    embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    X_transformed = embedding.fit_transform(X, init="pca")
    print(f"Shape of proj. data: {np.shape(X_transformed)}")

if __name__ == "__main__":
    print(f"Numpy: {np.__version__}")
    print(f"Numba: {numba.__version__}")
    print(f"Scikit-learn: {sklearn.__version__}")
    print(f"PaCMAP: {pacmap.__version__}")
    for i in range(20):
        num_pts = 100 - i * 5
        pacmap_issue_72(num_points=num_pts, num_dims=50)

I ran this code in a Jupyter notebook with %env NUMBA_DISABLE_JIT=1 and it failed. I think the error is that n_neighbors > scaled_sort.shape[0] in sample_neighbors_pair. Additionally, np.empty() is used in sample_neighbors_pair instead of np.zeros(), which would cause another out-of-bounds error in pacmap_grad, as the random values will be used as array positions.
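As a minimal illustration of the np.empty() hazard (plain numpy, not pacmap code): the array holds arbitrary leftover values, so using them as array positions can land out of bounds.

import numpy as np

pairs = np.empty((4, 2), dtype=np.int64)  # uninitialized: contents are arbitrary leftovers
nbrs = np.arange(10)
i = pairs[0, 1]
if not (0 <= i < nbrs.shape[0]):
    print(f"uninitialized index {i} would read out of bounds of nbrs")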

👉 My guess is that without disabling jit, the illegal memory access won't cause a segmentation error on some systems, but it will on others.

Numpy: 1.26.3
Numba: 0.60.0
Scikit-learn: 1.2.2
PaCMAP: 0.7.2

Shape of original data: (100, 50)
Shape of proj. data: (100, 2)

Shape of original data: (95, 50)
Shape of proj. data: (95, 2)

Shape of original data: (90, 50)
Shape of proj. data: (90, 2)

Shape of original data: (85, 50)
Shape of proj. data: (85, 2)

Shape of original data: (80, 50)
Shape of proj. data: (80, 2)

Shape of original data: (75, 50)
Shape of proj. data: (75, 2)

Shape of original data: (70, 50)
Shape of proj. data: (70, 2)

Shape of original data: (65, 50)
Shape of proj. data: (65, 2)

Shape of original data: (60, 50)
Shape of proj. data: (60, 2)

Shape of original data: (55, 50)
Shape of proj. data: (55, 2)

Shape of original data: (50, 50)
Shape of proj. data: (50, 2)

Shape of original data: (45, 50)
Shape of proj. data: (45, 2)

Shape of original data: (40, 50)
Shape of proj. data: (40, 2)

Shape of original data: (35, 50)
Shape of proj. data: (35, 2)

Shape of original data: (30, 50)
Shape of proj. data: (30, 2)

Shape of original data: (25, 50)
Shape of proj. data: (25, 2)

Shape of original data: (20, 50)
Shape of proj. data: (20, 2)

Shape of original data: (15, 50)
Shape of proj. data: (15, 2)

Shape of original data: (10, 50)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[4], line 22
     20 for i in range(20):
     21     num_pts = 100 - i * 5
---> 22     pacmap_issue_72(num_points=num_pts, num_dims=50)

Cell In[4], line 12, in pacmap_issue_72(num_points, num_dims)
      9 print(f"\nShape of original data: {np.shape(X)}")
     11 embedding = pacmap.PaCMAP(n_components=proj_dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
---> 12 X_transformed = embedding.fit_transform(X, init="pca")
     13 print(f"Shape of proj. data: {np.shape(X_transformed)}")

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:952, in PaCMAP.fit_transform(self, X, init, save_pairs)
    934 def fit_transform(self, X, init=None, save_pairs=True):
    935     '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
    936 
    937     Parameters
   (...)
    949         Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
    950     '''
--> 952     self.fit(X, init, save_pairs)
    953     if self.intermediate:
    954         return self.intermediate_states

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:911, in PaCMAP.fit(self, X, init, save_pairs)
    894 print_verbose(
    895     "PaCMAP(n_neighbors={}, n_MN={}, n_FP={}, distance={}, "
    896     "lr={}, n_iters={}, apply_pca={}, opt_method='adam', "
   (...)
    908     ), self.verbose
    909 )
    910 # Sample pairs
--> 911 self.sample_pairs(X, self.save_tree)
    912 self.num_instances = X.shape[0]
    913 self.num_dimensions = X.shape[1]

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:1029, in PaCMAP.sample_pairs(self, X, save_tree)
   1027 print_verbose("Finding pairs", self.verbose)
   1028 if self.pair_neighbors is None:
-> 1029     self.pair_neighbors, self.pair_MN, self.pair_FP, self.tree = generate_pair(
   1030         X, self.n_neighbors, self.n_MN, self.n_FP, self.distance, self.verbose
   1031     )
   1032     print_verbose("Pairs sampled successfully.", self.verbose)
   1033 elif self.pair_MN is None and self.pair_FP is None:

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:479, in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose)
    477 scaled_dist = scale_dist(knn_distances, sig, nbrs)
    478 print_verbose("Found scaled dist", verbose)
--> 479 pair_neighbors = sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors)
    480 if _RANDOM_STATE is None:
    481     pair_MN = sample_MN_pair(X, n_MN, option)

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pacmap/pacmap.py:118, in sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors)
    116     for j in numba.prange(n_neighbors):
    117         pair_neighbors[i*n_neighbors + j][0] = i
--> 118         pair_neighbors[i*n_neighbors + j][1] = nbrs[i][scaled_sort[j]]
    119 return pair_neighbors

IndexError: index 9 is out of bounds for axis 0 with size 9

hyhuang00 commented 5 months ago

@zeitderforschung Thanks again for the very detailed report. I will push a hotfix today and prepare release 0.7.3 to fix this problem.

zeitderforschung commented 5 months ago

Thank you very much for this great piece of software, which has been the "secret sauce" in many of our projects.

hyhuang00 commented 5 months ago

The problem should be solved in the latest release. Please reopen the issue if you encounter similar problems again.

abhishek-ghose commented 5 months ago

I can confirm that the segfault doesn't occur anymore with release 0.7.3 - thank you!