YingfanWang / PaCMAP

PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
Apache License 2.0
512 stars 52 forks source link

Segmentation fault due to IndexError #74

Closed zeitderforschung closed 3 months ago

zeitderforschung commented 3 months ago

This code crashes randomly with pacmap==0.7.2 . I have attached the numpy array that causes the crash in the pacmap_crash.zip. Strangely, the code works on my local MacBook, but fails on Docker.

from pacmap import PaCMAP
import numpy as np

vecs = np.load("pacmap_crash.npy")
print(f"shape: {vecs.shape}")

while True:
    PaCMAP(n_components=2, random_state=42).fit_transform(vecs)

Running the code with NUMBA_DISABLE_JIT=1 prevents the kernel from crashing and you will get the following error message (tested with numba==0.60.0).


IndexError Traceback (most recent call last) Cell In[4], line 2 1 while True: ----> 2 PaCMAP(n_components=2, random_state=42).fit_transform(vecs)

File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:952, in PaCMAP.fit_transform(self, X, init, save_pairs) 934 def fit_transform(self, X, init=None, save_pairs=True): 935 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding. 936 937 Parameters (...) 949 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results. 950 ''' --> 952 self.fit(X, init, save_pairs) 953 if self.intermediate: 954 return self.intermediate_states

File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:911, in PaCMAP.fit(self, X, init, save_pairs) 894 print_verbose( 895 "PaCMAP(n_neighbors={}, n_MN={}, n_FP={}, distance={}, " 896 "lr={}, n_iters={}, apply_pca={}, opt_method='adam', " (...) 908 ), self.verbose 909 ) 910 # Sample pairs --> 911 self.sample_pairs(X, self.save_tree) 912 self.num_instances = X.shape[0] 913 self.num_dimensions = X.shape[1]

File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:1029, in PaCMAP.sample_pairs(self, X, save_tree) 1027 print_verbose("Finding pairs", self.verbose) 1028 if self.pair_neighbors is None: -> 1029 self.pair_neighbors, self.pair_MN, self.pair_FP, self.tree = generate_pair( 1030 X, self.n_neighbors, self.n_MN, self.n_FP, self.distance, self.verbose 1031 ) 1032 print_verbose("Pairs sampled successfully.", self.verbose) 1033 elif self.pair_MN is None and self.pair_FP is None:

File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:479, in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose) 477 scaled_dist = scale_dist(knn_distances, sig, nbrs) 478 print_verbose("Found scaled dist", verbose) --> 479 pair_neighbors = sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors) 480 if _RANDOM_STATE is None: 481 pair_MN = sample_MN_pair(X, n_MN, option)

File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:118, in sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors) 116 for j in numba.prange(n_neighbors): 117 pair_neighbors[in_neighbors + j][0] = i --> 118 pair_neighbors[in_neighbors + j][1] = nbrs[i][scaled_sort[j]] 119 return pair_neighbors

IndexError: index 9 is out of bounds for axis 0 with size 9

hyhuang00 commented 3 months ago

Thank you for the very detailed report, seems like it's closely related to #72 . I will look into your data and see I can replicate it somewhere.

zeitderforschung commented 3 months ago

The IndexError occurs because n_neighbors exceeds the available neighbors in the sample_neighbors_pair function.

def sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors):
    ...
    for i in numba.prange(n):
        scaled_sort = np.argsort(scaled_dist[i])
        effective_neighbors = min(n_neighbors, scaled_sort.shape[0])  <-- that fixes the issue for me
        for j in numba.prange(effective_neighbors):
            ...

But when i test the fixed version with another file pacmap_crash3.zip a new error occurs:

IndexError                                Traceback (most recent call last)
Cell In[8], line 4
      1 #from pacmap import PaCMAP 
      3 while True:
----> 4     PaCMAP(n_components=2, random_state=42).fit_transform(vecs)

Cell In[5], line 956, in PaCMAP.fit_transform(self, X, init, save_pairs)
    938 def fit_transform(self, X, init=None, save_pairs=True):
    939     '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
    940 
    941     Parameters
   (...)
    953         Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
    954     '''
--> 956     self.fit(X, init, save_pairs)
    957     if self.intermediate:
    958         return self.intermediate_states

Cell In[5], line 919, in PaCMAP.fit(self, X, init, save_pairs)
    917 self.num_dimensions = X.shape[1]
    918 # Initialize and Optimize the embedding
--> 919 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
    920     X,
    921     self.n_components,
    922     self.pair_neighbors,
    923     self.pair_MN,
    924     self.pair_FP,
    925     self.lr,
    926     self.num_iters,
    927     init,
    928     self.verbose,
    929     self.intermediate,
    930     self.intermediate_snapshots,
    931     pca_solution,
    932     self.tsvd_transformer
    933 )
    934 if not save_pairs:
    935     self.del_pairs()

Cell In[5], line 581, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
    578 for itr in range(num_iters_total):
    579     w_MN, w_neighbors, w_FP = find_weight(w_MN_init, itr, num_iters=num_iters)
--> 581     grad = pacmap_grad(Y, pair_neighbors, pair_MN,
    582                        pair_FP, w_neighbors, w_MN, w_FP)
    583     C = grad[-1, 0]
    584     if verbose and itr == 0:

Cell In[5], line 248, in pacmap_grad(Y, pair_neighbors, pair_MN, pair_FP, w_neighbors, w_MN, w_FP)
    246 d_ij = 1.0
    247 for d in range(dim):
--> 248     y_ij[d] = Y[i, d] - Y[j, d]
    249     d_ij += y_ij[d] ** 2
    250 loss[0] += w_neighbors * (d_ij/(10. + d_ij))

IndexError: index 1684368754 is out of bounds for axis 0 with size 8

But this time, the error only occurs on the second run of the test code. That's pretty weird. My guess is that the problem here is caused by the randomly initialized np.empty() array in sample_neighbors_pair, which is used as a memory access pointer in pacmap_grad, causing an illegal memory access.

However, the out-of-bounds memory access of the compiled Numba code seems to crash my Docker environment, but not my local development machine. Looks like illegal memory access doesn't crash all systems, e.g. MacOS, so no one has noticed it yet. 🤔

hyhuang00 commented 3 months ago

Closing this issue as discussed in #72 .