Closed zeitderforschung closed 3 months ago
Thank you for the very detailed report, seems like it's closely related to #72 . I will look into your data and see I can replicate it somewhere.
The IndexError occurs because n_neighbors exceeds the available neighbors in the sample_neighbors_pair function.
def sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors):
...
for i in numba.prange(n):
scaled_sort = np.argsort(scaled_dist[i])
effective_neighbors = min(n_neighbors, scaled_sort.shape[0]) <-- that fixes the issue for me
for j in numba.prange(effective_neighbors):
...
But when i test the fixed version with another file pacmap_crash3.zip a new error occurs:
IndexError Traceback (most recent call last)
Cell In[8], line 4
1 #from pacmap import PaCMAP
3 while True:
----> 4 PaCMAP(n_components=2, random_state=42).fit_transform(vecs)
Cell In[5], line 956, in PaCMAP.fit_transform(self, X, init, save_pairs)
938 def fit_transform(self, X, init=None, save_pairs=True):
939 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
940
941 Parameters
(...)
953 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
954 '''
--> 956 self.fit(X, init, save_pairs)
957 if self.intermediate:
958 return self.intermediate_states
Cell In[5], line 919, in PaCMAP.fit(self, X, init, save_pairs)
917 self.num_dimensions = X.shape[1]
918 # Initialize and Optimize the embedding
--> 919 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
920 X,
921 self.n_components,
922 self.pair_neighbors,
923 self.pair_MN,
924 self.pair_FP,
925 self.lr,
926 self.num_iters,
927 init,
928 self.verbose,
929 self.intermediate,
930 self.intermediate_snapshots,
931 pca_solution,
932 self.tsvd_transformer
933 )
934 if not save_pairs:
935 self.del_pairs()
Cell In[5], line 581, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
578 for itr in range(num_iters_total):
579 w_MN, w_neighbors, w_FP = find_weight(w_MN_init, itr, num_iters=num_iters)
--> 581 grad = pacmap_grad(Y, pair_neighbors, pair_MN,
582 pair_FP, w_neighbors, w_MN, w_FP)
583 C = grad[-1, 0]
584 if verbose and itr == 0:
Cell In[5], line 248, in pacmap_grad(Y, pair_neighbors, pair_MN, pair_FP, w_neighbors, w_MN, w_FP)
246 d_ij = 1.0
247 for d in range(dim):
--> 248 y_ij[d] = Y[i, d] - Y[j, d]
249 d_ij += y_ij[d] ** 2
250 loss[0] += w_neighbors * (d_ij/(10. + d_ij))
IndexError: index 1684368754 is out of bounds for axis 0 with size 8
But this time, the error only occurs on the second run of the test code. That's pretty weird. My guess is that the problem here is caused by the randomly initialized np.empty() array in sample_neighbors_pair
, which is used as a memory access pointer in pacmap_grad
, causing an illegal memory access.
However, the out-of-bounds memory access of the compiled Numba code seems to crash my Docker environment, but not my local development machine. Looks like illegal memory access doesn't crash all systems, e.g. MacOS, so no one has noticed it yet. 🤔
Closing this issue as discussed in #72 .
This code crashes randomly with pacmap==0.7.2 . I have attached the numpy array that causes the crash in the pacmap_crash.zip. Strangely, the code works on my local MacBook, but fails on Docker.
Running the code with
NUMBA_DISABLE_JIT=1
prevents the kernel from crashing and you will get the following error message (tested with numba==0.60.0).IndexError Traceback (most recent call last) Cell In[4], line 2 1 while True: ----> 2 PaCMAP(n_components=2, random_state=42).fit_transform(vecs)
File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:952, in PaCMAP.fit_transform(self, X, init, save_pairs) 934 def fit_transform(self, X, init=None, save_pairs=True): 935 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding. 936 937 Parameters (...) 949 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results. 950 ''' --> 952 self.fit(X, init, save_pairs) 953 if self.intermediate: 954 return self.intermediate_states
File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:911, in PaCMAP.fit(self, X, init, save_pairs) 894 print_verbose( 895 "PaCMAP(n_neighbors={}, n_MN={}, n_FP={}, distance={}, " 896 "lr={}, n_iters={}, apply_pca={}, opt_method='adam', " (...) 908 ), self.verbose 909 ) 910 # Sample pairs --> 911 self.sample_pairs(X, self.save_tree) 912 self.num_instances = X.shape[0] 913 self.num_dimensions = X.shape[1]
File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:1029, in PaCMAP.sample_pairs(self, X, save_tree) 1027 print_verbose("Finding pairs", self.verbose) 1028 if self.pair_neighbors is None: -> 1029 self.pair_neighbors, self.pair_MN, self.pair_FP, self.tree = generate_pair( 1030 X, self.n_neighbors, self.n_MN, self.n_FP, self.distance, self.verbose 1031 ) 1032 print_verbose("Pairs sampled successfully.", self.verbose) 1033 elif self.pair_MN is None and self.pair_FP is None:
File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:479, in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose) 477 scaled_dist = scale_dist(knn_distances, sig, nbrs) 478 print_verbose("Found scaled dist", verbose) --> 479 pair_neighbors = sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors) 480 if _RANDOM_STATE is None: 481 pair_MN = sample_MN_pair(X, n_MN, option)
File /usr/local/lib/python3.11/dist-packages/pacmap/pacmap.py:118, in sample_neighbors_pair(X, scaled_dist, nbrs, n_neighbors) 116 for j in numba.prange(n_neighbors): 117 pair_neighbors[in_neighbors + j][0] = i --> 118 pair_neighbors[in_neighbors + j][1] = nbrs[i][scaled_sort[j]] 119 return pair_neighbors
IndexError: index 9 is out of bounds for axis 0 with size 9