facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!

Index out of Bounds ERROR during Large, Sparse MultiClusterIndex Creation #20

Open kalbmj opened 6 years ago

kalbmj commented 6 years ago

Hello,

I am working with a sparse dataset that has many rows and columns:

    >>> X_train
    <1796130x3231961 sparse matrix of type '<type 'numpy.float64'>'
        with 207786451 stored elements in Compressed Sparse Row format>

I started with the default parameters for MultiClusterIndex creation and had good results on slices with fewer rows. For example, a subset of 12,000 rows took less than a minute to index, and a training subset of 300,000 rows took less than 20 minutes (both subsets used all columns).

When I attempt to run the same command on the entire dataset, it runs for a little over an hour and then throws the following error:

    >>> cp0 = ci.MultiClusterIndex(X_train, Y_train)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 427, in __init__
        distance_type, matrix_size)))
      File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 154, in __init__
        records_data[clustr],
    IndexError: index 1776193 is out of bounds for axis 1 with size 1776130

Do you have any suggestions for resolving this issue, or tweaking the parameters to make this dataset more efficient when creating the MultiClusterIndex?

Thank you in advance.

Update: I ran into the same issue with both Python 3.7 and Python 2.7. Both raise out-of-bounds exceptions, although the offending index on axis 1 differs between runs.
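One thing that may be worth ruling out before tuning parameters: the matrix repr above reports 1796130 rows, while the IndexError mentions an axis of size 1776130. That could simply be a typo in this post, but it could also mean the feature matrix and the records passed as the second argument have different lengths. The sketch below is only a generic sanity check, not part of pysparnn; `check_alignment` is a hypothetical helper name, and it assumes the features object exposes `.shape` and the records support `len()`.

```python
from types import SimpleNamespace


def check_alignment(features, records):
    """Return the shared row count, or raise if the two inputs disagree.

    Hypothetical helper (not a pysparnn function). Assumes `features`
    has a `.shape` tuple (as scipy sparse matrices do) and `records`
    supports len().
    """
    n_rows = features.shape[0]
    n_records = len(records)
    if n_rows != n_records:
        raise ValueError(
            "features has %d rows but records has %d entries"
            % (n_rows, n_records)
        )
    return n_rows


# Stand-in objects so the sketch runs without the real dataset.
fake_features = SimpleNamespace(shape=(3, 5))
fake_records = ["a", "b", "c"]
print(check_alignment(fake_features, fake_records))
```

With the real data, the call would be `check_alignment(X_train, Y_train)` immediately before building the index.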

kalbmj commented 6 years ago

Another related question: with data of this size, would it be best to set matrix_size manually to something smaller, so that the tree ends up with more levels than the recommended 2? Thanks in advance.
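To make the matrix_size question concrete, here is a rough back-of-envelope sketch of how tree depth might scale with matrix_size. It assumes each level fans out by at most matrix_size, which is a simplification and not necessarily how pysparnn actually partitions records; `estimated_levels` is a hypothetical helper, not a library function.

```python
def estimated_levels(num_records, matrix_size):
    """Estimate tree depth under a simple fan-out model.

    Assumption (not taken from the pysparnn source): every level can
    address up to `matrix_size` children, so `levels` is the smallest
    integer with matrix_size ** levels >= num_records.
    Integer arithmetic avoids floating-point log rounding.
    """
    if num_records < 1 or matrix_size < 2:
        raise ValueError("need num_records >= 1 and matrix_size >= 2")
    levels = 0
    capacity = 1
    while capacity < num_records:
        capacity *= matrix_size
        levels += 1
    return levels


num_records = 1796130  # row count reported for X_train above
for size in (2000, 500, 100):
    print(size, estimated_levels(num_records, size))
```

If a smaller value looks reasonable, it would be passed as the matrix_size argument that appears in the traceback, e.g. `ci.MultiClusterIndex(X_train, Y_train, matrix_size=500)`. Whether this avoids the IndexError is untested.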