facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!

Problem with large data set #19

Closed Fethbita closed 6 years ago

Fethbita commented 6 years ago

When I try to build the search index just like in the example, I get a MemoryError. The data I use is wiki data, and I have ~50 GB of RAM that can be used.

MemoryError                               Traceback (most recent call last)
<ipython-input-15-9815ea3148b7> in <module>()
      1 import pysparnn.cluster_index as ci
----> 2 cp = ci.MultiClusterIndex(tfidfdtmatrix, doctexts)
      3 k = 10
      4 nq = 1

~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, num_indexes)
    425         for _ in range(num_indexes):
    426             self.indexes.append((ClusterIndex(features, records_data,
--> 427                                               distance_type, matrix_size)))
    428 
    429     def insert(self, feature, record):

~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, parent)
    121         else:
    122             self.is_terminal = False
--> 123             records_data = _np.array(records_data)
    124 
    125             records_index = list(_np.arange(features.shape[0]))

MemoryError: 
>>> tfidfdtmatrix
<310993x1225250 sparse matrix of type '<class 'numpy.float64'>'
    with 34322030 stored elements in Compressed Sparse Row format>
>>> print(type(doctexts))
<class 'list'>
>>> print(type(doctexts[0]))
<class 'str'>
Fethbita commented 6 years ago

Oh, I understand. Instead of passing the whole data set, I can pass indices, so it won't have to allocate that much space.

import numpy as np
import pysparnn.cluster_index as ci

# Use integer indices as the records instead of the full document texts.
doc_index = np.arange(len(doctexts), dtype=int)
cp = ci.MultiClusterIndex(tfidfdtmatrix, doc_index)

This fixed it. Is there an easy way to write the search index to disk?
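
For anyone else hitting this: the index now returns the integer records, which can be mapped back to doctexts afterwards. A minimal sketch, assuming the search() call from the project README (k_clusters=2 is just an example value; k and nq are as defined above):

# Query the index; with return_distance=False each row is a list of records,
# which here are the integer indices we inserted.
results = cp.search(tfidfdtmatrix[:nq], k=k, k_clusters=2, return_distance=False)

# Map the returned indices back to the original document texts.
nearest_texts = [[doctexts[int(i)] for i in row] for row in results]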

spencebeecher commented 6 years ago

The pickle package should work!
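
A minimal sketch of that, assuming cp is the built index from above and an example file name:

import pickle

# Save the built index to disk (the file name is just an example).
with open('search_index.pkl', 'wb') as handle:
    pickle.dump(cp, handle)

# Load it back later.
with open('search_index.pkl', 'rb') as handle:
    cp = pickle.load(handle)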

Fethbita commented 6 years ago

Thanks.