facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!

Python3/Anaconda compatibility #12

Closed huu4ontocord closed 7 years ago

huu4ontocord commented 7 years ago

I got it working for Anaconda3 by doing the following:

In cluster_pruning.py:

    123c123
    < records_index = np.arange(features.shape[0])
    ---
    >         records_index = list(np.arange(features.shape[0]))

    131c131
    < np.arange(clusters_selection.shape[0]))
    ---
    >                              list(np.arange(clusters_selection.shape[0])))

    223c223
    < if feature <> None and record <> None:
    ---
    >     if feature != None and record != None:

    273a274
    > elements = list(elements)
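The 223c223 change is needed just to get the file to parse under Python 3, since the <> operator was removed from the language; a tiny sketch of the before/after (the values are made up, not pysparnn objects):

    # Python 2 only (SyntaxError on Python 3):
    #     if feature <> None and record <> None:
    # Works on both 2 and 3 (see the follow-up below about preferring "is not None"):
    feature, record = 1, 2      # illustrative placeholders
    if feature != None and record != None:
        print("both present")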

In matrix_distance.py:

    123c123,124
    < arg_index = np.random.choice(len(scores), k, replace=False)
    ---
    >             lenScores = len(scores)
    >             arg_index = np.random.choice(lenScores, min(lenScores, k), replace=False)

329a331
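The min() guard matters because np.random.choice refuses to draw more unique samples than the population holds; a standalone illustration (the scores array here is made up):

    import numpy as np

    scores = np.array([0.1, 0.4, 0.2])    # fewer candidates than k
    k = 5

    # np.random.choice(len(scores), k, replace=False) would raise:
    #     ValueError: Cannot take a larger sample than population when 'replace=False'
    len_scores = len(scores)
    arg_index = np.random.choice(len_scores, min(len_scores, k), replace=False)
    print(arg_index)    # at most len(scores) distinct indices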

In __init__.py:

    7c7
    < from cluster_pruning import ClusterIndex, MultiClusterIndex
    ---
    > from .cluster_pruning import ClusterIndex, MultiClusterIndex

I think you should just create two more files for ClusterIndex and MultiClusterIndex. Otherwise it will cause issues with importing in Python 3 and backwards compatibility for Python 2.
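For what it's worth, the explicit relative form is accepted by Python 2.6+ as well as Python 3, so a single __init__.py may be enough without splitting the classes into separate files; a minimal sketch (the real __init__.py may contain more than this):

    # pysparnn/__init__.py - sketch of a 2/3-compatible top-level import
    from __future__ import absolute_import    # optional: makes Python 2 resolve imports like Python 3

    # explicit relative import, resolved against the pysparnn package on both 2.6+ and 3
    from .cluster_pruning import ClusterIndex, MultiClusterIndex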

spencebeecher commented 7 years ago

Wow, thanks! I'll take a look.

Do you have any intuition for why this change needs to happen - from records_index = np.arange(features.shape[0]) to records_index = list(np.arange(features.shape[0]))?

The other changes you suggest should be compatible. Also, the line if feature != None and record != None: should be something like if (feature is not None) and (record is not None): Thanks again!!!
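The identity check matters once feature is an array- or matrix-like object rather than a scalar; a hedged illustration with a NumPy array standing in for the feature matrix (recent NumPy behaviour):

    import numpy as np

    feature = np.array([0.0, 1.0, 2.0])    # stand-in for a feature matrix

    # Equality against None is applied elementwise, so "feature != None" yields a
    # boolean array, and putting that array in an "if" raises
    #     ValueError: The truth value of an array with more than one element is ambiguous...
    print(feature != None)                 # [ True  True  True]

    # The identity check is unambiguous and is what the final fix should use
    if feature is not None:
        print("feature is present")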

huu4ontocord commented 7 years ago

"np.range" produces an iterator. It's like "range" in python3. You need to wrap a "list" function around it.

Btw, check out https://github.com/known-ai/KeyedVectorsANN

I folded your code into Gensim's KeyedVectors.

It was easier to fold all the code into one file, but I can refactor to use the pysparnn package once it's compatible with Python 3. I made some changes to add a new method, "most_similar", and to store indexes as the records_data instead of the actual words. This saves some space.

My model is 260MB, and I'd like to find out how to reduce this size. I suspect it's mostly duplicates of the matrices.

Feel free to email me directly at ontocord@gmail.com

spencebeecher commented 7 years ago

Thanks @known-ai! I made the requested changes in this diff - https://github.com/facebookresearch/pysparnn/commit/1f976fa4d5c474bdee3e119f11e45764b3278447

spencebeecher commented 7 years ago

I'll send you an email.

spencebeecher commented 7 years ago

I am not sure that there is much extra that is kept around in memory.

Check this modification to DenseMatrix (dense_matrix-Copy1.pdf), which also includes a study of data sizes. The input features matrix is about the size of the ClusterIndex data structure, and you can reduce the memory footprint by 4x (so long as your data can fit well into an int16) - see the DenseIntCosineDistance class.
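The 4x figure is just the dtype width: float64 stores 8 bytes per value versus 2 bytes for int16. A quick back-of-the-envelope check with NumPy (the array shape is made up; DenseIntCosineDistance is the pysparnn class mentioned above):

    import numpy as np

    # 10,000 vectors of dimension 300, NumPy's default float64
    vectors = np.random.rand(10000, 300)
    print(vectors.dtype, vectors.nbytes / 1e6, "MB")    # float64, 24.0 MB

    # scale and cast to int16 - 4x smaller, at the cost of quantizing the values
    scaled = (vectors * 32767).astype(np.int16)
    print(scaled.dtype, scaled.nbytes / 1e6, "MB")      # int16, 6.0 MB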

huu4ontocord commented 7 years ago

Cool! I will try it Spence.

I am experimenting with a different selection of the clusters based on a derived ontology from the vectors.

I'll check out the paper!

Huu

On Mar 18, 2017, at 6:04 PM, Spence Beecher notifications@github.com wrote:

I am not sure that there is much extra that is kept around in memory.

Check this modification to DenseMatrix (dense_matrix-Copy1.pdf), which also includes a study of data sizes.

The input features matrix is about the size of the ClusterIndex data structure. You can reduce the memory footprint by 4x (so long as your data can fit well into an int16) - see the DenseIntCosineDistance class.

spencebeecher commented 7 years ago

^ Very cool. I think there is probably a 'better' (for some definition of better) way to pick the clusters other than random. I am going to leave this open, but I'll close it in 2 weeks if the thread dies down.