Wow, thanks! I'll take a look.
Do you have any intuition for why this change needs to happen?
`records_index = np.arange(features.shape[0])`
to
`records_index = list(np.arange(features.shape[0]))`
The other changes you suggest should be compatible. And this line:
`if feature != None and record != None:`
should be something like:
`if feature is not None and record is not None:`
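For context, a minimal sketch of why `!= None` is risky with numpy inputs (assuming a recent numpy, where comparison against None broadcasts elementwise):

```python
import numpy as np

feature = np.array([1.0, 2.0])

# With recent numpy, `feature != None` broadcasts elementwise and
# returns a boolean array, so using it in an `if` raises
# "The truth value of an array with more than one element is ambiguous".
try:
    if feature != None:
        pass
except ValueError as err:
    print(err)

# `is not None` tests object identity and always yields a plain bool.
if feature is not None:
    print("ok")
```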
Thanks again!!!
"np.range" produces an iterator. It's like "range" in python3. You need to wrap a "list" function around it.
Btw, check out https://github.com/known-ai/KeyedVectorsANN
I folded your code into Gensim's KeyedVectors.
It was easier to fold all the code into one file, but I can refactor to use the pysparnn package when it's compatible with Python 3. I made some changes: I added a new method, most_similar, and I store indexes as the records_data instead of the actual words, which saves some space.
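A minimal sketch of the index-as-records_data idea (not the actual KeyedVectorsANN code; it assumes the `ClusterIndex` and `DenseCosineDistance` APIs as they appear in this era of pysparnn):

```python
import numpy as np
import pysparnn.matrix_distance
from pysparnn.cluster_pruning import ClusterIndex

words = ["apple", "banana", "cherry", "date"]
vectors = np.random.rand(len(words), 50)

# Store integer row ids as records_data rather than the word strings;
# each word is kept once in `words` and looked up on the way out.
index = ClusterIndex(vectors, np.arange(len(words)),
                     distance_type=pysparnn.matrix_distance.DenseCosineDistance)

results = index.search(vectors[0:1], k=2, return_distance=False)
print([words[int(i)] for i in results[0]])
```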
My model is 260MB, and I'd like to find out how to reduce that size. I suspect it's mostly duplicate copies of the matrices.
Feel free to email me directly at ontocord@gmail.com
Thanks @known-ai! I made the requested changes in this diff: https://github.com/facebookresearch/pysparnn/commit/1f976fa4d5c474bdee3e119f11e45764b3278447
I'll send you an email.
I am not sure that there is much extra that is kept around in memory.
Check this modification to DenseMatrix (dense_matrix-Copy1.pdf), which also includes a study of data sizes. The input features matrix is about the same size as the ClusterIndex data structure. You can reduce the memory footprint by 4x (so long as your data can fit well into an int16); see the DenseIntCosineDistance class.
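To make the 4x arithmetic concrete, a minimal sketch of int16 quantization (an illustration only, not the DenseIntCosineDistance code):

```python
import numpy as np

vectors = np.random.rand(10000, 100)        # float64: 8 bytes per value

# Scale into the int16 range: 2 bytes per value, a 4x reduction
# versus float64 (the data must fit well into an int16).
scale = 32767.0 / np.abs(vectors).max()
vectors_i16 = (vectors * scale).astype(np.int16)

print(vectors.nbytes, vectors_i16.nbytes)   # 8000000 vs 2000000
```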
Cool! I will try it, Spence.
I am experimenting with a different selection of the clusters, based on an ontology derived from the vectors.
I'll check out the paper!
Huu
^ Very cool. I think there is probably a 'better' (for some definition of better) way to pick the clusters than random selection. I am going to leave this open, but I'll close it in 2 weeks if the thread dies down.
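One candidate for 'better', offered only as an illustration: k-means++-style seeding, which spreads the seeds apart instead of sampling them uniformly at random.

```python
import numpy as np

def kmeanspp_seeds(features, num_clusters, rng=None):
    """Pick seed rows spread apart, k-means++ style."""
    rng = rng if rng is not None else np.random.default_rng()
    seeds = [rng.integers(len(features))]
    for _ in range(num_clusters - 1):
        # Squared distance from every row to its nearest chosen seed.
        dists = np.min(
            ((features[:, None, :] - features[seeds][None, :, :]) ** 2).sum(-1),
            axis=1)
        # Sample the next seed with probability proportional to that distance,
        # so far-away rows are favored over rows near existing seeds.
        seeds.append(rng.choice(len(features), p=dists / dists.sum()))
    return seeds

features = np.random.rand(100, 8)
print(kmeanspp_seeds(features, 5))
```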
I got it working for Anaconda3 by doing the following:
In cluster_pruning.py:

```diff
123c123
< records_index = np.arange(features.shape[0])
---
> records_index = list(np.arange(features.shape[0]))
```
In matrix_distance.py:

```diff
123c123,124
< arg_index = np.random.choice(len(scores), k, replace=False)
```
In __init__.py:

```diff
7c7
< from cluster_pruning import ClusterIndex, MultiClusterIndex
---
> from .cluster_pruning import ClusterIndex, MultiClusterIndex
```
I think you should just create two more files for ClusterIndex and MultiClusterIndex. Otherwise it will cause issues with importing in Python 3 and with backwards compatibility for Python 2.
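For what it's worth, a sketch of an alternative that avoids splitting the classes into separate files, assuming the only goal is a single codebase that imports cleanly on both interpreters:

```python
# pysparnn/__init__.py
from __future__ import absolute_import  # gives Python 2 the same import semantics as Python 3

# An explicit relative import resolves correctly on Python 2.5+ and Python 3,
# so ClusterIndex and MultiClusterIndex can stay in one module.
from .cluster_pruning import ClusterIndex, MultiClusterIndex
```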