Open liuchenbaidu opened 5 years ago
@liuchenbaidu indeed, that code doesn't work with sparse matrices; the test actually uses dense matrices, which is why this went unnoticed. I did implement this separately somewhere using scikit-learn's Euclidean distance, but it is so much slower than cosine that it raises the question of whether you need it.
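One way to check whether Euclidean distance is needed at all: `TfidfVectorizer` L2-normalizes its output by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so both measures rank neighbours the same way. The sketch below is only an illustration of that identity using scikit-learn, not pysparnn code; it reuses the English sample sentences from the issue and assumes every query shares at least one token with the fitted vocabulary (an all-zero query vector would break the identity).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["hello world", "oh hello there", "Play it", "Play it again Sam"]
queries = ["oh there", "Play it again Frank"]

# Default norm="l2": every non-empty row comes out with unit length.
tv = TfidfVectorizer()
doc_vecs = tv.fit_transform(docs)      # sparse matrix
query_vecs = tv.transform(queries)     # sparse matrix

# Both helpers accept sparse input and return dense (n_queries, n_docs) arrays.
cos_sim = cosine_similarity(query_vecs, doc_vecs)
euc = euclidean_distances(query_vecs, doc_vecs)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(euc ** 2, 2.0 - 2.0 * cos_sim)

# The closest document is therefore the same under either measure.
nearest_by_euclid = euc.argmin(axis=1)
nearest_by_cosine = cos_sim.argmax(axis=1)

print(identity_holds, nearest_by_euclid, nearest_by_cosine)
```

If the vectors are not normalized (for example `TfidfVectorizer(norm=None)`), the identity no longer holds and the two measures can rank neighbours differently.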
import pysparnn
import pysparnn.cluster_index as ci
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]
# This second assignment overrides the English sample above.
data = [
    '你在干什么',    # "What are you doing?"
    '你在干啥子',    # "What are you doing?" (dialect)
    '你在做什么',    # "What are you doing?"
    '你好啊',        # "Hello!"
    '我喜欢吃香蕉',  # "I like eating bananas"
]

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)
print(type(features_vec), features_vec.shape)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data, pysparnn.matrix_distance.SlowEuclideanDistance)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank',
]
# This second assignment overrides the English queries above.
search_data = [
    '你在干啥',      # "What are you doing?"
    '我喜欢吃香蕉',  # "I like eating bananas"
]
search_features_vec = tv.transform(search_data)

res = cp.search(search_features_vec, k=3, k_clusters=3, return_distance=False)
print(res)
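Until `SlowEuclideanDistance` handles sparse input, an exact Euclidean search over the sparse TF-IDF matrix can be done with scikit-learn alone. The following is a minimal sketch of that fallback, assuming the data set is small enough for an exhaustive scan; it is not pysparnn's implementation and does no approximate or clustered search. It uses the English sample sentences from the snippet above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["hello world", "oh hello there", "Play it", "Play it again Sam"]
queries = ["oh there", "Play it again Frank"]

tv = TfidfVectorizer()
doc_vecs = tv.fit_transform(docs)  # stays sparse

# algorithm="brute" is the setting that accepts sparse matrices with the
# Euclidean metric; every query is compared against every document.
nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="euclidean")
nn.fit(doc_vecs)

# indices has shape (n_queries, 3); each row lists document positions,
# ordered from smallest to largest distance.
distances, indices = nn.kneighbors(tv.transform(queries))

# Map positions back to the original records, as pysparnn's search() does.
results = [[docs[i] for i in row] for row in indices]
print(results)
```

The cost grows linearly with the number of indexed documents per query, which fits the observation that the Euclidean route is much slower than the cosine index.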