clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Documents not sortable in py3 #273

Open pachewise opened 5 years ago

pachewise commented 5 years ago

We're hitting an issue with pattern==3.6 on Python 3: when the model documents contain duplicates (among other cases), the nsmallest call used by vector_space_search fails:

from pattern.en import lexeme
from pattern.vector import Document, LEMMA, TFIDF, Model
responses = ['it is works great.  ', 'bristles are soft and compact enough', 'the aftertaste isnt as bad as others. ', 'i dont know. it isnt something i think about.', 'bristles are soft and compact enough']
exclude = ['t', 'im']
docs = [Document(response, stemmer=LEMMA, name=str(i), exclude=exclude, stopwords=False) for i, response in enumerate(responses)]
m = Model(documents=docs, weight=TFIDF)
results = m.search(words=lexeme('bristle'), top=100)

Results in:

[image: screenshot of the resulting traceback]

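(If you're wondering why this works in py2: per https://docs.python.org/2/library/stdtypes.html#comparisons, objects of the same type that don't support proper comparison are ordered by their address, so comparing two Document objects never raises. Python 3 dropped that fallback and raises TypeError instead.) [image: screenshot of the relevant docs passage]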
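For anyone who wants to reproduce the failure mode without installing pattern, here is a minimal sketch. It assumes the search ranks (score, document) tuples with heapq.nsmallest, which is my reading of the failure and not a confirmed description of pattern's internals. `Doc` is a hypothetical stand-in for pattern.vector.Document, and the counter-based tie-breaker is one possible workaround, not the library's fix.

```python
import heapq
from itertools import count


class Doc:
    """Hypothetical stand-in for pattern.vector.Document (defines no ordering methods)."""

    def __init__(self, name):
        self.name = name


# Two distinct documents with identical scores, as duplicate texts would produce.
docs = [Doc("1"), Doc("4")]
scored = [(0.5, d) for d in docs]

# On Python 3, the tied scores force a comparison of the Doc objects themselves.
try:
    heapq.nsmallest(2, scored)
    failed = False
except TypeError:
    failed = True

# Possible workaround: add a unique counter so ties never reach the Doc objects.
tie_breaker = count()
decorated = [(score, next(tie_breaker), d) for score, d in scored]
names = [d.name for _, _, d in heapq.nsmallest(2, decorated)]

print(failed, names)
```

The counter keeps tied documents in their original order, which is the same tie-breaking approach the heapq documentation suggests for priority queues holding non-comparable items.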

tuxayo commented 4 years ago

See also #62

Bounty here: https://github.com/clips/pattern/issues/62#issuecomment-391473725