Closed: andyhappy1 closed this issue 7 years ago
Hey @andyhappy1, for such short documents you could argue that every word (excluding stop words, perhaps) or every noun is a keyword, and that's a much simpler algorithm to implement than TextRank. :)
I don't think the TextRank algorithm is really applicable here, so the fact that the function returns no keywords may, in a certain sense, be "correct". TextRank builds a network of co-occurring words within a certain window (IIRC set to 2 in the code), but for a 7-word document this network doesn't contain much information about the relationships between words. You could certainly call keyterms.key_terms_from_semantic_network() and specify a different window_width, but I would not expect better results.
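For reference, a minimal sketch of what that call might look like, assuming a textacy version from around this time where key_terms_from_semantic_network() accepts window_width and n_keyterms arguments (the parameter values here are only illustrative):

import textacy

doc = textacy.Doc(u'do you speak any other languages')
# widen the co-occurrence window beyond the default of 2 and cap the number of results
terms = textacy.keyterms.key_terms_from_semantic_network(doc, window_width=4, n_keyterms=3)
print(terms)  # expected to be (term, score) pairs like textrank; may still be empty for very short docs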
Okie doke. I think for short phrases I will use your POS pattern function, which returns verbs and noun phrases (NPs) as keyterms.
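For later readers, a hedged sketch of that POS-pattern approach, assuming textacy's extract.pos_regex_matches() helper and the bundled POS_REGEX_PATTERNS constants with 'NP' and 'VP' patterns for English (exact names may differ across versions):

import textacy
from textacy import extract
from textacy.constants import POS_REGEX_PATTERNS

doc = textacy.Doc(u'do you speak any other languages')
# pull noun-phrase and verb-phrase spans using the built-in English POS regexes
noun_phrases = [span.text for span in extract.pos_regex_matches(doc, POS_REGEX_PATTERNS['en']['NP'])]
verb_phrases = [span.text for span in extract.pos_regex_matches(doc, POS_REGEX_PATTERNS['en']['VP'])]
print(noun_phrases + verb_phrases)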
Hi,
I am converting user input into a document, adding it to a corpus, and using textacy.keyterms.textrank to get keyword(s).
For short text (fewer than 7 words), textrank typically returns 0 keywords.
Is there a way for me to fiddle with the filtering cutoffs so that keywords can be determined for shorter text?
Here is an example:
import textacy

# `corpus` is assumed to already hold the documents built from user input
vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform(
    (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True) for doc in corpus))

model = textacy.TopicModel('nmf', n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix.shape

content = u'do you speak any other languages'
metadata = {'title': 'A Search for 2nd-generation Leptoquarks at s = 7 TeV', 'author': 'Burton DeWilde', 'pub_date': '2012-08-01'}
doc = textacy.Doc(content, metadata=metadata)

print(list(textacy.extract.ngrams(doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))[:1])
keyword = textacy.keyterms.textrank(doc, n_keyterms=1)
print([i[0] for i in keyword])
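Given the answer above, here is a hedged sketch of one possible fallback for very short inputs: if textrank comes back empty, treat every filtered unigram as a keyterm, following the suggestion that for short documents nearly every non-stopword qualifies (the cutoff values are illustrative):

keyterm_scores = textacy.keyterms.textrank(doc, n_keyterms=3)
keyterms = [term for term, score in keyterm_scores]
if not keyterms:
    # fall back to non-stopword, non-punctuation unigrams for very short docs
    keyterms = [span.text for span in textacy.extract.ngrams(doc, 1, filter_stops=True, filter_punct=True)]
print(keyterms)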