chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

keyword.textrank finds 0 keywords on short text #135

Closed andyhappy1 closed 7 years ago

andyhappy1 commented 7 years ago

Hi,

I am converting user input into a document, then adding it to a corpus, and using keyword.textrank to get keyword(s).

For short text (less than 7 words), keyword.textrank typically results in 0 keywords.

Is there a way for me to fiddle with filtering cutoffs so that keywords can be determined in shorter text?


Here is an example:

    import textacy

    # `corpus` is assumed to be a textacy.Corpus that already holds the earlier documents
    vectorizer = textacy.Vectorizer(
        weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95)
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
         for doc in corpus))

    # Fit a 10-topic NMF model on the document-term matrix
    model = textacy.TopicModel('nmf', n_topics=10)
    model.fit(doc_term_matrix)
    doc_topic_matrix = model.transform(doc_term_matrix)
    doc_topic_matrix.shape

    # The short user input that yields no keywords
    content = u'do you speak any other languages'
    metadata = {
        'title': 'A Search for 2nd-generation Leptoquarks at s = 7 TeV',
        'author': 'Burton DeWilde',
        'pub_date': '2012-08-01'}
    doc = textacy.Doc(content, metadata=metadata)

    print(list(textacy.extract.ngrams(doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))[:1])
    keyword = textacy.keyterms.textrank(doc, n_keyterms=1)
    print([i[0] for i in keyword])

bdewilde commented 7 years ago

Hey @andyhappy1, for such short documents, you could argue that every word (excluding stop words, perhaps) or every noun is a keyword, and that's a much simpler algorithm to implement than TextRank. :)
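As a rough illustration of the "every non-stop word is a keyword" idea, here is a minimal pure-Python sketch. It does not use textacy or spaCy; the `simple_keywords` helper and the tiny `STOP_WORDS` set are hypothetical names made up for this example, and a real pipeline would presumably rely on a proper tokenizer and stop-word list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a full list from an NLP library.
STOP_WORDS = {"a", "an", "and", "any", "are", "do", "is", "of", "other", "the", "to", "you"}


def simple_keywords(text, stop_words=STOP_WORDS):
    """Return the distinct non-stop-word tokens of `text`, lowercased, in order."""
    seen = set()
    keywords = []
    for token in text.lower().split():
        # Strip surrounding punctuation; whitespace splitting is a crude tokenizer.
        word = token.strip(".,;:!?\"'()")
        if word and word not in stop_words and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


print(simple_keywords("do you speak any other languages"))
# prints ['speak', 'languages']
```

For the example sentence from the issue, this leaves only "speak" and "languages", which is arguably all a keyword extractor could hope to return for a text that short.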

I don't think the TextRank algorithm is really applicable here, so the fact that the function returns no keywords may, in a certain sense, be "correct". TextRank builds a network of co-occurring words within a certain window (IIRC set to 2 in the code), but for a 7-word document, this network doesn't contain a lot of information about the relationships between words. You could certainly call keyterms.key_terms_from_semantic_network() and specify a different window_width, but I would not expect better results.
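To make the point about the co-occurrence network concrete, here is a small self-contained sketch of the general TextRank idea: link words that fall in the same sliding window, then score the nodes with PageRank. This is not textacy's implementation; `cooccurrence_graph` and `pagerank` are hypothetical helpers written for this example, and the candidate words are assumed to be already filtered down to non-stop words.

```python
from collections import defaultdict


def cooccurrence_graph(words, window_width=2):
    """Link every pair of distinct words that appear in the same sliding window."""
    graph = defaultdict(set)
    for i in range(len(words)):
        window = words[i:i + window_width]
        for a in window:
            for b in window:
                if a != b:
                    graph[a].add(b)
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph (dict of neighbor sets)."""
    nodes = list(graph)
    if not nodes:
        return {}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor m shares its rank equally among its own neighbors.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Candidate words left from 'do you speak any other languages'
# after removing stop words (an assumption for this sketch).
candidates = ["speak", "languages"]
graph = cooccurrence_graph(candidates, window_width=2)
scores = pagerank(graph)
print(graph)
print(scores)
```

With only two candidate words the graph is a single edge, so both nodes end up with the same score and the ranking cannot prefer one word over the other. A wider window would not change that here, since there are no additional words to connect.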

andyhappy1 commented 7 years ago

Okie doke. I think for short phrases I will use your POS pattern function which returns verbs and NP as keyterms.

On Mon, Oct 23, 2017 at 6:23 PM, Burton DeWilde notifications@github.com wrote:

Closed #135 https://github.com/chartbeat-labs/textacy/issues/135.
