Closed shyambhu-mukherjee closed 2 years ago
To get longer keyphrases, you can try using ngram_selection with a high n instead of the default candidate_selection.
Please note that keyphrases are generally 2 to 3 tokens long. Maybe what you are trying to extract is not keyphrases, and another tool or NLP task would be better suited.
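A minimal sketch of that suggestion, in case it helps. `extract_long_keyphrases` is a hypothetical helper name, not part of pke. The extractor is passed in so any model can be tried, for example `pke.unsupervised.MultipartiteRank()`. The `candidate_filtering` call is an assumption added here to prune noisy n-grams; it is not part of the advice above.

```python
def extract_long_keyphrases(extractor, text, language, max_words=5, top_n=10):
    """Rank n-gram candidates of up to max_words tokens.

    Hypothetical helper: `extractor` is expected to be a pke model
    instance such as pke.unsupervised.MultipartiteRank().
    """
    extractor.load_document(input=text, language=language, normalization=None)

    # Use n-grams with a high n instead of the model's default
    # candidate_selection().
    extractor.ngram_selection(n=max_words)

    # Assumption (not from the thread): prune noisy n-grams with pke's
    # generic filtering step, raising its word limit to match max_words.
    extractor.candidate_filtering(maximum_word_number=max_words)

    # Candidates must be weighted before they can be ranked.
    extractor.candidate_weighting()
    return extractor.get_n_best(n=top_n, stemming=False)


# Example call (requires pke and a spaCy model for the language):
#   import pke
#   extract_long_keyphrases(pke.unsupervised.MultipartiteRank(), text, "en")
```

Whether the longer candidates are actually useful depends on the text, so it may be worth comparing a few values of `max_words`.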
Hi there,
I want to extract single keywords (unigrams, one could say) from website texts using pke. The following code snippet results in the keywords variable being an empty list.
My data is a pd.DataFrame with a number of rows and two columns (the second one contains the document for keyword extraction).
row = 1
sample_de = test_set_df_de.iloc[row, 1]
# initialize keyphrase extractor
extractor = pke.unsupervised.SingleRank()
# load the document for extraction
extractor.load_document(input=sample_de, language='de', normalization=None)
extractor.grammar_selection(grammar=r"NP:{<NOUN|PROPN|ADJ>+}")
extractor.ngram_selection(n=1)
# Select n best keywords
keywords = extractor.get_n_best(n=10, stemming=False)
Even for n>1 no keywords are extracted.
Previously I also tried calling the .candidate_weighting method, which, however, resulted in a KeyError :(.
I would be grateful if you could point me in the right direction. Thanks! :) Micha
Hi @m-janyell0w, sorry for the late response; I missed your message.
SingleRank only works with candidates composed of NOUN, PROPN and ADJ tokens, which is why the KeyError is raised. This could be made clearer, or SingleRank could be changed to allow any POS tag. Thanks for raising this issue!
As an alternative, please try using PositionRank or MultipartiteRank.
You should always call extractor.candidate_weighting so that extractor.get_n_best can output something.
For the candidate selection, you should only use one technique (each one removes the previously selected candidates). To get n-grams of an exact length, you can alter the grammar like this: "NP:{<NOUN|PROPN|ADJ>}" for 1-grams (without the +, which means one or more), "NP:{<NOUN|PROPN|ADJ>{3}}" for 3-grams, etc. (and maybe even {2,5} for 2 to 5 grams?)
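Putting the points above together, a minimal sketch could look like this. `grammar_for_length` and `extract_keywords` are hypothetical helper names, not part of pke. The extractor is passed in so that PositionRank, MultipartiteRank, or another model can be swapped in. The `{2,5}` range form is the tentative suggestion from above and has not been verified here.

```python
POS_TAGS = "NOUN|PROPN|ADJ"  # tags used in the grammar from this thread


def grammar_for_length(min_len=1, max_len=None):
    """Build a grammar string matching runs of min_len..max_len tokens.

    Hypothetical helper (not part of pke); it only formats the string.
    """
    if max_len is None:
        max_len = min_len
    if min_len < 1 or max_len < min_len:
        raise ValueError("need 1 <= min_len <= max_len")
    if min_len == max_len == 1:
        quantifier = ""  # exactly one token: no '+' and no quantifier
    elif min_len == max_len:
        quantifier = "{%d}" % min_len
    else:
        quantifier = "{%d,%d}" % (min_len, max_len)
    return "NP:{<%s>%s}" % (POS_TAGS, quantifier)


def extract_keywords(extractor, text, language, grammar, top_n=10):
    """Run one selection step, then weighting, then ranking."""
    extractor.load_document(input=text, language=language, normalization=None)

    # Use a single candidate selection technique: each call replaces
    # the candidates chosen by the previous one.
    extractor.grammar_selection(grammar=grammar)

    # Weighting has to run before get_n_best can return anything.
    extractor.candidate_weighting()
    return extractor.get_n_best(n=top_n, stemming=False)


# Example call for unigrams (requires pke and a German spaCy model):
#   import pke
#   extract_keywords(pke.unsupervised.MultipartiteRank(), sample_de, "de",
#                    grammar_for_length(1))
```

Here `grammar_for_length(1)` yields the 1-gram grammar and `grammar_for_length(3)` the 3-gram one, so the candidate length can be changed without editing the grammar string by hand.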
I am using the multipartite and topical rank methods for phrase extraction and was wondering whether there is a parameter I can adjust to get longer phrases. I would appreciate any suggestions. @boudinfl