boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

How to manipulate length of key-phrases? #205

Closed shyambhu-mukherjee closed 2 years ago

shyambhu-mukherjee commented 2 years ago

I am applying the multipartite and topical rank methods for phrase extraction and was wondering whether there is a parameter I can adjust to get longer phrases. I would appreciate any suggestions. @boudinfl

ygorg commented 2 years ago

To get longer keyphrases, you can try using ngram_selection with a high n instead of the default candidate_selection. Please note that keyphrases are generally 2 to 3 tokens long. Maybe what you are trying to extract is not keyphrases, and another tool or NLP task would be better suited.
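A minimal sketch of that suggestion, under some assumptions: MultipartiteRank is used as the model, candidate_filtering is called with maximum_word_number raised to match n, and the function and helper names (extract_long_keyphrases, keep_long) are made up for illustration and are not part of pke. It also assumes pke and the spaCy model for the chosen language are installed. The post-filter on token count is an extra step, not something pke does by itself.

```python
def keep_long(keyphrases, min_tokens=2):
    """Keep only (phrase, score) pairs whose phrase has at least min_tokens tokens."""
    return [(phrase, score) for phrase, score in keyphrases
            if len(phrase.split()) >= min_tokens]


def extract_long_keyphrases(text, language="en", max_len=5, min_tokens=2, top_k=10):
    """Hypothetical helper: rank n-gram candidates of up to max_len tokens."""
    import pke  # imported lazily so keep_long() is usable without pke installed

    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(input=text, language=language, normalization=None)

    # Select every n-gram up to max_len tokens as a candidate ...
    extractor.ngram_selection(n=max_len)
    # ... then drop noisy candidates; the word limit is raised to match max_len.
    extractor.candidate_filtering(maximum_word_number=max_len)

    # Weighting must run before get_n_best, otherwise nothing is scored.
    extractor.candidate_weighting()
    ranked = extractor.get_n_best(n=len(extractor.candidates), stemming=False)

    # Assumption: short phrases are simply discarded after ranking.
    return keep_long(ranked, min_tokens)[:top_k]
```

Something like extract_long_keyphrases(document_text, max_len=5, min_tokens=3) would then return only phrases of three or more tokens, although very long candidates tend to be noisier than the usual 2 to 3 token keyphrases.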

m-janyell0w commented 2 years ago

Hi there,

I want to extract single keywords (unigrams, one could say) from website texts using pke. The following code snippet leaves the keywords variable as an empty list. My data is a pandas DataFrame with a number of rows and two columns (the second one contains the document for keyword extraction).

row = 1
sample_de = test_set_df_de.iloc[row, 1]

# initialize keyphrase extractor
extractor = pke.unsupervised.SingleRank()

# load the document for extraction
extractor.load_document(input=sample_de, language='de', normalization=None)
extractor.grammar_selection(grammar=r"NP:{<NOUN|PROPN|ADJ>+}")
extractor.ngram_selection(n=1)

# Select n best keywords
keywords = extractor.get_n_best(n=10, stemming=False)

Even for n>1, no keywords are extracted. Previously I also tried calling the .candidate_weighting method, which, however, resulted in a KeyError :(.

I would be grateful if you could point me in the right direction. Thanks! :) Micha

ygorg commented 1 year ago

Hi @m-janyell0w, sorry for the late response, I missed your message.

SingleRank only works with candidates composed of NOUN, PROPN and ADJ. This is why a KeyError is raised (ngram_selection does not filter by POS tag, so it can select candidates containing other words). This could be made clearer, or SingleRank could be made to allow any POS tag. Thanks for raising this issue! As an alternative, please try using PositionRank or MultipartiteRank.

You should always call extractor.candidate_weighting before extractor.get_n_best; otherwise the candidates have no scores and get_n_best has nothing to output.

For candidate selection, you should use only one technique (each one removes the previously selected candidates). To get exactly $n$-grams, you can alter the grammar like this: "NP:{<NOUN|PROPN|ADJ>}" for 1-grams (without the +, which means one or more), "NP:{<NOUN|PROPN|ADJ>{3}}" for 3-grams, etc. (and perhaps even {2,5} for 2- to 5-grams?)
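Putting these points together, a rough sketch could look like the following. The helper names (ngram_grammar, extract_exact_ngrams) are made up for illustration and are not part of pke; MultipartiteRank is picked as one of the suggested alternatives, and the {2,5} range form is untested here, just as in the suggestion above. It assumes pke and the German spaCy model are installed.

```python
POS_TAGS = "NOUN|PROPN|ADJ"


def ngram_grammar(min_len, max_len=None):
    """Build a chunk grammar matching sequences of min_len to max_len tags."""
    if max_len is None:
        max_len = min_len
    if not 1 <= min_len <= max_len:
        raise ValueError("expected 1 <= min_len <= max_len")
    if min_len == max_len == 1:
        quantifier = ""  # no '+', so exactly one token
    elif min_len == max_len:
        quantifier = "{%d}" % min_len
    else:
        quantifier = "{%d,%d}" % (min_len, max_len)
    return "NP:{<" + POS_TAGS + ">" + quantifier + "}"


def extract_exact_ngrams(text, language="de", min_len=1, max_len=None, top_k=10):
    """Hypothetical helper: one selection step, then weighting, then ranking."""
    import pke  # imported lazily so ngram_grammar() works without pke installed

    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(input=text, language=language, normalization=None)

    # Use a single selection technique; a second call would replace the first.
    extractor.grammar_selection(grammar=ngram_grammar(min_len, max_len))

    # Always weight the candidates before asking for the best ones.
    extractor.candidate_weighting()
    return extractor.get_n_best(n=top_k, stemming=False)
```

For the question above, extract_exact_ngrams(sample_de, language="de", min_len=1) would request unigrams only, and min_len=3 would request 3-grams.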