boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

Issue on Multipartite Rank #46

Closed edoost closed 6 years ago

edoost commented 6 years ago

Hi.

I am using Multipartite Rank to extract keyphrases from Persian documents, and I have two questions:

  1. The paper states that candidates matching the pattern (/adj* noun+/) are selected. In Persian, adjectives appear after nouns. How can I make candidate selection work correctly in this case?

  2. Topics are selected based on the stems of the words. How should I input the stems when I'm using the 'preprocessed' mode to read the documents?

Thanks

boudinfl commented 6 years ago

Hi @edoost,

For (1), you should set a new pattern for selecting the candidates. For (2), the nltk stemmer is used by default, and Persian is not currently supported. So there are two solutions: write a new input reader method based on read_raw_document(), or use a CoreNLP input file in which the stems are placed as lemmas and pass the use_lemmas=True option to read_document().

Below is a snippet of code that summarizes this:

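import pke

# input_file is assumed to be the path to a CoreNLP file whose lemmas hold the stems
extractor = pke.unsupervised.MultipartiteRank(input_file=input_file)
# read the CoreNLP file, using the lemmas instead of nltk stems
extractor.read_document(format='corenlp', use_lemmas=True)
# select candidates made of noun(s) followed by optional adjective(s)
extractor.grammar_selection(grammar="NP: {<NN.*>+<JJ.*>*}")
extractor.candidate_weighting()
# print the 5 best candidates with their scores
keyphrases = extractor.get_n_best(n=5)
for u, v in keyphrases:
    print(u, v)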
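
To see what the noun-first pattern above selects, here is a minimal standalone sketch that mimics it on a list of (word, tag) pairs. It is not pke's implementation: noun_adj_chunks is a hypothetical helper, the tokens are made-up transliterated Persian words, and it assumes Penn-style tags (NN*, JJ*), which your Persian tagger may not produce.

```python
import re


def noun_adj_chunks(tagged):
    """Return maximal 'noun+ adjective*' spans from (word, tag) pairs.

    Hypothetical helper for illustration only; not part of pke.
    """
    # Encode each tag as one character so a regex can scan the tag sequence.
    codes = "".join(
        "N" if tag.startswith("NN") else "J" if tag.startswith("JJ") else "O"
        for _, tag in tagged
    )
    chunks = []
    # N+J* mirrors the grammar {<NN.*>+<JJ.*>*}: nouns first, then adjectives.
    for match in re.finditer(r"N+J*", codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


# Toy input: made-up tokens with assumed Penn-style tags.
tagged = [
    ("ketab", "NN"), ("bozorg", "JJ"), ("ast", "VB"),
    ("daneshgah", "NN"), ("tehran", "NNP"),
]
print(noun_adj_chunks(tagged))
```

An adjective that precedes a noun is left out of the candidate, which is the behaviour you want when adjectives follow nouns. If your tagger uses a different tagset, the tag names in the grammar need to be adapted accordingly.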

f.

edoost commented 6 years ago

@boudinfl Thank you very much. It's working.