boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

tf-idf with lemmatizer #75

Closed: pitskod closed this issue 5 years ago

pitskod commented 5 years ago

For tf-idf, there is no way to compute TF over the lemmatized form of a word (TF can only be counted for stemmed words or for words with no normalization). Maybe the "# word normalization" section of the load_file method needs an extra condition for lemmatization, like:

    elif self.normalization == 'lemmatization':
        for i, sentence in enumerate(self.sentences):
            self.sentences[i].stems = sentence.stems

boudinfl commented 5 years ago

Hi @pitskod,

Sorry for the very late response. pke does allow computing TF for lemmatized words: set the normalization parameter to 'lemmatization' in the load_document() method.

import pke

text = '''pke is an open source python-based keyphrase extraction toolkit.'''

extractor = pke.unsupervised.TopicRank()

# no normalization: stems holds the surface word forms
extractor.load_document(input=text, language='en', normalization=None)
print(extractor.sentences[0].stems)
# > ['pke', 'is', 'an', 'open', 'source', 'python', '-', 'based', 'keyphrase', 'extraction', 'toolkit', '.']

# stemming: stems holds the stemmed word forms
extractor.load_document(input=text, language='en', normalization='stemming')
print(extractor.sentences[0].stems)
# > ['pke', 'is', 'an', 'open', 'sourc', 'python', '-', 'base', 'keyphras', 'extract', 'toolkit', '.']

# lemmatization: stems holds the lemmas
extractor.load_document(input=text, language='en', normalization='lemmatization')
print(extractor.sentences[0].stems)
# > ['pke', 'be', 'an', 'open', 'source', 'python', '-', 'base', 'keyphrase', 'extraction', 'toolkit', '.']
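
To make the link to term frequency concrete, here is a minimal sketch of counting TF over lemmatized tokens. It is not pke's own TF-IDF code: term_frequencies is a hypothetical helper, the punctuation filter is an assumption, and the token list is hard-coded from the lemmatization output above so that the snippet runs without pke installed.

```python
from collections import Counter

# Lemmatized tokens copied from the output above (normalization='lemmatization').
lemmas = ['pke', 'be', 'an', 'open', 'source', 'python', '-', 'base',
          'keyphrase', 'extraction', 'toolkit', '.']


def term_frequencies(sentences):
    """Count how often each normalized token occurs across all sentences.

    Hypothetical helper for illustration only, not part of pke.
    """
    counts = Counter()
    for tokens in sentences:
        # Assumption: tokens without any letter or digit ('-', '.') are skipped.
        counts.update(t for t in tokens if any(c.isalnum() for c in t))
    return counts


tf = term_frequencies([lemmas])
print(tf['base'], tf['be'], tf['.'])
```

With a loaded document, the same helper could presumably be fed the per-sentence lists directly, e.g. term_frequencies(s.stems for s in extractor.sentences), since stems holds the lemmas when normalization='lemmatization'.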

Please let me know if you encounter any issue with that.

f.