boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

Support for non-english language #192

Closed shyambhu-mukherjee closed 2 years ago

shyambhu-mukherjee commented 2 years ago

Hi! Thanks for building this awesome package. I wanted to understand what changes would be needed to use it for non-English languages, such as European languages (German, Polish, etc.) and/or Asian languages such as Hindi or Chinese. I would love a basic checklist and suggestions on how to get started.

boudinfl commented 2 years ago

Hi @shyambhu-mukherjee,

Short answer: it depends on the language you want to process, I guess.

Long answer: pke builds on spaCy for text processing, so languages that are available through spaCy models should work out of the box.

For German, for example, be sure to install the model first:

python -m spacy download de_core_news_sm

Then you can process your files as simply as:

import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be a simple 
# test string and preprocessing is carried out using spacy
extractor.load_document(input='text', language='de')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

f.

shyambhu-mukherjee commented 2 years ago

@boudinfl I tried to use pke for German with the following string:

german_string = "Wir sind eine ganz normale Familie. Ich wohne zusammen mit meinen Eltern, meiner kleinen Schwester Lisa und unserer Katze Mick. Meine Großeltern wohnen im gleichen Dorf wie wir. Oma Francis arbeitet noch. Sie ist Krankenschwester. Die Anderen sind schon in Rente. Oma Lydia nimmt sich viel Zeit für mich und geht häufig mit mir Kleider oder Schuhe kaufen. Leider will meine kleine Schwester dann auch immer mit. Mein Vater arbeitet bei einer Bank und fährt am Wochenende gern mit seinem Motorrad. Das findet meine Mutter nicht so gut, da sie meint, dass Motorradfahren so gefährlich ist. Sie sagt, dass ich und meine Schwester auf keinen Fall mitfahren dürfen. Mein Vater versteht das nicht, aber er will sich auch nicht streiten. Nächstes Jahr wollen wir in ein größeres Haus ziehen, weil meine Eltern noch ein Baby bekommen. Ich hoffe, dass wir nicht zu weit weg ziehen, da alle meine Freunde hier in der Nähe wohnen. Meine Tante Clara, die Schwester meiner Mutter, wohnt sogar genau gegenüber. Meine Cousine Barbara kommt deshalb häufig zu Besuch."

I used the following code:

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input = german_string, language = 'de')

This part of the code worked. Calling extractor.candidate_selection() then produces the following error:

    if set(words).intersection(self.stoplist):
TypeError: 'NoneType' object is not iterable

The code still runs and phrases come up, but I wanted to know why this error occurs and whether it might change the result.

m-janyell0w commented 2 years ago

Hey there,

I am having issues using the TfIdf extractor for non-English languages, such as German and Spanish.

I made sure to download the spaCy models beforehand using spacy download non_english_model.

Since, according to spaCy, there are no stemmers available for German or Spanish, I set normalization=None when calling compute_document_frequency. My code looks like this:

def get_keywords_tfidf(corpus: pd.DataFrame, test_set: list, language: str, spacy_model="en_core_web_sm") -> list:

        # load spacy model
        nlp = spacy.load(spacy_model)

        # get document frequencies from whole corpus
        if language != 'en':
                print('not english -> no stemming')
                compute_document_frequency(documents=corpus, output_file=f'data/df_{language}.df.gz', language=language, n=3, normalization=None)
        else:
                compute_document_frequency(documents=corpus, output_file=f'data/df_{language}.df.gz', language=language, n=3)
        df = load_document_frequency_file(input_file=f'data/df_{language}.df.gz')

        # start timer
        t = process_time()

        keywords = []
        extractor = TfIdf()

        for i, doc in tqdm(enumerate(test_set)):
                doc = preprocess_text_gensim(doc, language)
                extractor.load_document(input=doc, language='en')
                extractor.candidate_selection(n=4)
                extractor.candidate_weighting(df=df)
                keywords.append([u for u,v in extractor.get_n_best(n=20, stemming=False)])

        # end timer
        elapsed_time = round(process_time() - t, 0)
        print(f"Process time of TF*IDF keywords extractor: {elapsed_time} secs.")

        return keywords, elapsed_time

# Run keyword extraction for german docs
keywords_de, duration_de = get_keywords_tfidf(df_de, test_set_de, lang, spacy_model_de)

Calling the function for English works fine, but with German and Spanish I get the error message TypeError: 'NoneType' object is not iterable when calling compute_document_frequency.

Can you tell me whether I'm making a mistake in my code, or whether these languages are not supported by spaCy / the pke TfIdf class?

ygorg commented 2 years ago

Hi @m-janyell0w, thanks for your detailed issue. I wonder which line in the traceback triggered the TypeError: 'NoneType' in compute_document_frequency.

Also, I don't think that compute_document_frequency takes a pd.DataFrame as input; it should be a list of spacy.Doc or a list of strings. Please try changing compute_document_frequency(documents=corpus, ... to compute_document_frequency(documents=corpus['the column containing text'].tolist(), ...
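
The input-type point above can be checked before the corpus is handed to compute_document_frequency. What follows is only a minimal sketch and not part of pke: as_text_list and FakeTable are hypothetical names introduced for illustration, and the helper assumes the documents are plain strings (it does not handle spacy.Doc objects). It rejects table-like inputs because iterating over a pandas DataFrame yields its column labels rather than its rows.

```python
def as_text_list(documents):
    """Coerce `documents` to a list of str, rejecting table-like inputs."""
    if isinstance(documents, str):
        raise TypeError("expected a collection of texts, got a single string")
    if hasattr(documents, "columns"):
        # Iterating a pandas DataFrame yields column labels, not row texts.
        raise TypeError("got a table; select one text column first")
    if hasattr(documents, "tolist"):
        # pandas Series and NumPy arrays convert to plain lists this way.
        documents = documents.tolist()
    texts = list(documents)
    if not all(isinstance(text, str) for text in texts):
        raise TypeError("every document must be a string")
    return texts


class FakeTable:
    """Hypothetical stand-in for a DataFrame so the sketch runs without pandas."""
    columns = ["id", "text"]


texts = as_text_list(["Erster Text.", "Zweiter Text."])

try:
    as_text_list(FakeTable())
    table_error = None
except TypeError as exc:
    table_error = str(exc)

print(texts)
print(table_error)
```

With a real DataFrame, the equivalent call would be as_text_list(corpus['the column containing text']), which goes through the tolist() branch.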

m-janyell0w commented 1 year ago

Sorry for the late answer. Thank you for pointing out the wrong datatype given to compute_document_frequency. But even after changing it to a list of documents, the error remains. The most recent traceback points to this line:

--> 421 if set(words).intersection(self.stoplist):

I did not specify a stoplist in my function. Could it be that there is no stopword list available for German, which causes this NoneType error?
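
The traceback line can be reproduced outside pke, which gives a quick way to test the missing-stoplist hypothesis. The following is a minimal sketch in plain Python, assuming that self.stoplist is None when the check runs. The function names and the tiny German stoplist are made up for illustration and are not pke resources. Note that set(words) raises the same message if words itself is None, so the sketch supports the hypothesis without proving it.

```python
def contains_stopword(words, stoplist):
    """Same shape as the failing check: is any word in the stoplist?"""
    return bool(set(words).intersection(stoplist))


def contains_stopword_safe(words, stoplist=None):
    """Variant that treats a missing stoplist as an empty one."""
    return bool(set(words).intersection(stoplist or ()))


words = ["eine", "normale", "Familie"]

# With stoplist=None the original check fails, as in the traceback.
try:
    contains_stopword(words, None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Illustrative stoplist only; a real one would come from a language resource.
german_stoplist = {"eine", "und", "der"}

print(error_message)
print(contains_stopword_safe(words, None))
print(contains_stopword_safe(words, german_stoplist))
```

If the stoplist does turn out to be the cause, supplying an explicit list of stopwords when loading the document should avoid the error. Whether and how pke accepts such an argument depends on the installed version, so check the signatures of load_document and compute_document_frequency first.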