Closed shyambhu-mukherjee closed 2 years ago
Hi @shyambhu-mukherjee,
Short answer: it depends on the language you want to process, I guess.
Long answer: pke builds on spacy for text processing, so languages that are available through spacy models should work out of the box.
For example, for German, be sure to install the model first:
python -m spacy download de_core_news_sm
Then you can process your files as simply as:
import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()
# load the content of the document, here document is expected to be a simple
# text string and preprocessing is carried out using spacy
extractor.load_document(input='text', language='de')
# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()
# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()
# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
f.
@boudinfl I tried to use pke for German with the following string.
german_string = "Wir sind eine ganz normale Familie. Ich wohne zusammen mit meinen Eltern, meiner kleinen Schwester Lisa und unserer Katze Mick. Meine Großeltern wohnen im gleichen Dorf wie wir. Oma Francis arbeitet noch. Sie ist Krankenschwester. Die Anderen sind schon in Rente. Oma Lydia nimmt sich viel Zeit für mich und geht häufig mit mir Kleider oder Schuhe kaufen. Leider will meine kleine Schwester dann auch immer mit. Mein Vater arbeitet bei einer Bank und fährt am Wochenende gern mit seinem Motorrad. Das findet meine Mutter nicht so gut, da sie meint, dass Motorradfahren so gefährlich ist. Sie sagt, dass ich und meine Schwester auf keinen Fall mitfahren dürfen. Mein Vater versteht das nicht, aber er will sich auch nicht streiten. Nächstes Jahr wollen wir in ein größeres Haus ziehen, weil meine Eltern noch ein Baby bekommen. Ich hoffe, dass wir nicht zu weit weg ziehen, da alle meine Freunde hier in der Nähe wohnen. Meine Tante Clara, die Schwester meiner Mutter, wohnt sogar genau gegenüber. Meine Cousine Barbara kommt deshalb häufig zu Besuch."
I used the following code:
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input = german_string, language = 'de')
This part of the code worked.
extractor.candidate_selection()
This call runs, but reports the following error:
if set(words).intersection(self.stoplist):
TypeError: 'NoneType' object is not iterable
The code still runs and phrases come up, but I wanted to know why this error occurs and whether it could change the results.
Hey there,
I am having issues using the TfIdf extractor for non-English languages such as German and Spanish.
I made sure to download the spacy models beforehand using spacy download non_english_model.
Since there are no stemmers available for German or Spanish according to spacy, I set normalization=None when calling compute_document_frequency. My code looks like this:
def get_keywords_tfidf(corpus: pd.DataFrame, test_set: list, language: str, spacy_model="en_core_web_sm") -> list:
    # load spacy model
    nlp = spacy.load(spacy_model)
    # get document frequencies from whole corpus
    if language != 'en':
        print('not english -> no stemming')
        compute_document_frequency(documents=corpus, output_file=f'data/df_{language}.df.gz', language=language, n=3, normalization=None)
    else:
        compute_document_frequency(documents=corpus, output_file=f'data/df_{language}.df.gz', language=language, n=3)
    df = load_document_frequency_file(input_file=f'data/df_{language}.df.gz')
    # start timer
    t = process_time()
    keywords = []
    extractor = TfIdf()
    for i, doc in tqdm(enumerate(test_set)):
        doc = preprocess_text_gensim(doc, language)
        extractor.load_document(input=doc, language='en')
        extractor.candidate_selection(n=4)
        extractor.candidate_weighting(df=df)
        keywords.append([u for u,v in extractor.get_n_best(n=20, stemming=False)])
    # end timer
    elapsed_time = round(process_time() - t, 0)
    print(f"Process time of TF*IDF keywords extractor: {elapsed_time} secs.")
    return keywords, elapsed_time

# Run keyword extraction for german docs
keywords_de, duration_de = get_keywords_tfidf(df_de, test_set_de, lang, spacy_model_de)
Calling the function for English works fine, but with German and Spanish I get the error message TypeError: 'NoneType' object is not iterable when calling compute_document_frequency.
Can you tell me whether I'm making a mistake in my code, or whether these languages are not supported by spacy or by the pke TfIdf class?
Hi @m-janyell0w,
thanks for your detailed issue. I wonder which line in the traceback triggered the TypeError: 'NoneType' in compute_document_frequency.
Also, I don't think that compute_document_frequency takes a pd.DataFrame as input; it should be a list of spacy.Doc or a list of strings. Please try changing compute_document_frequency(documents=corpus,... to compute_document_frequency(documents=corpus['the column containing text'].tolist(), ...
Sorry for the late answer.
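The conversion suggested above can be sketched without pandas so that it runs standalone. This is only an illustration: the helper name to_document_list is hypothetical and not part of pke, and it assumes the column behaves like a pandas Series (it exposes tolist()) or is any other iterable of texts.

```python
def to_document_list(column):
    """Return a plain list of strings from a column-like object or iterable.

    Hypothetical helper for illustration; not part of pke.
    """
    if isinstance(column, str):
        # A single string is one document, not an iterable of characters.
        return [column]
    # pandas Series and numpy arrays expose tolist(); otherwise use list().
    values = column.tolist() if hasattr(column, "tolist") else list(column)
    # Drop missing entries: None, and NaN (the only value where v != v).
    return [str(v) for v in values if v is not None and v == v]


class FakeColumn:
    """Stand-in for a DataFrame column, used only to keep the example standalone."""

    def __init__(self, values):
        self._values = values

    def tolist(self):
        return list(self._values)


documents = to_document_list(FakeColumn(["Erster Text.", None, "Zweiter Text."]))
print(documents)
```

The resulting list is what would be passed as the documents argument in place of the DataFrame itself.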
Thank you for pointing out the wrong datatype given to compute_document_frequency. But even after changing it to a list of documents, the error remains. The most recent traceback points to this line:
--> 421 if set(words).intersection(self.stoplist):
I did not specify a stoplist in my function. Could it be that there is no stopword list available for German, which causes this NoneType error?
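A minimal sketch of how that line can fail, assuming the stoplist really is None at that point (this is not confirmed in the thread). The two functions below are hypothetical stand-ins for the check in the traceback, not pke code; they only show that intersecting a set with None raises this exact TypeError, and that falling back to an empty list avoids it.

```python
def overlaps_stoplist(words, stoplist):
    """Mimic the check from the traceback: is any word in the stoplist?"""
    return bool(set(words).intersection(stoplist))


def safe_overlaps_stoplist(words, stoplist=None):
    """Same check, but treat a missing stoplist as an empty one."""
    return bool(set(words).intersection(stoplist or []))


words = ["die", "Katze"]

# With stoplist=None, set.intersection tries to iterate over None and fails.
try:
    overlaps_stoplist(words, None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# With the fallback, a missing stoplist simply filters nothing.
print(safe_overlaps_stoplist(words))
print(safe_overlaps_stoplist(words, ["die", "der", "das"]))
```

If the assumption holds, supplying an explicit list of German stopwords is one thing to try, though that has not been verified against pke here.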
Hi! Thanks for building this awesome package. I wanted to understand what changes would be needed to use it for non-English languages, such as European languages (German, Polish, etc.) and Asian languages (Hindi, Chinese, etc.). I would love a basic checklist and suggestions on how to get started.