MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Combination of KeyBERT + BERTopic returns an error #168

Open mpoiaganova opened 1 year ago

mpoiaganova commented 1 year ago

Hello,

Not sure if this is an issue with KeyBERT or more with BERTopic, but I am trying to run KeyBERT + BERTopic as explained in the documentation and am getting a ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I am running the exact same two cells as in the documentation, so the problem does not come from my custom data/input. Attaching the error log screenshots. Thanks in advance!

(error log screenshots attached: sh1, sh2)
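
For context, the documentation example being reproduced follows roughly this pattern (a sketch, not the verbatim documented code; the dataset and variable names are illustrative): keywords are extracted per document with KeyBERT, flattened into a vocabulary, and that vocabulary is passed to BERTopic through a CountVectorizer.

```python
from keybert import KeyBERT
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Any list of documents works; 20 newsgroups is only used here for illustration
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Extract keywords per document with KeyBERT ...
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# ... flatten them into a single vocabulary ...
vocabulary = list(set(k[0] for doc_keywords in keywords for k in doc_keywords))

# ... and hand that vocabulary to BERTopic through a CountVectorizer
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
```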

MaartenGr commented 1 year ago

With respect to your code, it is difficult to say without seeing the full picture. What is in vocabulary? How many words are in there? Also, how many documents are you passing to BERTopic? More specifically, it might be that you do not have enough words in the vocabulary for each cluster to actually contain at least one word.
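
A quick way to answer those questions, sketched against the hypothetical `docs` and `vocabulary` names from the example above (the coverage check is only approximate, since it splits on whitespace):

```python
# How large is the vocabulary, and how many documents contain at least one of its words?
vocab_set = set(vocabulary)
print(f"{len(vocab_set)} unique words in the vocabulary")
print(f"{len(docs)} documents passed to BERTopic")

covered = sum(any(token in vocab_set for token in doc.lower().split()) for doc in docs)
print(f"{covered}/{len(docs)} documents contain at least one vocabulary word")
```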

MaartenGr commented 1 year ago

Also, if you are interested in using a KeyBERT-like algorithm in BERTopic, I would advise applying BERTopic's KeyBERTInspired representation model.
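
For reference, KeyBERTInspired plugs into BERTopic as a representation model, roughly like this (a minimal sketch based on the BERTopic documentation; `docs` is assumed to be a list of strings):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Re-rank the candidate topic words with a KeyBERT-like, embedding-based procedure
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```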

rubypnchl commented 1 year ago

> Also, if you are interested in using a KeyBERT-like algorithm in BERTopic, I would advise applying BERTopic's KeyBERTInspired representation model.

Hi Maarten, I am facing a related (though not identical) problem when using KeyBERT + KeyphraseCountVectorizer to generate a vocabulary for BERTopic. It causes two kinds of issues: 1) memory issues, and 2) the kernel crashes even for 20k abstracts (when running on WSL). On Windows it works for up to 100k abstracts. I would like to know: first, can the KeyBERT + KeyphraseCountVectorizer combination be sped up (for 100k abstracts, vocabulary generation took 13 hours)? Second, how can this repeated kernel-death problem be resolved? Below is the code I am using.

import logging
import os
import pickle

import torch
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sentence_transformers import SentenceTransformer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
logging.info("Starting KeyBERT...")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L12-v2", device=device)
sentence_model = sentence_model.to(device)
logging.info(f"Using device: {sentence_model.device}")
kw_model = KeyBERT(sentence_model)

vectorizer_model = KeyphraseCountVectorizer()

# Check if the keyword dump already exists; otherwise extract and cache the keywords
# (YEAR_MONTH and abstracts are defined elsewhere in my script)
keyword_file = f"{YEAR_MONTH}/keywords.dump"
if os.path.exists(keyword_file):
    with open(keyword_file, "rb") as fp:
        keywords = pickle.load(fp)
else:
    keywords = kw_model.extract_keywords(
        abstracts,
        vectorizer=vectorizer_model,
        use_mmr=True,
        keyphrase_ngram_range=(1, 5),
    )
    with open(keyword_file, "wb") as fp:
        pickle.dump(keywords, fp)
logging.info(f"Extracted {len(keywords)} keywords.")

With the above code, the kernel crashes even for 20k abstracts on WSL.

Thanks in advance!

MaartenGr commented 1 year ago

I believe this has to do with how KeyphraseCountVectorizer creates the candidate keywords to be checked, which can be computationally quite expensive. Perhaps looking at the KeyphraseCountVectorizer hyperparameters might help, but I am not quite sure. I would advise sharing your use case in that repository.
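
One hedged workaround for the memory pressure, assuming `kw_model`, `vectorizer_model`, and `abstracts` are defined as in the snippet above, is to extract keywords in chunks and checkpoint each chunk to disk, so a kernel crash does not lose all progress (the chunk size of 1000 is an arbitrary starting point):

```python
import pickle

chunk_size = 1000
keywords = []
for start in range(0, len(abstracts), chunk_size):
    chunk = abstracts[start:start + chunk_size]
    chunk_keywords = kw_model.extract_keywords(chunk, vectorizer=vectorizer_model, use_mmr=True)
    keywords.extend(chunk_keywords)
    # Checkpoint intermediate results so a crash can resume from the last finished chunk
    with open(f"keywords_{start}.dump", "wb") as fp:
        pickle.dump(chunk_keywords, fp)
```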

mpoiaganova commented 1 year ago

> With respect to your code, it is difficult to say without seeing the full picture. What is in vocabulary? How many words are in there? Also, how many documents are you passing to BERTopic? More specifically, it might be that you do not have enough words in the vocabulary for each cluster to actually contain at least one word.

Thanks for the answer, and sorry I was not clear enough. I was running the exact same code as in the documentation (image attached), so the vocabulary was initialized as in that example. I thought that reproducing that example should not result in such an error, or could it?

I also tested with my own documents and vocabulary, making sure the vocabulary contained enough words to cluster the documents, but that failed with the same error as well.

(screenshot attached: Screenshot 2023-03-10 at 20 29 38)

MaartenGr commented 1 year ago

Hmmm, I am not entirely sure what is happening. I'll have to take a look. Either way, I would advise using KeyBERTInspired instead as it is much more optimized for this task and has similar performance. Moreover, I might just remove that piece of the documentation here as KeyBERTInspired was created for just this.

mpoiaganova commented 1 year ago

Ok, I'll use KeyBERTInspired then.

Thanks for the advice and for the effort in creating KeyBERT and BERTopic: great, helpful tools!