Closed shengbo-ma closed 1 year ago
Thank you for your kind words! The seed_keywords parameter defines a set of keywords toward which you would like the documents to be guided. Although this parameter can be used with a batch of documents, that only works if you want all documents guided toward a single set of terms, not toward a different set of terms for each document. It would definitely be nice if the latter were enabled.
I am pretty interested and willing to see if I could contribute. Are there any suggestions on how pull requests should be made?
Feel free to open up a PR. Simplicity within KeyBERT is key: if it is something that can be implemented in a few straightforward lines, then that would be preferred. We are aiming here for minimal and elegant implementations.
With respect to the implementation, it might be helpful to automatically detect whether seed_keywords contains a nested list and/or if it is the same length as the documents. So perhaps something like this:
# Extract embeddings
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        seed_embeddings = self.model.embed([" ".join(seed_keywords)])
    else:
        seed_embeddings = [self.model.embed([" ".join(keywords)]) for keywords in seed_keywords]

# Guided KeyBERT with seed keywords
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        doc_embedding = np.average(
            [doc_embedding, seed_embeddings], axis=0, weights=[3, 1]
        )
    else:
        doc_embedding = np.average(
            [doc_embedding, seed_embeddings[index]], axis=0, weights=[3, 1]
        )

# Guided KeyBERT with seed keywords <- This might be a more elegant solution
if seed_keywords is not None:
    # A flat list yields one shared embedding (take all of it);
    # a nested list yields one embedding per document (pick by index).
    seed_index = slice(None) if isinstance(seed_keywords[0], str) else index
    doc_embedding = np.average(
        [doc_embedding, seed_embeddings[seed_index]], axis=0, weights=[3, 1]
    )
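
For anyone who wants to try the idea outside of KeyBERT first, here is a minimal, self-contained sketch of the same logic. It is only an illustration under stated assumptions: `embed` is a toy stand-in for the embedding backend (not KeyBERT's model), and `embed_seeds` and `guide` are hypothetical helper names, not part of the library. It detects a flat versus nested `seed_keywords`, checks the nested length against the number of documents, and applies the same 3:1 weighted average.

```python
import numpy as np


def embed(texts, dim=8):
    """Toy stand-in for an embedding backend (NOT KeyBERT's model).

    Returns one deterministic pseudo-random vector per input string.
    """
    rows = []
    for text in texts:
        seed = sum(ord(ch) for ch in text)  # deterministic per string
        rows.append(np.random.default_rng(seed).normal(size=dim))
    return np.vstack(rows)


def embed_seeds(seed_keywords, n_docs):
    """Return a (1, dim) array for a flat list, (n_docs, dim) for a nested one."""
    if isinstance(seed_keywords[0], str):
        # One shared set of seed keywords for every document.
        return embed([" ".join(seed_keywords)])
    if len(seed_keywords) != n_docs:
        raise ValueError("Nested seed_keywords needs one list per document")
    # One set of seed keywords per document.
    return embed([" ".join(keywords) for keywords in seed_keywords])


def guide(doc_embeddings, seed_embeddings, weights=(3, 1)):
    """Blend each document embedding with its seed embedding (weighted mean)."""
    # Broadcasting repeats a single (1, dim) seed row for all documents.
    seeds = np.broadcast_to(seed_embeddings, doc_embeddings.shape)
    return np.average([doc_embeddings, seeds], axis=0, weights=weights)


docs = ["supervised learning needs labels", "the stock market fell today"]
doc_embeddings = embed(docs)

# Flat list: every document is guided toward the same seed keywords.
shared = guide(doc_embeddings, embed_seeds(["learning"], len(docs)))

# Nested list: each document is guided toward its own seed keywords.
per_doc = guide(
    doc_embeddings,
    embed_seeds([["learning"], ["finance", "stocks"]], len(docs)),
)
```

Stacking the seed embeddings into one array, rather than keeping a Python list, lets the flat and nested cases share a single averaging step, which might keep the change to a few lines.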
Sounds great! Thanks for the guide.
I will play with it and send a pull request.
Hi @MaartenGr ,
Thanks for your effort in KeyBERT. Super helpful!
I am extracting keywords using KeyBERT on GPU. In my case, each doc comes with a few seed keywords that differ among docs. I would like to expand the seed keywords so that I have more keywords for each doc.
Thanks to Guided KeyBERT, this can be done perfectly when feeding one doc and its keywords at a time. Since the number of docs is quite large, I speed up the extraction with a GPU, where batch feeding is strongly preferred, as mentioned in the KeyBERT FAQ. However, I noticed that Guided KeyBERT works only when fed a single doc, not a batch of docs: when feeding a list of docs together with their seed keywords, it does not work, since seed_keywords is expected to be a string or a list of strings, not a nested list, as shown below.
A sample error message:
It would be cool if Guided KeyBERT could be enabled for batch feeding. My questions: