MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License
3.31k stars 337 forks

Guided KeyBERT for a list of docs #151

Closed shengbo-ma closed 1 year ago

shengbo-ma commented 1 year ago

Hi @MaartenGr ,

Thanks for your effort in KeyBERT. Super helpful!

I am extracting keywords with KeyBERT on a GPU. In my case, each doc comes with a few seed keywords, which differ from doc to doc. I would like to expand these seed keywords so that I have more keywords for each doc.

Thanks to Guided KeyBERT, this works perfectly when feeding one doc and its keywords at a time. Since the number of docs is quite large, I speed the extraction up on the GPU, where batch feeding is highly preferred, as mentioned in the KeyBERT FAQ. However, I notice that Guided KeyBERT only works when fed a single doc, not a batch: when I feed a list of docs along with their seed keywords, it fails, since seed_keywords is expected to be a string or a list of strings, not a nested list. See below.

from keybert import KeyBERT

list_of_docs = [
    """
        Supervised learning (SL) is a machine learning paradigm for problems 
        where the available data consists of labelled examples, 
        meaning that each data point contains features (covariates) and an associated label. 
    """,
    """
        Unsupervised learning is a type of algorithm that learns patterns from untagged data. 
        The hope is that through mimicry, which is an important mode of learning in people, 
        the machine is forced to build a concise representation of its world 
        and then generate imaginative content from it.
    """
]
kw_model = KeyBERT()
list_of_seed_keywords = [["supervised"], ["unsupervised"]]
keywords = kw_model.extract_keywords(list_of_docs, seed_keywords=list_of_seed_keywords)

A sample error message:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-6cd1c475e80b> in <module>
     14 kw_model = KeyBERT()
     15 list_of_seed_keywords = [["supervised"], ["unsupervised"]]
---> 16 keywords = kw_model.extract_keywords(list_of_docs, seed_keywords=list_of_seed_keywords)

~/.virtualenvs/keybert/lib/python3.9/site-packages/keybert/_model.py in extract_keywords(self, docs, candidates, keyphrase_ngram_range, stop_words, top_n, min_df, use_maxsum, use_mmr, diversity, nr_candidates, vectorizer, highlight, seed_keywords, doc_embeddings, word_embeddings)
    178             word_embeddings = self.model.embed(words)
    179         if seed_keywords is not None:
--> 180             seed_embeddings = self.model.embed([" ".join(seed_keywords)])
    181 
    182         # Find keywords

TypeError: sequence item 0: expected str instance, list found
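
Until batch support lands, one workaround consistent with the single-document path described above is a plain per-document loop. This is only a sketch: `extract_one` is a hypothetical stand-in for a single-document `kw_model.extract_keywords` call.

```python
def extract_per_doc(docs, seed_lists, extract_one):
    # Pair each document with its own seed keywords and run the
    # single-document extraction path once per pair.
    return [
        extract_one(doc, seed_keywords=seeds)
        for doc, seeds in zip(docs, seed_lists)
    ]
```

This keeps each `seed_keywords` argument a flat list of strings, which is what the current API expects, at the cost of one call per document.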

It would be cool to enable Guided KeyBERT for batch feeding. My questions:

MaartenGr commented 1 year ago

Thank you for your kind words! The seed_keywords parameter defines a set of keywords towards which you would like the documents to be guided. Although this parameter can be used with batch documents, it only works if you want all documents to be guided towards a single, shared set of terms, rather than a set of terms that differs for each document. It would definitely be nice if the latter were enabled.
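
The root cause is visible without any model: `" ".join` accepts a flat list of strings but not a nested list. A plain-Python sketch of the failing line:

```python
seed_keywords = ["supervised", "learning"]
# A flat list of strings joins into one pseudo-document to embed:
joined = " ".join(seed_keywords)

list_of_seed_keywords = [["supervised"], ["unsupervised"]]
# A nested list cannot be joined -- join() expects string items,
# which is exactly the TypeError in the traceback:
try:
    " ".join(list_of_seed_keywords)
    error = None
except TypeError as e:
    error = str(e)

print(joined)  # supervised learning
print(error)   # sequence item 0: expected str instance, list found
```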

I am pretty interested and willing to see if I could contribute. Is there any suggestion on how pull requests should be made?

Feel free to open up a PR. Simplicity within KeyBERT is key; if it is something that can be implemented in a few straightforward lines, that would be preferred. We are aiming here for minimal and elegant implementations.

With respect to the implementation, it might be helpful to automatically detect whether seed_keywords is a nested list and, if so, whether it has the same length as the documents. So perhaps something like this:

# Extract embeddings (`self.model` is the embedding backend)
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        # Flat list: one shared seed embedding for all documents
        seed_embeddings = self.model.embed([" ".join(seed_keywords)])
    else:
        # Nested list: one seed embedding per document
        seed_embeddings = [self.model.embed([" ".join(keywords)]) for keywords in seed_keywords]

# Guided KeyBERT with seed keywords (`index` is the position of the current document)
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        doc_embedding = np.average(
            [doc_embedding, seed_embeddings], axis=0, weights=[3, 1]
        )
    else:
        doc_embedding = np.average(
            [doc_embedding, seed_embeddings[index]], axis=0, weights=[3, 1]
        )

# Guided KeyBERT with seed keywords <- This might be a more elegant solution:
# for a flat seed list, select the full (shared) seed embedding via slice(None);
# for a nested list, select the current document's entry.
if seed_keywords is not None:
    seed_index = slice(None) if isinstance(seed_keywords[0], str) else index
    doc_embedding = np.average(
        [doc_embedding, seed_embeddings[seed_index]], axis=0, weights=[3, 1]
    )
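
The indexing idea in the last snippet can be checked in isolation with plain NumPy (a sketch with toy embeddings; note that the flat case should select the full shared seed embedding via `slice(None)`, while the nested case selects the current document's entry):

```python
import numpy as np

d = 4  # toy embedding dimension
doc_embedding = np.ones((1, d))

# Flat seed list -> a single (1, d) embedding shared by every document
flat_seed_embeddings = np.full((1, d), 2.0)
# Nested seed lists -> one (1, d) embedding per document
nested_seed_embeddings = [np.full((1, d), 2.0), np.full((1, d), 4.0)]

def guide(doc_embedding, seed_embeddings, index, flat):
    # slice(None) keeps the whole shared embedding; otherwise pick
    # the embedding belonging to the current document.
    seed_index = slice(None) if flat else index
    return np.average(
        [doc_embedding, seed_embeddings[seed_index]], axis=0, weights=[3, 1]
    )
```

With these toy values, the flat case yields (3*1 + 1*2)/4 = 1.25 for every component, and the nested case with index 1 yields (3*1 + 1*4)/4 = 1.75, confirming both branches reduce to the same weighted average.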
shengbo-ma commented 1 year ago


Sounds great! Thanks for the guidance.

I will play with it and send a pull request.