MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Enable Guided KeyBERT for seed keywords differing among docs #152

Closed shengbo-ma closed 1 year ago

shengbo-ma commented 1 year ago

This feature is proposed in Issue #151. Thanks to @MaartenGr for his guide code; this implementation is heavily based on it.

User Scenario

A user has multiple docs. Each doc has its own seed keywords to guide its extraction.

Proposed Solution

Pass a nested list of keywords to seed_keywords. The length of seed_keywords must be the same as the length of docs.

Example

from keybert import KeyBERT

list_of_docs = [
    """
        Supervised learning (SL) is a machine learning paradigm for problems 
        where the available data consists of labelled examples, 
        meaning that each data point contains features (covariates) and an associated label. 
    """,
    """
        Unsupervised learning is a type of algorithm that learns patterns from untagged data. 
        The hope is that through mimicry, which is an important mode of learning in people, 
        the machine is forced to build a concise representation of its world 
        and then generate imaginative content from it.
    """
]
list_of_seed_keywords = [["supervised", "label"], ["unsupervised"]]

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(list_of_docs, seed_keywords=list_of_seed_keywords)

Testing

Tests on this feature are added to test_guided() in tests/test_model.py.

The tests pass locally.

Files Changed

Open Questions

Although all tests for this feature pass locally, I notice there remain some open questions for which I did not find a solid answer.

Nested List?

Will it be weird to pass docs as a flat list but pass seed_keywords as a nested list?

The nested-list solution definitely works, but I wonder whether there is a solution that is more intuitive for users.

I understand that in the current version of KeyBERT a flat list of keywords indicates a set of keywords shared by all docs, so we need to keep that behavior compatible. A nested list seems like an intuitive solution for seed keywords that are not shared among docs.

However, it can be a bit confusing: in most scenarios users have a flat list of docs and a flat list of corresponding seed keywords, e.g. a pandas DataFrame where docs and seed keywords are two columns of datatype str. If seed_keywords has to be a nested list to work properly in this case, users have to convert it to a nested list when calling kw_model.extract_keywords().
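
For illustration, a minimal sketch of that conversion (not from the PR), assuming the seed keywords are stored as space-separated strings in a hypothetical DataFrame column:

import pandas as pd
from keybert import KeyBERT

# Hypothetical DataFrame: one doc and one space-separated seed-keyword string per row
df = pd.DataFrame({
    "doc": [
        "Supervised learning uses labelled examples.",
        "Unsupervised learning finds patterns in untagged data.",
    ],
    "seeds": ["supervised label", "unsupervised"],
})

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    df["doc"].tolist(),
    seed_keywords=[s.split() for s in df["seeds"]],  # convert the flat str column into a nested list
)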

Another solution that would allow flat seed keywords is to add an extra parameter indicating whether each doc has its own seeds, but I suppose that is not an elegant design. I hope we do not go for it unless we have to.

Performance?

I am not an expert on speeding up the embeddings, so please feel free to correct me if anything below is wrong.

Extracting Embedding

My implementation is a bit different from the guide in Issue #151.

When extracting embeddings, my implementation is

seed_embeddings = self.model.embed([" ".join(keywords) for keywords in seed_keywords])

Instead of

seed_embeddings = [self.model.embed([" ".join(keywords)]) for keywords in seed_keywords]

I suppose embedding all keywords in one call should be faster than embedding them one by one.
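
A rough timing sketch of that comparison (not part of the PR; the sentence-transformers model name is an assumption standing in for self.model.embed):

from timeit import timeit

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed backend
seed_keywords = [["supervised", "label"], ["unsupervised"]] * 50  # seeds for 100 hypothetical docs

def batched():
    # one embedding call covering every document's seed string
    return model.encode([" ".join(keywords) for keywords in seed_keywords])

def one_by_one():
    # one embedding call per document's seed string
    return [model.encode([" ".join(keywords)]) for keywords in seed_keywords]

print("batched:   ", timeit(batched, number=3))
print("one by one:", timeit(one_by_one, number=3))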

Update doc embeddings

Another potential optimization I am considering is to average doc_embeddings and seed_embeddings in one matrix operation before iterating over each doc embedding and calculating the scores. We can simply move the seed-keyword check out of the iteration and integrate it into the embedding-extraction snippet. Example code (see the TODOs):

        # Extract embeddings
        if seed_keywords is not None:
            # str or flat list of str
            if isinstance(seed_keywords[0], str):
                seed_embeddings = self.model.embed([" ".join(seed_keywords)])
            # nested list of str
            else:
                if len(docs) != len(seed_keywords):
                    raise ValueError("The length of docs must match the length of seed_keywords")
                else:
                    seed_embeddings = self.model.embed([" ".join(keywords) for keywords in seed_keywords])

            # TODO: update doc_embeddings here. Something like
            # doc_embedding = np.average(
            #      [doc_embeddings, seed_embeddings], axis=0, weights=[3, 1]
            # )

        # Find keywords
        all_keywords = []
        for index, _ in enumerate(docs):

            try:
                # Select embeddings
                candidate_indices = df[index].nonzero()[1]
                candidates = [words[index] for index in candidate_indices]
                candidate_embeddings = word_embeddings[candidate_indices]
                doc_embedding = doc_embeddings[index].reshape(1, -1)

                # TODO: skip this seed check
                # # Guided KeyBERT with seed keywords
                # if seed_keywords is not None:
                #     seed_index = slice(None) if isinstance(seed_keywords[0], str) else index
                #     doc_embedding = np.average(
                #         [doc_embedding, seed_embeddings[seed_index].reshape(1, -1)], axis=0, weights=[3, 1]
                #     )

                # Rest of the codes ...

Not sure how much difference it will make. Just something fun to consider.

MaartenGr commented 1 year ago

@shengbo-ma Thanks for the extensive description and the work you have put in with respect to testing, documentation, etc! I had not forgotten about this but was on holiday for a few weeks.

Nested List vs Flat List

Will it be weird to pass docs as a flat list but pass seed_keywords as a nested list?

I agree that this is a bit of a double-edged sword. On the one hand, I want to keep the API intact as much as possible, but having a flat list of seed_keywords would be the most intuitive solution for this problem. The main difficulty here is having global and local seed_keywords co-exist.

Another solution to this might be keeping seed_keywords as a flat list and using the length of seed_keywords as an indicator of whether they are local or global keywords. For example, let's say you have the following seeded keywords:

list_of_seed_keywords = ["supervised", "label", "unsupervised"]

When you have hundreds of documents, it is quite clear that the above do not relate to each document. If you were to have three documents, however, then we might consider them to each relate to a specific document.
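
As a hypothetical sketch (not what was merged), the length-based heuristic could look like this:

# Hypothetical heuristic: treat a flat list as per-document seeds only when
# its length matches the number of documents.
def seeds_are_local(docs, seed_keywords):
    return isinstance(seed_keywords[0], str) and len(seed_keywords) == len(docs)

print(seeds_are_local(["doc a", "doc b", "doc c"], ["supervised", "label", "unsupervised"]))  # True
print(seeds_are_local(["doc"] * 100, ["supervised", "label", "unsupervised"]))                # False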

A disadvantage of this, though, is the case where you have n keywords and n documents but the keywords do not each relate to a specific document. Another disadvantage of using a flat approach is that you cannot separate individual keywords, as they would need to be concatenated with a " ".join before being passed as a flat list, so additional processing would still be necessary.

The more I think about this, the more I am convinced that a nested approach is necessary, especially when you have multiple seeded keywords for a single document. This also ties in with the issues of using " ".join which I mention a bit later.

Another solution allowing flat seed keywords is to add an extra parameter indicating whether each doc has its own seeds, but it is not an elegant design I suppose. Hope not go for it unless we have to.

Yep, KeyBERT is meant as a straightforward solution so this is indeed something I want to prevent as much as possible.

Averaging before Iterating

Another potential optimization I am considering is to average doc_embeddings and seed_embeddings in one matrix operation before iterating each doc embedding and calculating the score. we can simply move the seed keyword check from inside the iteration and integrate it extracting embedding snippet. An example code can be (see the TODOs)

That should indeed speed things up a bit and perhaps make things a bit more clear if the entire seeded KeyBERT process is put into a single place.

The issue of " ".join

I thought I had removed the use of a " ".join in most places but it seems that it is still here. The issue with using this is that there are several languages out there that do not necessarily use whitespace to indicate a new word. As such, joining those keywords in this way will not work for those languages. More specifically, although you can still create the embeddings, it would be more accurate if we did not use a " ".join at all.

Instead, I believe we should create the embeddings for each seed keyword individually and then average them together. That way, the approach is language-independent. The change would be rather straightforward for a global (flat) seed keywords approach:

# Use this:
seed_embeddings = self.model.embed(seed_keywords).mean(axis=0).reshape(1, -1)

# Instead of this:
seed_embeddings = self.model.embed([" ".join(seed_keywords)])

The local (nested) approach can be done easily but that would not be as efficient as calculating everything at once:

# Use this:
seed_embeddings = np.vstack([self.model.embed(keywords).mean(axis=0).reshape(1, -1) for keywords in seed_keywords])

# Instead of this:
seed_embeddings = self.model.embed([" ".join(keywords) for keywords in seed_keywords])

Combining it all together

I think, taking all the above into account, the solution might be more minimal than I had anticipated:

if seed_keywords is not None:
    # Flat list of seed keywords
    if isinstance(seed_keywords[0], str):
        seed_embeddings = self.model.embed(seed_keywords).mean(axis=0)
        doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])
    # Nested list of seed keywords, one set for each document
    else:
        seed_embeddings = np.vstack([self.model.embed(keywords).mean(axis=0).reshape(1, -1) for keywords in seed_keywords])
        doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])

This way, we prevent the use of " ".join, the guided keyword processing is put in one location, and there is some speed-up with respect to the processing of the keywords. The only speed-up that is missing concerns the nested-list case. Here, I am still iterating over sets of keywords instead of processing them all at once. We can process them all at once, but doing so requires a bit of overhead with respect to tracking the indices of the embeddings that belong to each set of keywords.
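
A rough standalone sketch of that index tracking (not part of the PR; the sentence-transformers model name is an assumption standing in for self.model.embed):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed backend
seed_keywords = [["supervised", "label"], ["unsupervised"]]

# Embed every seed keyword in a single call, then split the rows back into per-document groups
flat_keywords = [keyword for keywords in seed_keywords for keyword in keywords]
boundaries = np.cumsum([len(keywords) for keywords in seed_keywords])[:-1]
flat_embeddings = model.encode(flat_keywords)
seed_embeddings = np.vstack(
    [group.mean(axis=0) for group in np.split(flat_embeddings, boundaries)]
)
print(seed_embeddings.shape)  # (number of documents, embedding dimension)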

EDIT: There is a deprecation warning though when using the above method for a flat list of seed keywords:

[screenshot of the numpy deprecation warning]

shengbo-ma commented 1 year ago

Hi @MaartenGr Hope you had a great holiday :)

Thanks for responding to the open questions. Glad to see them clarified so that we can make progress without confusion.

Updates

In the new implementation, extract_keywords() is modified as you suggested in your last post. All tests pass.

Discussion

Embed Local Keywords & Performance Issue

This way, we prevent the use of " ".join, the guided keyword processing is put in one location, and there is some speed-up with respect to the processing of the keywords. The only speed-up that is missing concerns the nested-list case. Here, I am still iterating over sets of keywords instead of processing them all at once. We can process them all at once, but doing so requires a bit of overhead with respect to tracking the indices of the embeddings that belong to each set of keywords.

There are two potential ways to get the seed embeddings for local keywords: embed all keywords in one flattened call and split the result back per document, or embed each document's keyword set separately.

Which way is better depends on the trade-off between the embedding cost and the flatten/split cost, especially when running on a GPU. For now, I implemented it the second way, as you suggested, which I think is super readable.

Global Keyword & Deprecation Warning Issue

A detail worth mentioning about my implementation concerns global keywords. seed_keywords now takes care of 3 scenarios:

In the case of global keywords, since multiple docs are averaged against the same seed embedding, the dimensions do not match. For example, doc_embeddings has a shape of (10, 384) while seed_embeddings has (1, 384) (or (384,) without reshape(1, -1)) when 10 docs share the seed keywords.

If we still want to use this code for the weighted average:

doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])

A deprecation warning is thrown in numpy version 1.21.6 (as mentioned in your post), while a ValueError is thrown in numpy version 1.24.1 (the current stable version).

Message of ValueError

/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py in average(a, axis, weights, returned, keepdims)
    507            [4.5]])
    508     """
--> 509     a = np.asanyarray(a)
    510 
    511     if keepdims is np._NoValue:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Try the code snippet below for details:

import numpy as np
print(np.__version__)

a = np.array([[1,1,1], [2,2,2]])
b = np.array([[10,10,10]]) # 1.24.1 throws error while 1.21.6 throws warning
# similarly if use b = np.array([10,10,10]): 1.24.1 throws error while 1.21.6 outputs no warning
print(a.shape, b.shape)
res = np.average([a, b], axis=0, weights=[3, 1])
print(res)

My current implementation uses np.repeat to repeat the seed embedding row n times, where n is the number of docs. By doing this, we can still use doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1]) for averaging without modification.
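
A minimal shape-only sketch of that np.repeat fix with random arrays (illustration, not the PR code):

import numpy as np

# Illustrative shapes: 10 docs sharing one global seed embedding
doc_embeddings = np.random.rand(10, 384)
seed_embeddings = np.random.rand(1, 384)

# Repeat the single seed row once per document so both arrays are (10, 384)
seed_embeddings = np.repeat(seed_embeddings, len(doc_embeddings), axis=0)
doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])
print(doc_embeddings.shape)  # (10, 384)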

However, np.repeat presumably costs O(n) space (I am not sure whether numpy optimizes duplicate rows in any way). I am wondering if there is a more efficient yet elegant way, such as an API similar to np.average that broadcasts the seed embedding vector against each doc embedding vector for a weighted average, but I did not find one. Another option, of course, is to do the broadcasting manually, such as

doc_embeddings = (doc_embeddings * 3 + seed_embeddings[0]) / 4

It looks less elegant though.

Edit: Additional info on the numpy deprecation warning: it is mentioned in the numpy v1.24.0 release notes under Expired Deprecations that creating a ragged array without explicitly passing dtype=object is no longer a deprecation but raises a ValueError instead. np.average([a, b]) calls np.asanyarray([a, b]), which raises the ValueError since [a, b] is a ragged array.

MaartenGr commented 1 year ago

Which way is better depends on the tradeoff between embedding cost and flatten/split cost. Especially when running on GPU. For now, I implemented it in the second way as you suggested, which I think is super readable.

Great, sounds good to me!

Another option, of course, we could do broadcasting manually such as

doc_embeddings = (doc_embeddings * 3 + seed_embeddings[0]) / 4

It look less elegant though.

I think we can actually use this and tweak it a bit such that we do not need to use any np.repeat whilst still keeping readability and a minimum number of lines of code:

# Guided KeyBERT either local (keywords per document) or global (keywords shared among documents)
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        seed_embeddings = self.model.embed(seed_keywords).mean(axis=0, keepdims=True)    
    elif len(docs) != len(seed_keywords):
        raise ValueError("The length of docs must match the length of seed_keywords")
    else:
        seed_embeddings = np.vstack([
            self.model.embed(keywords).mean(axis=0, keepdims=True)
            for keywords in seed_keywords
        ])
    doc_embeddings = ((doc_embeddings * 3 + seed_embeddings) / 4)
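
For what it's worth, no np.repeat is needed here because the arithmetic relies on plain NumPy broadcasting; a tiny shape-only illustration with random arrays (not the PR code):

import numpy as np

doc_embeddings = np.random.rand(10, 384)  # 10 documents
global_seed = np.random.rand(1, 384)      # flat case: keepdims=True keeps the leading axis
local_seeds = np.random.rand(10, 384)     # nested case: one averaged seed embedding per document

# (1, 384) broadcasts against (10, 384); (10, 384) is combined element-wise
print(((doc_embeddings * 3 + global_seed) / 4).shape)  # (10, 384)
print(((doc_embeddings * 3 + local_seeds) / 4).shape)  # (10, 384)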

This allows for the following three scenarios:

Oh, and the docstrings only need to be added the way I did in the code above. They should focus on the general goal (Guided KeyBERT) and avoid going into too much detail about each if/else statement.

shengbo-ma commented 1 year ago

@MaartenGr Cool! I've pushed the latest code.

MaartenGr commented 1 year ago

Hmm, it seems that python 3.6 should be dropped from the testing. Could you perhaps change it to 3.7 here:

https://github.com/MaartenGr/KeyBERT/blob/7b763ae76ddbca56c0d139dd368ce907cee9dd30/.github/workflows/testing.yml#L18

After that, if it passes, I'll go ahead and merge it 😄

shengbo-ma commented 1 year ago

Updated the code.

MaartenGr commented 1 year ago

@shengbo-ma Awesome, it passed! Thanks for the great and extensive work even including tests for this, it is greatly appreciated. I really liked the extensive descriptions you gave going in-depth with respect to the solutions and how they might fit with KeyBERT.

shengbo-ma commented 1 year ago

@MaartenGr Thanks! I had fun :) Glad that I can contribute to this awesome repo. Really enjoyed the communication.