shengbo-ma closed this 1 year ago
@shengbo-ma Thanks for the extensive description and the work you have put in with respect to testing, documentation, etc! I had not forgotten about this but was on holiday for a few weeks.
Will it be weird to pass docs as a flat list but pass seed_keywords as a nested list?
I agree that this is a bit of a double-edged sword. On the one hand, I want to keep the API intact as much as possible, but having a flat list of seed_keywords would be the most intuitive solution for this problem. The main difficulty here is that it is hard to have global and local seed_keywords co-exist.
Another solution to this might be keeping seed_keywords as a flat list and using the length of seed_keywords as an indicator of whether they are local or global keywords. For example, let's say you have the following seeded keywords:
list_of_seed_keywords = ["supervised", "label", "unsupervised"]
When you have hundreds of documents, it is quite clear that the above do not relate to each document. If you were to have three documents, however, then we might consider them to each relate to a specific document.
A disadvantage of this though is when you have n keywords and n documents but the keywords do not each relate to a specific document. Another disadvantage of using a flat approach is that you cannot separate individual keywords, as they will need to be concatenated using a " ".join before passing as a flat list. So additional processing will still be necessary.
The more I think about this, the more I am convinced that a nested approach is necessary, especially when you have multiple seeded keywords for a single document. This also ties in with the issues of using " ".join, which I mention a bit later.
Another solution allowing flat seed keywords is to add an extra parameter indicating whether each doc has its own seeds, but that is not an elegant design, I suppose. I hope we don't go for it unless we have to.
Yep, KeyBERT is meant as a straightforward solution so this is indeed something I want to prevent as much as possible.
Another potential optimization I am considering is to average doc_embeddings and seed_embeddings in one matrix operation before iterating over each doc embedding and calculating the score. We can simply move the seed keyword check from inside the iteration and integrate it into the embedding-extraction snippet. An example code can be found below (see the TODOs).
That should indeed speed things up a bit and perhaps make things a bit more clear if the entire seeded KeyBERT process is put into a single place.
I thought I had removed the use of " ".join in most places but it seems that it is still here. The issue with using this is that there are several languages out there that do not necessarily use whitespace to indicate a new word. As such, joining those keywords in this way will not work for those languages. More specifically, although you can still create the embeddings, it would be more accurate if we did not use " ".join at all.
Instead, I believe we should create the embeddings for each seed keyword individually and then average them together. That way, the approach is language-independent. The change would be rather straightforward for a global (flat) seed keywords approach:
# Use this:
seed_embeddings = self.model.embed(seed_keywords).mean(axis=0).reshape(1, -1)
# Instead of this:
seed_embeddings = self.model.embed([" ".join(seed_keywords)])
The local (nested) approach can be done easily but that would not be as efficient as calculating everything at once:
# Use this:
seed_embeddings = np.vstack([self.model.embed(keywords).mean(axis=0).reshape(1, -1) for keywords in seed_keywords])
# Instead of this:
seed_embeddings = self.model.embed([" ".join(keywords) for keywords in seed_keywords])
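To make the shapes concrete, here is a minimal sketch of the two variants with a toy `embed` function standing in for `self.model.embed` (the real backend is a sentence-transformer; this stand-in only returns one vector per string so the shapes can be checked):

```python
import numpy as np

# Hypothetical stand-in for self.model.embed: maps each string to a
# small fixed-size vector. NOT the real model, just shape-compatible.
def embed(texts):
    return np.vstack([np.full(4, float(len(t))) for t in texts])

# Global (flat) seed keywords: embed each keyword, then average.
flat_seeds = ["supervised", "label", "unsupervised"]
flat_seed_embeddings = embed(flat_seeds).mean(axis=0).reshape(1, -1)
print(flat_seed_embeddings.shape)  # (1, 4)

# Local (nested) seed keywords: one averaged vector per document.
nested_seeds = [["supervised", "label"], ["unsupervised"]]
nested_seed_embeddings = np.vstack(
    [embed(keywords).mean(axis=0).reshape(1, -1) for keywords in nested_seeds]
)
print(nested_seed_embeddings.shape)  # (2, 4)
```

Either way, each keyword is embedded individually and only the vectors are averaged, so no whitespace-based joining is involved.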
I think, taking all the above into account, the solution might be more minimal than I had anticipated:
if seed_keywords is not None:
    # Flat list of seed keywords
    if isinstance(seed_keywords[0], str):
        seed_embeddings = self.model.embed(seed_keywords).mean(axis=0)
        doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])
    # Nested list of seed keywords, one set for each document
    else:
        seed_embeddings = np.vstack([self.model.embed(keywords).mean(axis=0).reshape(1, -1) for keywords in seed_keywords])
        doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])
This way, we prevent the use of " ".join, the guided keyword processing is put in one location, and there is some speed-up with respect to the processing of the keywords. The only speed-up that is missing concerns the nested list of seed keywords. Here, I am still iterating over sets of keywords instead of processing them all at once. We can process them all at once, but doing so requires a bit of overhead with respect to tracking the indices of the embeddings that belong to each set of keywords.
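For reference, that all-at-once variant could be sketched roughly as below. The index bookkeeping uses `np.add.reduceat`; `embed` is a hypothetical stand-in for the real embedding backend, and this is only an illustration of the idea, not code from the PR:

```python
import numpy as np

# Hypothetical stand-in for the embedding backend (one vector per string).
def embed(texts):
    return np.vstack([np.full(3, float(len(t))) for t in texts])

nested_seeds = [["supervised", "label"], ["unsupervised"], ["clustering", "kmeans"]]

# Flatten all keyword sets so a SINGLE embedding call covers everything,
# while remembering where each document's keywords start.
flat = [kw for keywords in nested_seeds for kw in keywords]
counts = np.array([len(keywords) for keywords in nested_seeds])
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))

all_embeddings = embed(flat)  # shape: (total_keywords, dim)

# Sum each document's slice of rows, then divide by its keyword count
# to get one averaged seed embedding per document.
seed_embeddings = np.add.reduceat(all_embeddings, starts, axis=0) / counts[:, None]
print(seed_embeddings.shape)  # (3, 3)
```

The result matches the per-document loop, at the cost of the extra index tracking mentioned above.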
EDIT: There is a deprecation warning though when using the above method for a flat list of seed keywords:
Hi @MaartenGr Hope you had a great holiday :)
Thanks for responding to the open questions. Glad to see them clarified so that we can make progress without confusion.
In the new implementation, extract_keywords() is modified as you suggested in the last post. All tests passed.
This way, we prevent the use of " ".join, the guided keyword processing is put in one location, and there is some speed-up with respect to the processing of the keywords. The only speed-up that is missing concerns the nested list of seed keywords. Here, I am still iterating over sets of keywords instead of processing them all at once. We can process them all at once, but doing so requires a bit of overhead with respect to tracking the indices of the embeddings that belong to each set of keywords.
There are two potential ways to get seed embeddings for local keywords
Which way is better depends on the tradeoff between embedding cost and flatten/split cost. Especially when running on GPU. For now, I implemented it in the second way as you suggested, which I think is super readable.
A detail worth mentioning in my implementation is about global keywords. seed_keywords now takes care of 3 scenarios:
In the case of global keywords, since multiple docs are averaged against the same seed embedding, the dimensions do not match. For example, doc_embeddings has a shape of (10, 384) while seed_embeddings has (1, 384) (or (384,) without reshape(1, -1)) when 10 docs share the seed keywords.
If we still want to use this code for weighted average
doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1])
A deprecation warning is thrown in numpy version 1.21.6 (as mentioned in your post), while a ValueError is thrown in numpy version 1.24.1 (the current stable version).
Message of the ValueError:
/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py in average(a, axis, weights, returned, keepdims)
507 [4.5]])
508 """
--> 509 a = np.asanyarray(a)
510
511 if keepdims is np._NoValue:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Try the code snippet below for details:
import numpy as np
print(np.__version__)
a = np.array([[1,1,1], [2,2,2]])
b = np.array([[10,10,10]]) # 1.24.1 throws error while 1.21.6 throws warning
# similarly if use b = np.array([10,10,10]): 1.24.1 throws error while 1.21.6 outputs no warning
print(a.shape, b.shape)
res = np.average([a, b], axis=0, weights=[3, 1])
print(res)
My current implementation is to np.repeat the seed embedding row n times, where n is the count of docs. By doing this, we could still use doc_embeddings = np.average([doc_embeddings, seed_embeddings], axis=0, weights=[3, 1]) for averaging without modification.
However, np.repeat may cost O(n) space, I suppose (not sure if numpy optimizes duplicate rows in any way). I am wondering if there is a more efficient yet elegant way, such as a simple API similar to np.average that broadcasts the seed embedding vector against each doc embedding vector for a weighted average, but I did not find one.
Another option, of course, is to do the broadcasting manually, such as
doc_embeddings = (doc_embeddings * 3 + seed_embeddings[0]) / 4
It looks less elegant though.
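As a side note, numpy's broadcasting already handles the (1, 384)-versus-(10, 384) case in plain arithmetic, so the manual version needs neither np.repeat nor the [0] indexing. A quick check on toy data (random arrays, not real embeddings) showing that the two approaches agree:

```python
import numpy as np

# Toy stand-ins: 10 "document" embeddings and one shared seed embedding.
doc_embeddings = np.random.rand(10, 384)
seed_embeddings = np.random.rand(1, 384)

# Option 1: repeat the seed row so np.average sees matching shapes.
repeated = np.repeat(seed_embeddings, len(doc_embeddings), axis=0)
avg_repeat = np.average([doc_embeddings, repeated], axis=0, weights=[3, 1])

# Option 2: plain arithmetic; the (1, 384) row broadcasts against
# (10, 384) automatically, so no repeat is needed.
avg_broadcast = (doc_embeddings * 3 + seed_embeddings) / 4

print(np.allclose(avg_repeat, avg_broadcast))  # True
```

Keeping seed_embeddings as a 2-D (1, 384) array is what makes the broadcast line up row-wise.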
Edit:
Additional info on the numpy deprecation warning:
It is mentioned in the release notes of numpy v1.24.0 under "Expired deprecations": creating a ragged array without explicitly passing dtype=object is no longer a deprecation, but raises a ValueError instead.
np.average([a, b]) calls np.asanyarray([a, b]), which raises the ValueError, since [a, b] is a ragged array.
Which way is better depends on the tradeoff between embedding cost and flatten/split cost. Especially when running on GPU. For now, I implemented it in the second way as you suggested, which I think is super readable.
Great, sounds good to me!
Another option, of course, is to do the broadcasting manually, such as
doc_embeddings = (doc_embeddings * 3 + seed_embeddings[0]) / 4
It looks less elegant though.
I think we can actually use this and tweak it a bit such that we do not need np.repeat at all, whilst still keeping readability and a minimum number of lines of code:
# Guided KeyBERT with either global (keywords shared among documents) or local (keywords per document) seeds
if seed_keywords is not None:
    if isinstance(seed_keywords[0], str):
        seed_embeddings = self.model.embed(seed_keywords).mean(axis=0, keepdims=True)
    elif len(docs) != len(seed_keywords):
        raise ValueError("The length of docs must match the length of seed_keywords")
    else:
        seed_embeddings = np.vstack([
            self.model.embed(keywords).mean(axis=0, keepdims=True)
            for keywords in seed_keywords
        ])
    doc_embeddings = (doc_embeddings * 3 + seed_embeddings) / 4
This allows for the following three scenarios:
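Roughly, the three scenarios are: no seed keywords, a flat (global) list shared by all documents, and a nested (per-document) list. A minimal sketch of the three call patterns, where `guide` is an illustrative wrapper name (not the actual KeyBERT API) and `embed` is a hypothetical stand-in for the real embedding backend:

```python
import numpy as np

# Hypothetical stand-in embedder: one small vector per string.
def embed(texts):
    return np.vstack([np.full(4, float(len(t))) for t in texts])

docs = ["doc about cats", "doc about dogs"]
doc_embeddings = embed(docs)

def guide(doc_embeddings, docs, seed_keywords):
    if seed_keywords is not None:
        if isinstance(seed_keywords[0], str):
            # Scenario 2: flat list, shared among all documents
            seed_embeddings = embed(seed_keywords).mean(axis=0, keepdims=True)
        elif len(docs) != len(seed_keywords):
            raise ValueError("The length of docs must match the length of seed_keywords")
        else:
            # Scenario 3: nested list, one set of keywords per document
            seed_embeddings = np.vstack(
                [embed(keywords).mean(axis=0, keepdims=True) for keywords in seed_keywords]
            )
        doc_embeddings = (doc_embeddings * 3 + seed_embeddings) / 4
    # Scenario 1: seed_keywords is None -> embeddings unchanged
    return doc_embeddings

out1 = guide(doc_embeddings, docs, None)                 # no guidance
out2 = guide(doc_embeddings, docs, ["pets", "animals"])  # global seeds
out3 = guide(doc_embeddings, docs, [["cats"], ["dogs"]]) # per-doc seeds
print(out1.shape, out2.shape, out3.shape)  # (2, 4) (2, 4) (2, 4)
```

A mismatched nested list (e.g. three keyword sets for two docs) raises the ValueError from the length check.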
Oh, and the docstrings only need to be added the way I did in the code above. It focuses on the general goal (Guided KeyBERT) and prevents going into too much detail about each if/else statement.
@MaartenGr Cool! I've pushed the latest code.
Hmm, it seems that python 3.6 should be dropped from the testing. Could you perhaps change it to 3.7 here:
After that, if it passes, I'll go ahead and merge it 😄
Updated the code.
@shengbo-ma Awesome, it passed! Thanks for the great and extensive work even including tests for this, it is greatly appreciated. I really liked the extensive descriptions you gave going in-depth with respect to the solutions and how they might fit with KeyBERT.
@MaartenGr Thanks! I had fun :) Glad that I can contribute to this awesome repo. Really enjoyed the communication.
This feature is proposed in Issue #151. Thanks to @MaartenGr for his guide code; this implementation is heavily based on it.
User Scenario
A user has multiple docs. Each doc has its own seed keywords to guide extraction.
Proposed Solution
Passing a nested list of keywords to seed_keywords. The length of seed_keywords is the same as that of docs.
Example
Testing
Tests on this feature are added to test_guided() in tests/test_model.py. The tests passed in my local environment.
Files Changed
seed_keywords
test_guided()
.DS_Store
Open Questions
Although all tests of this feature passed in my local environment, I notice there remain some open questions for which I did not find a solid answer.
Nested List?
Will it be weird to pass docs as a flat list but pass seed_keywords as a nested list?
The nested-list solution definitely works, but I am wondering if there is a solution that is more intuitive to users.
I understand that a flat list of keywords indicates a shared set of keywords in the current version of KeyBERT, so we need to keep that compatible. A nested list seems an intuitive solution for seed keywords that are not shared among docs.
However, it can be a bit confusing: in most scenarios users have a flat list of docs and a flat list of corresponding seed keywords, e.g. a pandas dataframe where docs and seed keywords are two columns of datatype str. If seed_keywords has to be a nested list to work properly in this case, users have to convert it to a nested list when calling kw_model.extract_keywords().
Another solution allowing flat seed_keywords is to add an extra parameter indicating whether each doc has its own seeds, but that is not an elegant design, I suppose. I hope we don't go for it unless we have to.
Performance?
I am not an expert on speeding up embeddings. Please feel free to correct me if anything below is wrong.
Extracting Embedding
My implementation is a bit different from the guide in Issue #151.
When extracting embeddings, my implementation is
Instead of
This is because I suppose embedding all keywords at one time should be faster than embedding them one by one.
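As a small illustration (with a toy `embed` stand-in; on a real transformer backend the batched call is typically where the speed-up comes from), the batched call and the per-keyword loop produce the same embeddings:

```python
import numpy as np

# Hypothetical stand-in embedder: one small vector per string. The real
# backend benefits from batching; this only demonstrates equivalence.
def embed(texts):
    return np.vstack([np.full(4, float(len(t))) for t in texts])

keywords = ["supervised", "label", "unsupervised"]

# One batched call over all keywords...
batched = embed(keywords)

# ...versus one call per keyword.
one_by_one = np.vstack([embed([kw]) for kw in keywords])

print(np.allclose(batched, one_by_one))  # True
```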
Update doc embeddings
Another potential optimization I am considering is to average doc_embeddings and seed_embeddings in one matrix operation before iterating over each doc embedding and calculating the score. We can simply move the seed keyword check from inside the iteration and integrate it into the embedding-extraction snippet. An example code can be found below (see the TODOs).
Not sure how much difference it will make. Just something fun to consider.