Closed minniekabra closed 2 years ago
When running multiple documents, the words across all documents are being used which might result in different keywords being created. I am currently working on an update to fix this, among others, but it will take some more time before it is ready to be released.
Thanks Maarten.
I would also like to highlight another issue I ran into when extracting keywords for a document that was passed as part of a batch:
E.g., document_text='I am a firm believer of equality for everyone, especially gender-based one.'
With a relevant pos_pattern provided, these are the extracted keywords I received when the document was passed in a batch:
{'am a firm believer of equality for everyone':0.78, 'firm believer of equality for everyone' :0.70, 'gender-based one' :0.60}
You can see that one extracted phrase is part of another extracted phrase: 'firm believer of equality for everyone' is contained in 'am a firm believer of equality for everyone'.
This doesn't happen when I pass the document on its own (i.e., not in a batch).
Extracted keywords I received when the document was passed on its own:
{'am a firm believer of equality for everyone':0.78, 'gender-based one' :0.60}
Questions -
- Can this be corrected in the current extract_keywords code by tweaking some parameters (when passing documents at a batch level)?
- If that cannot be done, do you plan to fix this issue?
- When will the above-mentioned issue and this issue be fixed? And how will we be notified?
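Until a fix is available, the nested phrases described above could be filtered in a post-processing step. The following is only a minimal sketch and not a KeyBERT feature: `drop_nested_phrases` is a hypothetical helper, and it assumes the keywords are held in a phrase-to-score dict like the output quoted above.

```python
def drop_nested_phrases(keywords):
    """Drop any phrase that is contained, word for word, in a higher-scoring phrase.

    `keywords` is assumed to be a dict mapping phrase -> score.
    """
    kept = {}
    # Visit phrases from highest to lowest score so longer, stronger phrases win.
    ranked = sorted(keywords.items(), key=lambda kv: kv[1], reverse=True)
    for phrase, score in ranked:
        # Pad with spaces so matching happens on whole words only
        # (e.g. 'one' is not treated as part of 'everyone').
        padded = f" {phrase} "
        if not any(padded in f" {other} " for other in kept):
            kept[phrase] = score
    return kept


batch_output = {
    'am a firm believer of equality for everyone': 0.78,
    'firm believer of equality for everyone': 0.70,
    'gender-based one': 0.60,
}
filtered = drop_nested_phrases(batch_output)
print(filtered)
```

On the batch output above, this keeps the two phrases that were also returned when the document was passed on its own.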
Currently, it is advised to feed the model one document at a time and iterate over the documents. Although you can feed it multiple documents at once, some features, like MMR, are not enabled in that mode, so it might indeed provide different results. Passing multiple documents at once is a bit faster, but it is a very simple baseline compared to iteratively feeding the model one document at a time.
I am currently working on making sure that both options provide the same results but it will take some more time. Most likely, this will make KeyBERT a bit slower processing multiple documents at once since it will also support MMR. I will make sure to send a message here when the new version is released.
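The one-document-at-a-time approach described above can be sketched as follows. This is a minimal illustration rather than official usage: `extract_one_by_one` is a hypothetical wrapper, and `kw_model` is assumed to be a `KeyBERT` instance (or any object with a compatible `extract_keywords` method).

```python
def extract_one_by_one(kw_model, docs, **kwargs):
    """Call extract_keywords once per document instead of once for the whole batch.

    Returns one list of (keyword, score) tuples per document, in input order.
    """
    return [kw_model.extract_keywords(doc, **kwargs) for doc in docs]


# Usage (requires `pip install keybert`; a model is downloaded on first run):
#   from keybert import KeyBERT
#   kw_model = KeyBERT()
#   results = extract_one_by_one(kw_model, docs, top_n=5, use_mmr=True)
```

Because every call sees only a single document, options such as `use_mmr` apply per document, at the cost of some speed compared to a single batched call.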
Sure Maarten, will wait for the correction.
@minniekabra With the new release, you can now use MMR with either a single doc or multiple docs. The output should be the same regardless!
Hello @MaartenGr, Even after updates, is it still possible to generate a single set of keywords/keyphrases based on a set of documents? For example, eight keywords/keyphrases concerning a batch of 32 texts. Could you share a code snippet?
@celsofranssa If I am not mistaken, you are referring to candidate keywords: a set of keywords that you want to be extracted from a number of documents. To do so, I think following this guide should help you out. In practice, I think it will look a bit like this:
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs, candidates=my_single_set_of_keywords_and_keyphrases)
Almost, with the difference that I would like to find the keywords/keyphrases with regard to the whole set of texts rather than per text. By the way, is there a parameter for the maximum text length?
I believe you would then need to concatenate the documents into a single document and pass them to KeyBERT. Do note though that you would need an embedding model that supports long documents for this to work properly. Instead, I would advise going document by document and simply taking the top n overall keywords.
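Taking the top n overall keywords from per-document results could look roughly like this. It is a sketch under stated assumptions: `top_n_overall` is a hypothetical helper, the input is assumed to be one list of (keyword, score) tuples per document, and keeping each phrase's best score is just one possible way to merge (summing or averaging would be alternatives).

```python
def top_n_overall(per_doc_keywords, n=8):
    """Merge per-document (keyword, score) lists and return the n best phrases.

    Assumption: when a phrase occurs in several documents, its highest score is kept.
    """
    best = {}
    for doc_keywords in per_doc_keywords:
        for phrase, score in doc_keywords:
            if score > best.get(phrase, float("-inf")):
                best[phrase] = score
    # Sort by score, highest first, and keep the first n entries.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy per-document results, made up purely for illustration.
per_doc = [
    [("equality", 0.8), ("gender", 0.5)],
    [("equality", 0.6), ("policy", 0.7)],
]
overall = top_n_overall(per_doc, n=2)
print(overall)
```

For the 32-text example, `per_doc` would hold 32 lists and `n=8` would give the eight highest-scoring phrases across all of them.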
Great, I am going to do as you recommended. Thank you very much, and congratulations on KeyBERT!
For a given document, the extracted keywords differ depending on whether the document is passed on its own or as part of a batch of multiple documents (batch size=64).
More keywords are extracted when it is passed on its own than when it is passed in a batch. Do you know why this is happening?
I am using the code below to extract keywords from a document.
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
kcv = KeyphraseTfidfVectorizer(pos_pattern = '<VB.>+<.>{0,2}<R.|J.>+<.>{0,2}<N.>+|<N.|VB.>+<.>{0,2}<N.>+|<J.|N.>+<R.>+<.>{0,2}<J.><N.>+|<R.>+<.>{0,2}<VB.|J.|N.>+|<VB.>+<.>{0,2}<N.|R.>+')#|<J.|N.>+')
kw_model = KeyBERT(model='ProsusAI/finbert')
kw_model.extract_keywords(txt, vectorizer=kcv, top_n=100, stop_words=None, use_mmr=False)