Why are the extracted keywords different for a document when it is processed solo vs. in a batch of multiple documents?

MaartenGr / KeyBERT

Minimal keyword extraction with BERT

https://MaartenGr.github.io/KeyBERT/

MIT License


Why are the extracted keywords different for a document when it is processed solo vs. in a batch of multiple documents? #118

Closed minniekabra closed 2 years ago

minniekabra commented 2 years ago

For a given document, the extracted keywords differ depending on whether the document is passed in solo or as part of a batch of multiple documents (batch size = 64).

Also, more keywords are extracted when it is passed in solo than when it is passed in a batch. Do you know why this is happening?

I am using the code below to extract keywords from a document.

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseTfidfVectorizer

kcv = KeyphraseTfidfVectorizer(pos_pattern='<VB.>+<.>{0,2}<R.|J.>+<.>{0,2}<N.>+|<N.|VB.>+<.>{0,2}<N.>+|<J.|N.>+<R.>+<.>{0,2}<J.><N.>+|<R.>+<.>{0,2}<VB.|J.|N.>+|<VB.>+<.>{0,2}<N.|R.>+')  # |<J.|N.>+')

kw_model = KeyBERT(model='ProsusAI/finbert')
kw_model.extract_keywords(txt, vectorizer=kcv, top_n=100, stop_words=None, use_mmr=False)

MaartenGr commented 2 years ago

When running multiple documents, the words across all documents are being used which might result in different keywords being created. I am currently working on an update to fix this, among others, but it will take some more time before it is ready to be released.
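To make the shared-vocabulary effect concrete, here is a toy model in plain Python. It is only an illustration of how a vocabulary pooled across documents could change one document's candidates; it is not KeyBERT's actual code, and the function names, the sample texts, and the substring matching rule are all assumptions made for this sketch.

```python
from typing import Dict, Set


def solo_candidates(doc: str, phrases_by_doc: Dict[str, Set[str]]) -> Set[str]:
    """Solo run (toy model): only phrases extracted from this document are candidates."""
    return set(phrases_by_doc[doc])


def batch_candidates(doc: str, phrases_by_doc: Dict[str, Set[str]]) -> Set[str]:
    """Batch run (toy model): any phrase in the pooled vocabulary that occurs in the text."""
    vocabulary = set().union(*phrases_by_doc.values())
    return {phrase for phrase in vocabulary if phrase in doc}


# Hypothetical documents and the phrases a hypothetical extractor found in each one.
doc_a = "i am a firm believer of equality for everyone"
doc_b = "a firm believer of equality for everyone is rare"
phrases_by_doc = {
    doc_a: {"am a firm believer of equality for everyone"},
    doc_b: {"firm believer of equality for everyone"},
}

print(sorted(solo_candidates(doc_a, phrases_by_doc)))
print(sorted(batch_candidates(doc_a, phrases_by_doc)))
```

In this toy setup, the batch run gives the first document an extra candidate that was originally extracted from the second document, so the two modes can rank different sets of phrases.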

minniekabra commented 2 years ago

Thanks Maarten.

I would also like to highlight another issue that I ran into when extracting keywords for a document passed in as part of a batch.

E.g., document_text = 'I am a firm believer of equality for everyone, especially gender-based one.'

After providing a relevant pos_pattern, these are the extracted keywords I received when the document was passed in a batch:

{'am a firm believer of equality for everyone': 0.78, 'firm believer of equality for everyone': 0.70, 'gender-based one': 0.60}

You can see that one extracted phrase is contained in another: 'firm believer of equality for everyone' is part of 'am a firm believer of equality for everyone'.

This doesn't happen when I pass the document in solo (i.e., not in a batch).

These are the extracted keywords I received when the document was passed in solo:

{'am a firm believer of equality for everyone': 0.78, 'gender-based one': 0.60}

Questions -

Can we correct this in the current extract_keywords code by tweaking some parameters (when passing documents at the batch level)?
If the above cannot be done, do you plan to fix this issue?
By when will the above-mentioned issue and this issue be rectified? And how will we get to know about it?
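As a possible stopgap for the nested-phrase overlap described above, the batch output could be post-filtered so that any phrase fully contained in a longer extracted phrase is dropped. This is a minimal sketch, not a KeyBERT feature: `drop_nested_phrases` is a hypothetical helper, and it assumes the keywords are held in a dict of phrase to score, as in the example output above (KeyBERT itself returns a list of `(phrase, score)` tuples, which `dict(...)` converts).

```python
from typing import Dict


def drop_nested_phrases(keywords: Dict[str, float]) -> Dict[str, float]:
    """Remove every phrase that is wholly contained in a longer kept phrase."""
    kept: Dict[str, float] = {}
    # Visit longer phrases first so each shorter phrase is checked against them.
    for phrase in sorted(keywords, key=len, reverse=True):
        # Pad with spaces so matches respect word boundaries.
        padded = f" {phrase} "
        if not any(padded in f" {longer} " for longer in kept):
            kept[phrase] = keywords[phrase]
    # Return the survivors in descending score order.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


batch_result = {
    'am a firm believer of equality for everyone': 0.78,
    'firm believer of equality for everyone': 0.70,
    'gender-based one': 0.60,
}
print(drop_nested_phrases(batch_result))
```

On the batch output quoted above, this leaves the same two phrases that the solo run returned. Whether the shorter or the longer phrase should win is a design choice; this sketch always keeps the longer one.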

MaartenGr commented 2 years ago

Currently, it is advised to feed the model one document at a time and iterate over the documents. Although you can feed it multiple documents at once, some features, like MMR, are not enabled in that mode, so it might indeed provide different results. Processing multiple documents at once is a bit faster, but it is a very simple baseline compared to iteratively feeding the model one document.
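The one-document-at-a-time approach can be sketched as a small wrapper that calls the extractor separately for each document. The helper name `extract_one_by_one` and the stand-in `toy_extractor` are hypothetical and exist only so the example runs without downloading a model; the commented lines at the end show how the same wrapper might be pointed at KeyBERT.

```python
from typing import Callable, List, Sequence, Tuple

Keyword = Tuple[str, float]


def extract_one_by_one(
    docs: Sequence[str],
    extract_fn: Callable[[str], List[Keyword]],
) -> List[List[Keyword]]:
    """Run the extractor once per document so nothing is shared across documents."""
    return [extract_fn(doc) for doc in docs]


def toy_extractor(doc: str) -> List[Keyword]:
    """Hypothetical stand-in for a real extractor: scores words by their length."""
    words = sorted(set(doc.lower().split()), key=lambda w: (-len(w), w))
    return [(word, float(len(word))) for word in words[:2]]


docs = ["equality for everyone", "gender based pay gap"]
results = extract_one_by_one(docs, toy_extractor)
print(results)

# With KeyBERT installed, the same helper could wrap the real model, e.g.:
#   from keybert import KeyBERT
#   kw_model = KeyBERT()
#   results = extract_one_by_one(docs, lambda d: kw_model.extract_keywords(d, top_n=5))
```

Because each call sees a single document, the result for a document does not depend on which other documents happen to be processed alongside it.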

I am currently working on making sure that both options provide the same results but it will take some more time. Most likely, this will make KeyBERT a bit slower processing multiple documents at once since it will also support MMR. I will make sure to send a message here when the new version is released.

minniekabra commented 2 years ago

Sure Maarten, will wait for the correction.

MaartenGr commented 2 years ago

@minniekabra With the new release, you can now use MMR with either a single doc or multiple docs. The output should be the same regardless!

celsofranssa commented 2 years ago

Hello @MaartenGr, Even after updates, is it still possible to generate a single set of keywords/keyphrases based on a set of documents? For example, eight keywords/keyphrases concerning a batch of 32 texts. Could you share a code snippet?

MaartenGr commented 2 years ago

@celsofranssa If I am not mistaken, you are referring to candidate keywords: a set of keywords that you want to be extracted from a number of documents. To do so, I think following this guide should help you out. In practice, I think it will look a bit like this:

from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs, candidates=my_single_set_of_keywords_and_keyphrases)

celsofranssa commented 2 years ago

Almost, with the difference that I would like to find the keywords/keyphrases with regard to the whole set of texts rather than per text. By the way, is there a parameter for the maximum text length?

MaartenGr commented 2 years ago

I believe you would then need to concatenate the documents into a single document and pass them to KeyBERT. Do note though that you would need an embedding model that supports long documents for this to work properly. Instead, I would advise going document by document and simply taking the top n overall keywords.
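Going document by document and then taking the top n overall could look roughly like the sketch below. The helper `top_n_overall` is hypothetical, the scores in the sample data are made up for illustration, and the merge rule (a phrase that appears in several documents keeps its highest score) is an assumption; summing or averaging the scores would be equally defensible.

```python
from typing import Dict, List, Tuple

Keyword = Tuple[str, float]


def top_n_overall(per_doc_keywords: List[List[Keyword]], n: int = 8) -> List[Keyword]:
    """Merge per-document keyword lists and keep the n best-scoring phrases."""
    best: Dict[str, float] = {}
    for doc_keywords in per_doc_keywords:
        for phrase, score in doc_keywords:
            # Assumption: a phrase seen in several documents keeps its highest score.
            if score > best.get(phrase, float("-inf")):
                best[phrase] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Illustrative per-document results, shaped like KeyBERT's (phrase, score) tuples.
per_doc = [
    [("gender equality", 0.71), ("firm believer", 0.55)],
    [("equal pay", 0.64), ("gender equality", 0.68)],
]
print(top_n_overall(per_doc, n=2))
```

For the case raised earlier in the thread, `per_doc` would hold the 32 per-text results and `n` would be 8.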

celsofranssa commented 2 years ago

Great, I am going to do as you recommended. Thank you very much, and congratulations on KeyBERT!