MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

contains Duplicated keywords #93

Open Sarathy-R opened 2 years ago

Sarathy-R commented 2 years ago

Hi Team,

First of all, kudos to your work!!

The KeyBERT model produces duplicate keywords such as "Optimize" and "Optimization". Could you please help me resolve this issue?

MaartenGr commented 2 years ago

Thank you for your kind words. It is very likely that similar keywords will appear in the output, as KeyBERT simply searches for the best-matching keywords. To prevent this from happening, I would suggest using MMR and increasing the diversity parameter. The higher the diversity, the less likely you are to get duplicate keywords. Make sure not to set it too high, though, as that will result in lower-quality keywords. Finding a good balance for your use case is key here.
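For example, roughly like this (the diversity value is just a starting point; tune it for your data):

    from keybert import KeyBERT

    doc = "We optimize the training pipeline ... optimization of the model ..."  # example text
    kw_model = KeyBERT()

    # use_mmr enables Maximal Marginal Relevance; a higher diversity value
    # pushes the selected keywords further apart semantically
    keywords = kw_model.extract_keywords(doc, top_n=5, use_mmr=True, diversity=0.7)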

krlng commented 2 years ago

I realized that the diversification gives completely different results for multiple documents, depending on whether you pass them in together as a list or one by one in a loop:

keywords = [kw_model.extract_keywords(doc, top_n=6, use_mmr=True) for doc in docs] gives, for example:

[[('nlp', 0.4305), ('supercomputer', 0.2911), ('dimensionality', 0.2567), ('transformer', 0.2811), ('tokenization', 0.3741), ('multilingual', 0.3407)], [('tools', 0.3889), ('pyforest', 0.3587), ('workflow', 0.3246), ('technical', 0.3832), ('errors', 0.2197), ('development', 0.3828)], [('notebooks', 0.5781), ('documentation', 0.3977), ('jupyter', 0.4037), ('json', 0.2959), ('executable', 0.1036), ('implementations', 0.3206)]]

While keywords = kw_model.extract_keywords(docs, top_n=6, use_mmr=True) gives:

[[('learning', 0.3452), ('language', 0.3585), ('languages', 0.3601), ('semantic', 0.3711), ('tokenization', 0.3741), ('nlp', 0.4305)], [('pyforest', 0.3587), ('developers', 0.3759), ('technology', 0.3769), ('development', 0.3828), ('technical', 0.3832), ('tools', 0.3889)], [('ipython', 0.3598), ('documentation', 0.3977), ('jupytext', 0.4018), ('jupyter', 0.4037), ('notebook', 0.5437), ('notebooks', 0.5781)]]

I guess if you hand them in as a list, it diversifies across the combined vocabulary and thereby accepts much more similarity between keywords. Not sure if this is a bug, but I expected different behavior and at first thought it was not working at all.

MaartenGr commented 2 years ago

Yes, when passing multiple documents at once we create word embeddings for the entire vocabulary and then apply cosine similarity between the word embeddings and the document embeddings to quickly find related keywords. This is quite fast for large documents but slows down significantly when applying MMR which still needs some optimization.
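Conceptually, that multi-document path looks something like this simplified sketch (not the actual implementation, and without MMR):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import SentenceTransformer

    docs = ["first example document ...", "second example document ..."]
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # one shared vocabulary for all documents, embedded only once
    vocab = CountVectorizer(stop_words="english").fit(docs).get_feature_names_out()
    word_embeddings = model.encode(list(vocab))
    doc_embeddings = model.encode(docs)

    # rank candidate words per document by plain cosine similarity
    top_n = 6
    for doc_embedding in doc_embeddings:
        sims = cosine_similarity([doc_embedding], word_embeddings)[0]
        print([vocab[i] for i in sims.argsort()[-top_n:][::-1]])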

Thus, when passing multiple documents at once, MMR does not actually work properly. To use MMR or Max Sum Similarity, I would advise iterating over the documents and extracting keywords one document at a time. It is not the cleanest solution, and I might change it in the future as it can be quite confusing.
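In other words, for now the per-document loop shown above is the way to go when you want MMR:

    from keybert import KeyBERT

    docs = ["first document ...", "second document ..."]
    kw_model = KeyBERT()

    # MMR is applied per document, exactly as intended
    keywords = [
        kw_model.extract_keywords(doc, top_n=6, use_mmr=True, diversity=0.5)
        for doc in docs
    ]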

akshaydhok07 commented 2 years ago

For large documents, truncation happens when the document embedding is created. Keywords that come from the truncated part then do not match the document embedding, resulting in poor keywords. Is there any way to apply KeyBERT to large documents?

MaartenGr commented 2 years ago

@akshaydhok07 Truncation depends on the embedding model you are using. Sentence-transformers typically handle sentence- to paragraph-length documents quite well but do indeed truncate longer documents. Instead, an embedding model like Doc2Vec might be more helpful in your use case. It might even be interesting to look at the Longformer, which can handle longer documents.

One other thing you can do is simply slice the documents into paragraphs and extract keywords from those. If you expect different keywords per paragraph, then slicing would be a good option.
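As a rough sketch (splitting on blank lines is just one possible heuristic; any sensible splitter works):

    from keybert import KeyBERT

    long_document = "First paragraph ...\n\nSecond paragraph ..."  # placeholder text
    paragraphs = [p.strip() for p in long_document.split("\n\n") if p.strip()]

    kw_model = KeyBERT()
    keywords_per_paragraph = [
        kw_model.extract_keywords(paragraph, top_n=5) for paragraph in paragraphs
    ]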

adityashukzy commented 1 year ago

@MaartenGr Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos!

Do you have a tutorial/guide on how one would go about doing what you have suggested here? I'm currently using all-MiniLM-L6-v2 as well, and am experiencing the same issue: the embeddings are created only after the excess input text towards the end has been truncated.

While I'm also exploring other avenues for embeddings such as Doc2Vec or Longformer, I would like to try out the approach you mentioned: namely, splitting a long document into paragraphs, extracting keywords from each, and averaging to derive a single overall doc_embedding.

Would the procedure be something as follows (see the sketch after the list)?

  1. Break the document up into paragraphs.
  2. Encode each paragraph's text to get its embedding.
  3. Collate all of these paragraph-level embeddings into a list.
  4. Run np.mean() on that list to get a single overall embedding.
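Roughly like this sketch (the model name and the blank-line splitting are just my assumptions):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    document = open("paper.txt").read()  # hypothetical long research paper
    paragraphs = [p for p in document.split("\n\n") if p.strip()]

    # encode each paragraph separately so nothing gets truncated
    paragraph_embeddings = model.encode(paragraphs)

    # average into one overall document embedding
    doc_embedding = np.mean(paragraph_embeddings, axis=0)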

Also, how would you compare this approach to using Flair instead (i.e., DocumentPoolEmbeddings), which would presumably perform this pooling at the word level rather than the paragraph level?
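For reference, the Flair setup I have in mind is roughly this (GloVe is just an example, and I'm assuming the pooled model can be passed straight into KeyBERT):

    from flair.data import Sentence
    from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
    from keybert import KeyBERT

    glove = WordEmbeddings("glove")
    pooled = DocumentPoolEmbeddings([glove])  # averages word-level embeddings into one vector

    # sanity check: one embedding for the whole text
    sentence = Sentence("Keyword extraction from long research papers")
    pooled.embed(sentence)

    kw_model = KeyBERT(model=pooled)  # assuming KeyBERT's Flair support accepts this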

My use case is extracting keywords and keyphrases from 7-10 page research papers, which often run to about 2,000-5,000 words.