MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Changing the sequence of docs passed to KeyBERT returns different results for each doc #169

Closed: Pratik--Patel closed this issue 1 year ago

Pratik--Patel commented 1 year ago

KeyBERT's results do not seem to be deterministic when the order of the documents passed to it changes. I have created a reproducible example as follows.

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import json
from flair.embeddings import TransformerDocumentEmbeddings

# POS-based candidate extraction plus a sentence-transformer backend loaded via flair
vectorizer = KeyphraseCountVectorizer()
embedding_model = TransformerDocumentEmbeddings('sentence-transformers/all-MiniLM-L6-v2')
model = KeyBERT(model=embedding_model)

docs_1 = ["an apple a day keeps doctor away", "strawberry is a good fruit", "microsoft acquired openai", "openai provides AI powered tools and APIs"]

# the same set of documents, in a different order
docs_2 = ["openai provides AI powered tools and APIs", "microsoft acquired openai", "strawberry is a good fruit", "an apple a day keeps doctor away"]

result_1 = model.extract_keywords(
    docs=docs_1,
    vectorizer=vectorizer,
    top_n=10,
)
print(json.dumps(result_1, indent=2))

result_2 = model.extract_keywords(
    docs=docs_2,
    vectorizer=vectorizer,
    top_n=10,
)
print(json.dumps(result_2, indent=2))

The output is as follows:

result_1 for docs_1

[
  [
    [
      "apple",
      0.8074
    ],
    [
      "doctor",
      0.7767
    ],
    [
      "day",
      0.6991
    ]
  ],
  [
    [
      "strawberry",
      0.9463
    ]
  ],
  [],
  [
    [
      "ai",
      0.8443
    ],
    [
      "apis",
      0.775
    ],
    [
      "tools",
      0.7478
    ]
  ]
]

result_2 for docs_2

[
  [
    [
      "openai",
      0.9083
    ],
    [
      "ai",
      0.8443
    ],
    [
      "tools",
      0.7478
    ]
  ],
  [
    [
      "openai",
      0.9262
    ]
  ],
  [
    [
      "good fruit",
      0.8823
    ]
  ],
  [
    [
      "apple",
      0.8074
    ],
    [
      "doctor",
      0.7767
    ],
    [
      "day",
      0.6991
    ]
  ]
]

As we can see, for the text "strawberry is a good fruit", the keyword "strawberry" is extracted in result_1 whereas "good fruit" is extracted in result_2.

Any idea why this might be happening?

MaartenGr commented 1 year ago

Could you try it without KeyphraseCountVectorizer? Perhaps there is something happening with the tokenizer there.

Pratik--Patel commented 1 year ago

Thanks for the swift reply. Not using KeyphraseCountVectorizer does seem to fix the issue, but the quality of the keyphrases suffers significantly.

Below is the code without KeyphraseCountVectorizer and the corresponding results.

result_1 = model.extract_keywords(docs_1, keyphrase_ngram_range=(1, 3), stop_words='english')
print(json.dumps(result_1, indent=2))

The result:

[
  [
    [
      "apple day",
      0.83
    ],
    [
      "apple day keeps",
      0.8135
    ],
    [
      "apple",
      0.8074
    ],
    [
      "day keeps doctor",
      0.8011
    ],
    [
      "keeps doctor",
      0.7826
    ]
  ],
  [
    [
      "strawberry good fruit",
      0.9904
    ],
    [
      "strawberry good",
      0.9584
    ],
    [
      "strawberry",
      0.9463
    ],
    [
      "fruit",
      0.8886
    ],
    [
      "good fruit",
      0.8823
    ]
  ],
  [
    [
      "microsoft acquired openai",
      1.0
    ],
    [
      "acquired openai",
      0.9409
    ],
    [
      "openai",
      0.9262
    ],
    [
      "microsoft",
      0.8121
    ],
    [
      "microsoft acquired",
      0.8052
    ]
  ],
  [
    [
      "openai provides ai",
      0.972
    ],
    [
      "openai provides",
      0.9226
    ],
    [
      "openai",
      0.9083
    ],
    [
      "ai powered tools",
      0.855
    ],
    [
      "provides ai powered",
      0.8526
    ]
  ]
]

Many of the keyphrases are subsets of a longer keyphrase, and then there are keywords like "keeps doctor" that are not very meaningful. Is there a way to include POS features or otherwise improve the quality? Using MMR and Max Sum Distance does not seem to help much. Thanks again for your help!
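(For reference, this is how MMR is switched on in extract_keywords; use_mmr and diversity are KeyBERT's documented parameters, but the diversity value of 0.7 below is only an illustrative choice.)

# Re-rank candidates with Maximal Marginal Relevance (MMR) to penalize
# near-duplicate phrases such as "apple day" vs. "apple day keeps".
result_mmr = model.extract_keywords(
    docs_1,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    use_mmr=True,    # enable MMR re-ranking
    diversity=0.7,   # 0 = pure relevance, 1 = pure diversity (illustrative value)
)

Higher diversity values trade some relevance for less overlap among the returned phrases.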

MaartenGr commented 1 year ago

Currently, the only way to include POS features is by customizing the tokenizer in the CountVectorizer. That is where, in a way, the choice of candidate tokens is made. It is the same process that KeyphraseCountVectorizer currently follows. It might also just be a bug, so posting an issue on that repository might be worthwhile.
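(A minimal sketch of that idea, not code from this thread: give scikit-learn's CountVectorizer a custom tokenizer that keeps only certain POS tags. The pos_tokenizer function, the en_core_web_sm spaCy model, and the ADJ/NOUN/PROPN filter are illustrative assumptions, not part of KeyBERT's API.)

import spacy
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

# Illustrative POS-based candidate selection (assumes spaCy's en_core_web_sm is installed)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def pos_tokenizer(text):
    # Keep only adjectives and nouns as candidate tokens (an illustrative filter)
    return [t.text for t in nlp(text) if t.pos_ in {"ADJ", "NOUN", "PROPN"}]

# lowercase=False so spaCy tags the original casing; n-grams are built from the kept tokens
pos_vectorizer = CountVectorizer(tokenizer=pos_tokenizer, ngram_range=(1, 2), lowercase=False)

model = KeyBERT("all-MiniLM-L6-v2")
print(model.extract_keywords("strawberry is a good fruit", vectorizer=pos_vectorizer))

Because the vectorizer only emits POS-filtered tokens, every candidate phrase KeyBERT scores is built from nouns and adjectives, which should rule out fragments like "keeps doctor" (the verb "keeps" never becomes a token).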

Pratik--Patel commented 1 year ago

Thanks, will follow it up there.