MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Failed to use KeyphraseCountVectorizer with Keybert 0.7.0 #141

Open homepsyc opened 1 year ago

homepsyc commented 1 year ago

KeyphraseCountVectorizer used to work with KeyBERT, but with 0.7.0 it now fails in highlight_document():

def highlight_document(
    doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer
):
    keywords_only = [keyword for keyword, _ in keywords]
    max_len = vectorizer.ngram_range[1]

Exception says:

AttributeError: 'KeyphraseCountVectorizer' object has no attribute 'ngram_range'

which is expected for KeyphraseCountVectorizer.

Please kindly advise, thanks.

MaartenGr commented 1 year ago

Could you share your entire code for getting this error? Seeing your code and having a reproducible example would help in identifying what is exactly happening here.

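homepsyc commented 1 year ago

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

model = KeyBERT()  # assumed: the original snippet did not show how model was created
vectorizer = KeyphraseCountVectorizer()
# doc is the input document (a str); it was not shown in the original snippet
keyword4 = model.extract_keywords(doc, vectorizer=vectorizer, highlight=True)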

Thanks @MaartenGr

MaartenGr commented 1 year ago

Ah, it seems that KeyphraseCountVectorizer discovers its own n-gram range and therefore does not have an ngram_range attribute. Moreover, since it is technically not a scikit-learn CountVectorizer, other functions, like build_tokenizer(), are missing. Unfortunately, there is no quick fix for this, as it may require large changes on either side.
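
To illustrate the ngram_range part of the problem, here is a minimal sketch of a defensive lookup. It is not KeyBERT's actual code and not a proposed fix. NgramVectorizer, PatternVectorizer, and max_keyword_length are hypothetical names used only for this example. The fallback (taking the longest extracted keyword, counted in whitespace-separated tokens) is an assumption. The missing build_tokenizer() would still need separate handling.

```python
from typing import List, Optional, Tuple


class NgramVectorizer:
    """Hypothetical stand-in for a CountVectorizer-like object."""
    ngram_range = (1, 2)


class PatternVectorizer:
    """Hypothetical stand-in for a vectorizer that has no ngram_range."""


def max_keyword_length(keywords: List[Tuple[str, float]], vectorizer: object) -> int:
    """Return the longest n-gram length to consider when highlighting."""
    ngram_range: Optional[Tuple[int, int]] = getattr(vectorizer, "ngram_range", None)
    if ngram_range is not None:
        return ngram_range[1]
    # Fallback (an assumption): use the longest extracted keyword,
    # measured in whitespace-separated tokens.
    return max((len(keyword.split()) for keyword, _ in keywords), default=1)


keywords = [
    ("keyword extraction", 0.71),
    ("bert", 0.55),
    ("minimal keyword extraction", 0.52),
]
print(max_keyword_length(keywords, NgramVectorizer()))    # 2
print(max_keyword_length(keywords, PatternVectorizer()))  # 3
```

With a guard like this, a vectorizer that lacks ngram_range no longer raises AttributeError at that line. Whether the fallback gives sensible highlighting would depend on how the rest of the function tokenizes the document.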