MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Should raise an Exception when tokenizer is not defined #1977

Open timo-obrecht opened 6 months ago

timo-obrecht commented 6 months ago

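In bertopic/representation/_utils.py, line 57, tokenizer may be None while doc_length is set. In that case none of the branches below assigns truncated_document, so the function fails on the return statement with an unhelpful error. Instead, an exception asking the user to explicitly set tokenizer should be raised.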

    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            truncated_document = " ".join(document.split()[:doc_length])
        elif tokenizer == "vectorizer":
            tokenizer = topic_model.vectorizer_model.build_tokenizer()
            truncated_document = " ".join(tokenizer(document)[:doc_length])
        elif hasattr(tokenizer, 'encode') and hasattr(tokenizer, 'decode'):
            encoded_document = tokenizer.encode(document)
            truncated_document = tokenizer.decode(encoded_document[:doc_length])
        return truncated_document
    return document
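
To make the failure mode concrete, here is a small stand-alone sketch that mirrors the branch structure of the snippet above. It is not the BERTopic function itself: the name truncate_document_unguarded and its signature are hypothetical, and only the "char" and "whitespace" branches are reproduced.

```python
def truncate_document_unguarded(document, doc_length, tokenizer=None):
    """Hypothetical stand-in that mirrors the branch structure quoted above."""
    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            truncated_document = " ".join(document.split()[:doc_length])
        # No else branch: with tokenizer=None nothing has been assigned yet,
        # so the return below fails.
        return truncated_document
    return document


try:
    truncate_document_unguarded("a b c d", doc_length=2, tokenizer=None)
except UnboundLocalError as error:
    # Python complains about the local variable, not about the tokenizer.
    failure = error
    print(type(error).__name__)
```

The error names the internal variable and gives the user no hint that the tokenizer argument is what needs to be set.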

MaartenGr commented 6 months ago

Thanks for this! It would indeed be much nicer to raise an exception and explain what the user should do. If you want, a PR would be much appreciated.
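
One way such a check could look is sketched below. This is only an illustration under assumptions, not the change made in the library or in any PR: the function name truncate_document_checked, its signature, the vectorizer_tokenizer parameter (a plain callable standing in for topic_model.vectorizer_model.build_tokenizer()), and the error message wording are all hypothetical.

```python
def truncate_document_checked(document, doc_length, tokenizer=None,
                              vectorizer_tokenizer=None):
    """Truncate a document, raising a clear error for an unusable tokenizer.

    Hypothetical sketch: `vectorizer_tokenizer` is an assumed callable that
    splits a string into a list of tokens.
    """
    if doc_length is None:
        return document
    if tokenizer == "char":
        return document[:doc_length]
    if tokenizer == "whitespace":
        return " ".join(document.split()[:doc_length])
    if tokenizer == "vectorizer" and vectorizer_tokenizer is not None:
        return " ".join(vectorizer_tokenizer(document)[:doc_length])
    if hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
        encoded_document = tokenizer.encode(document)
        return tokenizer.decode(encoded_document[:doc_length])
    # Nothing matched: tell the user what to change instead of failing later.
    raise ValueError(
        "doc_length was set, but tokenizer is missing or not supported. "
        "Set tokenizer to 'char', 'whitespace', 'vectorizer', or an object "
        "with encode and decode methods."
    )


try:
    truncate_document_checked("a b c d", doc_length=2, tokenizer=None)
except ValueError as error:
    message = str(error)
    print(message)
```

Returning early from each branch means there is no variable left unassigned, and the final raise covers both None and any unrecognized value.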

SSivakumar12 commented 1 month ago

Hi, I believe I have a PR for this if that is okay: #2181