MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Fixed Issue: #1977 #2181

SSivakumar12 opened 1 month ago

SSivakumar12 commented 1 month ago

First of all, I want to take the opportunity to thank Maarten and all contributors, both past and present, for the awesome work. This is a really good tool that we have used at work, and it has given us a lot of meaningful insights, so I figured I could do my bit to contribute to a tool that has helped me before. :)

This is my first PR and I sincerely hope it is not my last, especially within this repo, since I really want to give back to this project.

What does this PR do?

Addresses the issue raised in #1977, which I believe has not been resolved yet.

Fixes # (issue)

I have added an edge case so that if the tokenizer doesn't match any of the expected patterns, a ValueError is raised with a constructive error message, hopefully preventing a repeat of the error. I did not update the documentation or add any tests. That said, I am more than happy to add tests if deemed appropriate/valuable.


MaartenGr commented 1 month ago

Thank you for the PR! Is this exception raised when the documents are actually truncated, or when you instantiate the related LLM? If it is the former, you would only get the error after the clustering has been done, which seems like wasted resources for the user. Would it perhaps be an idea to raise this error earlier?

SSivakumar12 commented 1 month ago

Good point! In my case right now, the error would be raised when I instantiate the relevant LLM. I will try to find a suitable place to raise the error earlier and amend accordingly.

SSivakumar12 commented 1 month ago

Looking at an example implementation for a representation class such as LlamaCPP, is the purpose of truncate_document to prepare the prompt so that it can be sent to the LLM?

I might be wrong, so I do apologise, but wouldn't this mean the documents are truncated at the representation model rather than at clustering? This means we wouldn't be able to surface the issue until after clustering anyway. Therefore, would this mean the current location of the check is reasonable?
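For context, here is a minimal sketch of the truncate_document logic under discussion, reconstructed from the tokenizer options named in this thread rather than copied verbatim from bertopic/representation/_utils.py; the final else branch marks where this PR's new ValueError sits:

def truncate_document(topic_model, doc_length, tokenizer, document):
    # Prepare a document for the prompt by cutting it down to `doc_length` units.
    if doc_length is None:
        return document
    if tokenizer == "char":
        return document[:doc_length]
    elif tokenizer == "whitespace":
        return " ".join(document.split()[:doc_length])
    elif tokenizer == "vectorizer":
        tokenize = topic_model.vectorizer_model.build_tokenizer()
        return " ".join(tokenize(document)[:doc_length])
    elif hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
        tokens = tokenizer.encode(document)
        return tokenizer.decode(tokens[:doc_length])
    else:
        # The edge case added in this PR: fail loudly instead of falling through.
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
        )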

MaartenGr commented 1 month ago

> Looking at an example implementation for a representation class such as LlamaCPP, is the purpose of truncate_document to prepare the prompt so that it can be sent to the LLM?

That is correct.

> I might be wrong, so I do apologise, but wouldn't this mean the documents are truncated at the representation model rather than at clustering? This means we wouldn't be able to surface the issue until after clustering anyway. Therefore, would this mean the current location of the check is reasonable?

I understand your reasoning. However, that would mean this very basic issue only gets flagged after most of the computation (and potentially hours of training) has already been done. Moreover, this is something we can already know when you initialize a given LLM. So, by moving this:

raise ValueError(
    "Please select from one of the valid options for the `tokenizer` parameter: \n"
    "{'char', 'whitespace', 'vectorizer'} \n"
    "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
)

to the __init__ of all LLM-based representations, we could do something like this:

if tokenizer is None and doc_length is not None:
    raise ValueError(
        "Please select from one of the valid options for the `tokenizer` parameter: \n"
        "{'char', 'whitespace', 'vectorizer'} \n"
        "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
    )

That will surface the error the moment you actually create the LLM, before any computation has been done, so users can adjust accordingly.
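For illustration, assuming a representation class along the lines of the LlamaCPP example mentioned above (the model path is a placeholder and the parameter names follow the thread, not necessarily the exact API), the failure then becomes immediate:

from bertopic import BERTopic
from bertopic.representation import LlamaCPP

# doc_length is set but no tokenizer is given: with the check in __init__,
# the ValueError is raised right here, at construction time, rather than
# hours later inside BERTopic(...).fit_transform(docs).
representation_model = LlamaCPP("path/to/model.gguf", doc_length=100)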

MaartenGr commented 3 weeks ago

@SSivakumar12 Could you create the error as a function that is imported from _utils.py? Right now we are duplicating the same code, which makes it hard to maintain.
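For example, the shared helper might look something like this; the function name is illustrative rather than a confirmed part of the codebase:

# bertopic/representation/_utils.py
def validate_truncate_document_parameters(tokenizer, doc_length):
    # Fail fast when `doc_length` is set but no usable `tokenizer` is given.
    if tokenizer is None and doc_length is not None:
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
        )

Each LLM-based representation's __init__ would then reduce to a single imported call:

from bertopic.representation._utils import validate_truncate_document_parameters

class SomeLLMRepresentation:  # stand-in for LlamaCPP, OpenAI, etc.
    def __init__(self, model, doc_length=None, tokenizer=None):
        self.model = model
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)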