MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars 765 forks source link

Align Logger Output with Documented Pipeline #2192

Open PipaFlores opened 1 month ago

PipaFlores commented 1 month ago

Feature request

Following some reflection based on a closed PR, I want to suggest a change in the logger.info messages to align with the documentation and modular pipeline of the library.

Context: In the current implementation, the code's logging verbosity reflects a simplified version of the pipeline:

Embeddings -> Dim. Red. -> Clustering -> Representation

However, in the documentation and associated research papers, the pipeline is presented with more detailed later stages:

Embeddings -> Dim. Red. -> Clustering -> Bag of Words (vectorizer + c-TF-IDF) -> Fine Tuning Representations

This mismatch between the actual implementation steps and the documented process can create confusion, especially for non-CS researchers (like myself) or contributors who rely on logging to understand the modularity and steps of the implementation.

Problem Description:

There’s ambiguity in the logging related to the term “Representation,” which might be interpreted differently depending on the pipeline step.

The pipeline section involving the c-TF-IDF is currently not logged at all, despite being a core part of the pipeline.

The logging output skips from:

BERTopic - Cluster- Clustering the reduced embeddings.
BERTopic - Cluster - Completed ✓
BERTopic - Representation - Extracting topics from clusters using representation models.
BERTopic - Representation - Completed ✓

Proposed Solution: To better reflect the detailed process, I propose to align it with the current documentation. A more detailed logging flow might look like:

Embeddings - Transforming documents to embeddings.
Dim. Red. - Fitting the dimensionality reduction algorithm.
Clusterc- Clustering the reduced embeddings.
Bag of Words - Calculating c-TF-IDF  (Vectorizer could be added here also)
Base Representation - Extracting top-n words per topic from c-TF-IDF
Fine Tuning Representation - Fine-tuning topics using representation models.
Fine Tuning Representation - Calculating different aspects.

Conceptual Clarification: It also comes to play that some concepts are still ambiguous an can be approached differently. For example, the definition of “topic” can be seen in two ways:

As such, the statement of "extracting topics from" is not a precise term, so I would avoid it in this new scheme, unless there is a clear cut point where topics are "defined".

Given that the c-TF-IDF step plays a foundational role in refining topic representations, I suggest treating it as a separate and essential component in the pipeline and reflecting this in the logger output.

Motivation

I want to improve this library, since I will probably be using during my research (already used it for a paper).

Your contribution

I can implement the solution, if agreed. I will probably start doing so for my own purposes, so a commit id will be available here for observation/comments when done.

MaartenGr commented 1 month ago

This mismatch between the actual implementation steps and the documented process can create confusion, especially for non-CS researchers (like myself) or contributors who rely on logging to understand the modularity and steps of the implementation.

Agreed. If things are unclear because I unnecessarily tried to simplify things, then we need to be explicit instead and more inline with the documentation.

Proposed Solution: To better reflect the detailed process, I propose to align it with the current documentation. A more detailed logging flow might look like:

Sounds good! I suggest we keep all logging before the representation steps and improve the logging at the representation level. I like separating the c-TF-IDF from additional representations (although c-TF-IDF can also be an additional representation/aspect).

Conceptual Clarification: It also comes to play that some concepts are still ambiguous an can be approached differently. For example, the definition of “topic” can be seen in two ways:

As the clustering output, where documents are grouped and assigned a topic ID (but there are no topic keywords defined). At the end of the c-TF-IDF step, where top-n-keywords provide a more concrete representation or “definition” of each topic.

I generally try to refer to a "topic" as the output of any representation, whether that is c-TF-IDF. Implicitly, that does contain the output of the clustering but is not sufficient to me. So technically, I would say at the end of the c-TF-IDF where I attempt to extract topics from clusters.

As such, the statement of "extracting topics from" is not a precise term, so I would avoid it in this new scheme, unless there is a clear cut point where topics are "defined".

But yeah, this definitely makes sense here since the definition of a "topic" has also shifted a bit with the introduction of additional representations, multimodal topic modeling, etc.

All in all, sounds good to me! Thank you for considering this and taking the time 😄

PipaFlores commented 3 weeks ago

So I made a draft now. The logging looks like

2024-10-28 10:18:43,031 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm 2024-10-28 10:18:54,653 - BERTopic - Dimensionality - Completed ✓ 2024-10-28 10:18:54,656 - BERTopic - Cluster - Start clustering the reduced embeddings 2024-10-28 10:18:54,669 - BERTopic - Cluster - Completed ✓ 2024-10-28 10:18:54,681 - BERTopic - Bag of Words - Updating Bag of Words for each topic 2024-10-28 10:18:54,682 - BERTopic - Bag of Words - Tokenization using vectorizer
2024-10-28 10:18:54,688 - BERTopic - Bag of Words - Setting multiplier for c-TF-IDF using seed words 2024-10-28 10:18:54,690 - BERTopic - Bag of Words - Calculating c-TF-IDF matrix 2024-10-28 10:18:54,706 - BERTopic - Representation - Fine-tuning topics using main representation model 2024-10-28 10:18:56,948 - BERTopic - Representation - Extracting additional topic aspects 2024-10-28 10:18:56,961 - BERTopic - Representation - Completed ✓

And it can get very verbose. For example, when reducing topics:

2024-10-28 10:18:57,046 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm 2024-10-28 10:18:59,178 - BERTopic - Dimensionality - Completed ✓ 2024-10-28 10:18:59,178 - BERTopic - Cluster - Start clustering the reduced embeddings 2024-10-28 10:18:59,190 - BERTopic - Cluster - Completed ✓ 2024-10-28 10:18:59,195 - BERTopic - Bag of Words - Updating Bag of Words for each topic 2024-10-28 10:18:59,196 - BERTopic - Bag of Words - Tokenization using vectorizer
2024-10-28 10:18:59,199 - BERTopic - Bag of Words - Setting multiplier for c-TF-IDF using seed words 2024-10-28 10:18:59,202 - BERTopic - Bag of Words - Calculating c-TF-IDF matrix 2024-10-28 10:18:59,216 - BERTopic - Representation - Extracting top words from c_tf_idf 2024-10-28 10:18:59,233 - BERTopic - Representation - Completed ✓ 2024-10-28 10:18:59,234 - BERTopic - Topic reduction - Reducing number of topics 2024-10-28 10:18:59,242 - BERTopic - Topic reduction - Auto-reduced number of topics from 22 to 3 2024-10-28 10:18:59,253 - BERTopic - Bag of Words - Updating Bag of Words for each topic 2024-10-28 10:18:59,254 - BERTopic - Bag of Words - Tokenization using vectorizer 2024-10-28 10:18:59,258 - BERTopic - Bag of Words - Setting multiplier for c-TF-IDF using seed words 2024-10-28 10:18:59,261 - BERTopic - Bag of Words - Calculating c-TF-IDF matrix 2024-10-28 10:18:59,277 - BERTopic - Representation - Fine-tuning topics using main representation model 2024-10-28 10:19:00,352 - BERTopic - Representation - Extracting additional topic aspects 2024-10-28 10:19:00,355 - BERTopic - Representation - Completed ✓

This may be an issue for you but for me is very handy, otherwise I would just set verbose to false. The logger is also aligned more closely to where the process is actually happening in the code. I did not modify the earlier stages though (embeddings, dim. red, and clustering)

MaartenGr commented 3 weeks ago

@pipa666 Thanks! There are a couple of things that I would prefer to be removed from the logging in order to prevent too much logging. I want to make sure there is a balance so that users also do not get information overload.

Let's start with this one:

2024-10-28 10:18:43,031 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-28 10:18:54,653 - BERTopic - Dimensionality - Completed ✓
2024-10-28 10:18:54,656 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-28 10:18:54,669 - BERTopic - Cluster - Completed ✓
2024-10-28 10:18:54,681 - BERTopic - Bag of Words - Updating Bag of Words for each topic
2024-10-28 10:18:54,682 - BERTopic - Bag of Words - Tokenization using vectorizer
2024-10-28 10:18:54,688 - BERTopic - Bag of Words - Setting multiplier for c-TF-IDF using seed words
2024-10-28 10:18:54,690 - BERTopic - Bag of Words - Calculating c-TF-IDF matrix
2024-10-28 10:18:54,706 - BERTopic - Representation - Fine-tuning topics using main representation model
2024-10-28 10:18:56,948 - BERTopic - Representation - Extracting additional topic aspects
2024-10-28 10:18:56,961 - BERTopic - Representation - Completed ✓

I would prefer to combine the first two Bag of Words loggers. Simply stating that the vectorizer is updated seems enough to me as there is only _preprocess_text between them. Also, and I believe you mentioned this, but since I'm using vectorizer_model, perhaps using Vectorizer instead of Bag of Words would make sense.

Moreover, I think I would prefer removing the "Setting multiplier for c-TF-IDF" since that is an almost instant procedure.

As a result, it would look like this:

2024-10-28 10:18:43,031 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-28 10:18:54,653 - BERTopic - Dimensionality - Completed ✓
2024-10-28 10:18:54,656 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-28 10:18:54,669 - BERTopic - Cluster - Completed ✓
2024-10-28 10:18:54,681 - BERTopic - Vectorizer - Calculating Bag-of-Words
2024-10-28 10:18:54,690 - BERTopic - c-TF-IDF - Calculating c-TF-IDF matrix
2024-10-28 10:18:54,706 - BERTopic - Representation - Fine-tuning topics using main representation model
2024-10-28 10:18:56,948 - BERTopic - Representation - Extracting additional topic aspects
2024-10-28 10:18:56,961 - BERTopic - Representation - Completed ✓

I'm also wondering whether we should add something like:

2024-10-28 10:18:54,681 - BERTopic - Vectorizer - Calculating Bag-of-Words
2024-10-28 10:18:54,669 - BERTopic - Vectorizer - Completed ✓
2024-10-28 10:18:54,690 - BERTopic - c-TF-IDF - Calculating c-TF-IDF matrix
2024-10-28 10:18:54,669 - BERTopic - c-TF-IDF - Completed ✓

but that, in turn, makes the logging quite a lot...

What do you think?

PipaFlores commented 3 weeks ago

I would prefer to combine the first two Bag of Words loggers. Simply stating that the vectorizer is updated seems enough to me as there is only _preprocess_text between them.

I agree, indeed I had some struggles on the vectorizer part, as its very simple and I dont think many people customize that part (I really don't know). Your suggestion makes more sense, since it aligns with the docs.

Regarding the "Setting multiplier for c-TF-IDF". I would insist in keeping it. In reality, this should not appear unless seed words are defined, and in iterative experiments it might be important to remember that this option is being used (as for most options that deviate from the base model). But maybe you see things different, ultimately is your decision as I can keep my logging preferences locally.

Regarding the Completed ✓ statements. I would include them as in your example. However I am biased towards more logging, so I don't see the overlogging as an issue.

On the other side, if we want to reduce the logging, then I would suggest removing these statements from everywhere. I assume that if any process is not completed, then a python error will raise, so the statements are a bit unnecessary. I like their aesthetic though.

MaartenGr commented 2 weeks ago

Regarding the "Setting multiplier for c-TF-IDF" . I would insist in keeping it. In reality, this should not appear unless seed words are defined, and in iterative experiments it might be important to remember that this option is being used (as for most options that deviate from the base model). But maybe you see things different, ultimately is your decision as I can keep my logging preferences locally.

Let's remove that part. I feel like this is something most users would not want.

Regarding the Completed ✓ statements. I would include them as in your example. However I am biased towards more logging, so I don't see the overlogging as an issue.

Let's skip this for now. I might remove the completed statements at some point since you can already understand how long something took by the additional logging that will now be added.

On the other side, if we want to reduce the logging, then I would suggest removing these statements from everywhere. I assume that if any process is not completed, then a python error will raise, so the statements are a bit unnecessary. I like their aesthetic though.

I had the same idea. I like the aesthetic and it helps break up individual pieces. I would say, let's leave them as is and not add more.