MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Saving parameters and results to a log file #1760

Open buscon opened 9 months ago

buscon commented 9 months ago

I found it useful to save the parameters and results to a log file.

I extended the BERTopic class to fit my needs; you can have a look here: https://github.com/buscon/fg_analysis_with_BERT/blob/main/classes/custom_log_bertopic.py It is a bare-bones implementation that should be extended and refined.

If you, @MaartenGr, are interested in integrating such a feature into BERTopic, I can fork the repository, implement it within the BERTopic structure, and open a PR.

MaartenGr commented 9 months ago

Thanks for sharing this! At the moment, something like this is not yet on my internal roadmap due to the limited use case. Having said that, if there is enough interest from other users it definitely could be implemented. I'll leave this open to see the reaction of others.

zilch42 commented 9 months ago

Thanks @buscon for sharing your implementation. It was really interesting to read. I have some interest in logging too, so I thought I would add my 2 cents.

I'm not sure it's entirely necessary to add this to BERTopic. The one line that I think does benefit from being in the class is this one:

        self.logger.info(f"Initialized BERTopic with parameters: {args}, {kwargs}")

I am curious to know what it prints when you provide, say, your own UMAP model to BERTopic. Does it just give you the UMAP args and kwargs?

So maybe it would be worth adding that to the existing logging in BERTopic, but the other lines could just as easily go in your main script and may be pretty user specific.

        # Logging the end of the method and results
        self.logger.info("Completed fit_transform method")
        self.logger.info(f"Topics: {predictions}")
        self.logger.info(f"Topics: {self.get_topic_info()}")
        self.logger.info(f"Topic Names: {self.get_topic_info().Name}")
        self.logger.info(f"Probabilities: {self.probabilities_}")

I personally wouldn't really need to log the probabilities, and I'm not sure about logging the entire topic_info table, especially when I have hundreds of topics.
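One way to keep such a log short is to record a compact summary rather than the full table. The following is only a sketch under assumptions: `summarize_topics` and `example_topics` are hypothetical names, the input is taken to be the list of topic ids returned by `fit_transform`, and topic `-1` is treated as the outlier topic, following BERTopic's convention.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def summarize_topics(topics, top_n=5):
    """Return a one-line summary of topic assignments (hypothetical helper).

    `topics` holds one topic id per document; -1 is counted as outliers.
    """
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # remove outliers from the topic counts
    largest = counts.most_common(top_n)  # (topic id, number of documents)
    return (
        f"{len(counts)} topics, {n_outliers} outlier documents, "
        f"largest (topic, size): {largest}"
    )


# Made-up assignments standing in for the output of fit_transform
example_topics = [0, 0, 1, -1, 2, 0, 1, -1]
summary = summarize_topics(example_topics, top_n=2)
logger.info("Topic summary: %s", summary)
```

With hundreds of topics this writes a single line per run, and `top_n` controls how much detail ends up in the file.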

What I would like to know @MaartenGr is how to get the builtin BERTopic logger to output to file. I like the verbose logging because it records how long different stages are taking (I wouldn't mind a time for the CountVectorizer/CTFIDF stages btw, they take longer than UMAP with large lots of text), but I haven't been able to get it to output to file.

I use a function like the following to set up my logging. I get my own logs and those from sentence_transformers in debug.log but nothing from BERTopic, even though it does log to the console. Adding a FileHandler doesn't seem to work.

import logging
import sys

def setup_logging():
    """enable logging from internal modules and common packages"""

    logging.basicConfig(
        format="%(asctime)s [%(levelname)s] [%(module)s] %(message)s",
        handlers=[
            logging.FileHandler("debug.log", mode="w"),
            logging.StreamHandler(sys.stdout)
        ]
    )

    logging.getLogger().setLevel(logging.WARNING)
    logger = logging.getLogger("Notebook")
    logger.setLevel(logging.INFO)

    logging.getLogger("BERTopic").setLevel(logging.INFO)
    logging.getLogger("BERTopic").addHandler(logging.FileHandler("debug.log", mode="a"))
    logging.getLogger("sentence_transformers").setLevel(logging.INFO)
    logging.getLogger("sentence_transformers").addHandler(logging.FileHandler("debug.log", mode="a"))

    return logger

logger = setup_logging()
logger.info("Starting clustering pipeline...")
...

If we can get the built-in BERTopic logs written to a file successfully, then I think you can more or less leave it up to the user to log whatever else they want.

MaartenGr commented 9 months ago

> What I would like to know @MaartenGr is how to get the builtin BERTopic logger to output to file. I like the verbose logging because it records how long different stages are taking (I wouldn't mind a time for the CountVectorizer/CTFIDF stages btw, they take longer than UMAP with large lots of text), but I haven't been able to get it to output to file.

Sure, that should be relatively straightforward to implement (with respect to the CV/cTFIDF stages). I would have to check specifics but it should be possible to automatically output the logs of the model. However, I don't think it is the nicest experience for the user to automatically have logs created whenever they run BERTopic. Some sort of manual selection would be preferred here. What holds me back here is that the number of parameters is starting to get bigger and bigger to the point where it hurts user experience (as I already have experienced with the current set). So adding another parameter for something that is not core-functionality is something I want to prevent as much as possible.

buscon commented 9 months ago

> > What I would like to know @MaartenGr is how to get the builtin BERTopic logger to output to file. I like the verbose logging because it records how long different stages are taking (I wouldn't mind a time for the CountVectorizer/CTFIDF stages btw, they take longer than UMAP with large lots of text), but I haven't been able to get it to output to file.

> Sure, that should be relatively straightforward to implement (with respect to the CV/cTFIDF stages). I would have to check specifics but it should be possible to automatically output the logs of the model. However, I don't think it is the nicest experience for the user to automatically have logs created whenever they run BERTopic. Some sort of manual selection would be preferred here. What holds me back here is that the number of parameters is starting to get bigger and bigger to the point where it hurts user experience (as I already have experienced with the current set). So adding another parameter for something that is not core-functionality is something I want to prevent as much as possible.

I totally understand your concern; I would not add it to the main BERTopic class either. But what about an extended class, something like BERTopicLogger? Or, alternatively, an extra method on the BERTopic class that does the logging? That way the user can use it only when they need it.
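The extended-class idea can be sketched roughly as follows. This is not BERTopic code: `StandInModel` and `StandInUMAP` are stand-ins so the example runs without BERTopic installed, and `LoggedModel` and the logger name are hypothetical. It also illustrates the earlier question about custom sub-models: an object passed as an argument is logged through its `repr()`, so what appears in the log depends on how that class defines `__repr__`.

```python
import logging

logger = logging.getLogger("topic_model_params")  # hypothetical logger name


class StandInModel:
    """Stand-in for a class such as BERTopic, so the sketch is self-contained."""

    def __init__(self, top_n_words=10, umap_model=None, verbose=False):
        self.top_n_words = top_n_words
        self.umap_model = umap_model
        self.verbose = verbose


class StandInUMAP:
    """Stand-in for a user-supplied sub-model."""

    def __init__(self, n_neighbors=15):
        self.n_neighbors = n_neighbors

    def __repr__(self):
        return f"StandInUMAP(n_neighbors={self.n_neighbors})"


class LoggedModel(StandInModel):
    """Hypothetical opt-in subclass that records its constructor arguments."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Arguments that are objects are rendered through their repr()
        self.init_message = f"Initialized with parameters: {args}, {kwargs}"
        logger.info(self.init_message)


model = LoggedModel(top_n_words=5, umap_model=StandInUMAP(n_neighbors=17))
```

Because the subclass is separate from the base class, users who do not want the extra logging never have to touch it.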

> I'm not sure it's entirely necessary to add this to BERTopic. The one line that I think does benefit from being in the class is this one:

I agree with you, the most useful thing to log is the BERTopic parameters. On the other hand, it would be good to let the user add other outputs to the log through parameters. I have not done that in my current class yet.

zilch42 commented 9 months ago

Sorry for the late reply.

> However, I don't think it is the nicest experience for the user to automatically have logs created whenever they run BERTopic.

> What holds me back here is that the number of parameters is starting to get bigger and bigger to the point where it hurts user experience (as I already have experienced with the current set). So adding another parameter for something that is not core-functionality is something I want to prevent as much as possible.

Yes, totally agree. Sorry, I wasn't trying to suggest this be something that BERTopic automatically does, or be a feature you need to add. It's generally possible for a user to intercept and redirect the internal logger from a package by getting that internal logger and adding an extra handler to it. e.g.

logging.getLogger({package name}).addHandler({handler})
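Spelled out with the standard library only, the pattern might look like the sketch below. The logger name `"some_package"` is a placeholder for whatever name the library passes to `logging.getLogger()`, and the temporary file path is just for illustration.

```python
import logging
import os
import tempfile

# "some_package" is a placeholder for the library's own logger name
package_logger = logging.getLogger("some_package")
package_logger.setLevel(logging.INFO)

# Attach an extra handler that writes the package's records to a file
log_path = os.path.join(tempfile.mkdtemp(), "package.log")
file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)
package_logger.addHandler(file_handler)

# Simulate a message emitted from inside the package
logging.getLogger("some_package").info("Fitting the model")

file_handler.flush()
with open(log_path, encoding="utf-8") as f:
    log_text = f.read()
```

The package keeps its own console handler; the added handler only duplicates the records into the file.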

That's what I had been trying to do with BERTopic. I solved my own problem there today. I just had to add force=True when I ran logging.basicConfig() to reset the root logger (not exactly sure why, doesn't matter, all happy).
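A likely explanation, based on the documented behavior of `logging.basicConfig()`: the call does nothing if the root logger already has handlers, which can happen when a notebook environment or an imported library configured logging first. With `force=True` (available since Python 3.8), the existing root handlers are removed before the new configuration is applied. A minimal sketch, with in-memory streams standing in for the console and the log file:

```python
import io
import logging

root = logging.getLogger()

# Simulate an environment that attached a root handler before our setup ran
first_stream = io.StringIO()
early_handler = logging.StreamHandler(first_stream)
root.addHandler(early_handler)

# Without force=True this call is ignored: the root logger already has a handler
second_stream = io.StringIO()
logging.basicConfig(stream=second_stream, level=logging.INFO)
handlers_without_force = list(root.handlers)

# With force=True the existing root handlers are removed and replaced
logging.basicConfig(stream=second_stream, level=logging.INFO, force=True)
handlers_with_force = list(root.handlers)

logging.getLogger("demo").info("hello")
```

After the second call, the message reaches only `second_stream`; `early_handler` is no longer attached to the root logger.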

@buscon I think there is a better way to log those parameters than building it into BERTopic as a feature, one that gives you (and other users) more control and does not add to the feature space. In the example you linked to, you already keep your UMAP and HDBSCAN parameters in a config file, which is great. If you add the BERTopic hyperparameters to it, they become easy both to log yourself and to pass to the required functions. E.g.

from umap import UMAP 
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import logging
import sys

# set general logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("my_log_file.log", mode="w"),
        logging.StreamHandler(sys.stdout)
    ],
    force=True
)

logging.getLogger().setLevel(logging.WARNING)
logger = logging.getLogger("Notebook")
logger.setLevel(logging.INFO)

# redirect logging for BERTopic
handler = logging.FileHandler("my_log_file.log", mode="a")
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logging.getLogger("BERTopic").setLevel(logging.INFO)
logging.getLogger("BERTopic").addHandler(handler)

# you can substitute this with your config.ini file
config = {
    'UMAP': {
        'n_neighbors': 17, 
        'n_components': 3, 
        'min_dist': 0.0
    },
    'BERTopic': {
        'top_n_words': 10, 
        'verbose': True, 
        'calculate_probabilities': True,
    }
    # etc...
}

logger.info(f"UMAP parameters: {config['UMAP']}")
logger.info(f"BERTopic hyper parameters: {config['BERTopic']}")

topic_model = BERTopic(
    umap_model=UMAP(**config['UMAP']),
    # etc...
    **config['BERTopic']
)

# get newsgroups data
data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
docs = data['data'][0:500]

topics, _ = topic_model.fit_transform(docs)

# log the probs, topics, whatever else you want

This results in a log file that looks like:

2024-01-29 13:38:11,887 - INFO - UMAP parameters: {'n_neighbors': 17, 'n_components': 3, 'min_dist': 0.0}
2024-01-29 13:38:11,888 - INFO - BERTopic hyper parameters: {'top_n_words': 10, 'verbose': True, 'calculate_probabilities': True}
2024-01-29 13:38:13,026 - INFO - Embedding - Transforming documents to embeddings.
2024-01-29 13:38:16,541 - INFO - Embedding - Completed ✓
2024-01-29 13:38:16,542 - INFO - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-29 13:38:21,283 - INFO - Dimensionality - Completed ✓
2024-01-29 13:38:21,285 - INFO - Cluster - Start clustering the reduced embeddings
2024-01-29 13:38:21,319 - INFO - Cluster - Completed ✓
2024-01-29 13:38:21,322 - INFO - Representation - Extracting topics from clusters using representation models.
2024-01-29 13:38:21,412 - INFO - Representation - Completed ✓

By keeping control of the logging on your side, you can tailor it to your use case and what you are actually interested in.
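For anyone who does keep these settings in a `config.ini`, the dictionary in the script above could be built roughly as in this sketch. The file contents and the `load_config` helper are hypothetical; the section and key names simply mirror the dictionary shown earlier.

```python
import configparser
from ast import literal_eval

# Contents of a hypothetical config.ini, inlined so the sketch is self-contained
CONFIG_TEXT = """
[UMAP]
n_neighbors = 17
n_components = 3
min_dist = 0.0

[BERTopic]
top_n_words = 10
verbose = True
calculate_probabilities = True
"""


def load_config(text):
    """Parse INI text into a nested dict with Python-typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # configparser stores every value as a string; literal_eval turns
    # "17", "0.0" and "True" back into an int, a float and a bool
    return {
        section: {key: literal_eval(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


config = load_config(CONFIG_TEXT)
```

Note that `literal_eval` raises an error on unquoted strings such as `metric = cosine`, so string values would need quotes in the file or a fallback in the helper.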