MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Online topic modeling vs Training on a subset for large datasets #1042

Closed lila-97 closed 1 year ago

lila-97 commented 1 year ago

Hi Maarten, thank you as usual for your invaluable help.

I tried using online topic modeling on my dataset of 2 million tweets. Unfortunately, I believe using MiniBatchKMeans creates some problems, as I don't know how many clusters the model should actually be looking for. I've read that there is no way to use UMAP and HDBSCAN with online topic modeling, so I was wondering: which do you think is the best option for modeling the topics of such a large dataset? Is it better to use online topic modeling and try to find a value of k that reflects the characteristics of the dataset, or does it make more sense to just train the model on a subset of the data and then use it to predict topics on the rest of the dataset?

One additional piece of information: this project is intended to study one of the two classes in this dataset (pre-labeled through another analysis). If I decided to train the model on a subset of my data, would it make more sense to have a balanced subset between the two classes or to skew it heavily towards my class of interest? How would this affect the discovery of topics in the class that is NOT of interest?

MaartenGr commented 1 year ago

which do you think is the best option for modeling the topics of such a large dataset? Is it better to use online topic modeling and try to find a value of k that reflects the characteristics of the dataset, or does it make more sense to just train the model on a subset of the data and then use it to predict topics on the rest of the dataset?

Neither is necessarily better than the other, but there are a couple of things you can try out.

First, you can sample a representative subset of the data, if possible, and apply UMAP + HDBSCAN to it to get a feel for the number of large topics in your data. There is a good chance that micro-clusters from the entire dataset will be ignored, as it is highly unlikely that you can sample them all. Then, you can either transform all other documents with that model or, if you have figured out a good k value, apply online topic modeling to fit on the entire dataset.
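As a rough sketch of that first approach (everything here is illustrative; docs stands in for the full list of tweets):

import random

from bertopic import BERTopic

# Hypothetical setup: docs holds the full list of ~2M tweets
sample_docs = random.sample(docs, 300_000)

# Fit on the representative sample using the default UMAP + HDBSCAN pipeline
topic_model = BERTopic()
topics_sample, probs_sample = topic_model.fit_transform(sample_docs)

# Assign topics to the remaining documents without re-training
sampled = set(sample_docs)
remaining_docs = [doc for doc in docs if doc not in sampled]
topics_rest, probs_rest = topic_model.transform(remaining_docs)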

Second, there are GPU-accelerated versions of UMAP and HDBSCAN that you can use, which should allow you to train on the entire dataset or, at the very least, a large subset of the data (e.g., 500,000 documents or even a million). You can find more about GPU acceleration here.
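For example, a sketch of how the GPU-accelerated sub-models can be plugged in (this assumes RAPIDS cuML is installed; the parameters shown are illustrative):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated UMAP and HDBSCAN from cuML
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the GPU sub-models to BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)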

Third, you can look into using River for your clustering algorithm, as it allows for online learning and has implemented algorithms that do not need a k value specified.
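A sketch of what that could look like, following the online topic modeling section of the documentation (DBSTREAM is one River algorithm that does not need k specified up front):

from bertopic import BERTopic
from river import cluster, stream

class River:
    # Thin wrapper that gives a River clustering model the .partial_fit /
    # .labels_ interface that BERTopic expects
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Learn the new batch one sample at a time
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model.learn_one(umap_embedding)

        # Predict a cluster label for every sample in the batch
        self.labels_ = [self.model.predict_one(umap_embedding)
                        for umap_embedding, _ in stream.iter_array(umap_embeddings)]
        return self

cluster_model = River(cluster.DBSTREAM())
topic_model = BERTopic(hdbscan_model=cluster_model)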

lila-97 commented 1 year ago

Thank you! I've tried the first method. My idea was to split my dataset according to time and train two different models (since I assume topics will be different between 2010-2015 and 2016-2022, especially with Trumpism coming into the picture). Now I am transforming the unsampled 2016-2022 tweets using the model trained on a representative sample. However, I don't understand how to use .get_topic_info() for the transformed documents. Since my final objective is to have a topic number associated with each tweet, I need this also for the docs that are only transformed, but if I try to use it, it says "lengths don't match"... I assume this is because the embeddings have a different size from the ones to which I applied .fit_transform(), but I don't know how to solve it. Can you help?

MaartenGr commented 1 year ago

I've tried the first method. My idea was to split my dataset according to time and train two different models (since I assume topics will be different between 2010-2015 and 2016-2022, especially with Trumpism coming into the picture).

You could also train a single model that includes both splits. If the topics are entirely different from one another then the model should find them.

Now I am transforming the unsampled 2016-2022 tweets using the model trained on a representative sample. However, I don't understand how to use .get_topic_info() for the transformed documents. Since my final objective is to have a topic number associated with each tweet, I need this also for the docs that are only transformed, but if I try to use it, it says "lengths don't match"... I assume this is because the embeddings have a different size from the ones to which I applied .fit_transform(), but I don't know how to solve it. Can you help?

Could you share your entire code for doing this? It is difficult to say what exactly is happening without seeing the code.

lila-97 commented 1 year ago

Hi Maarten, I slightly changed my strategy in the meantime. I still don't understand what the correspondence is between docs and topics when using .transform() instead of .fit_transform() on my data.

I would like to end up with a version of .visualize_documents() that is separate for each class, in order to see any differences for my topics of interest. I already explored .visualize_topics_per_class(), but it does not give me a clear idea of how the topics cluster in each class, what the documents look like, etc.

I trained a BERTopic model called my_model on the whole representative subset (around 300k tweets), which has 0/1 labels on each tweet (0 = not populist, 1 = populist). The reason for not training two different models is that I only have 50k tweets for class 1, and it seemed wrong to isolate them when creating the topic representation (not sure this is actually the case).

Then, after training, I inspected the model and the representation was satisfactory across each class. My idea for a workaround for the visualization is:

  1. Separate the dataframe by class

# separate dataframe by class
pop_df = new_df[new_df['V8_Bin'] == 1]
nopop_df = new_df[new_df['V8_Bin'] == 0]

pop_df = pop_df.reset_index(drop=True)
nopop_df = nopop_df.reset_index(drop=True)

  2. Create new embeddings for each class and transform only the docs from that class using their new embeddings (showing it only for one class here)

# predict pop class with trained model
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_pop = sentence_model.encode(pop_df['text_bertopic'], show_progress_bar=True)

topics, probs = my_model.transform(pop_df['text_bertopic'], embeddings_pop)

  3. Visualize each class' documents separately, highlighting topics of interest

Unfortunately, I can't seem to understand how to do step 3, for the same reasons I wasn't able to do what I asked before; I don't understand if, when using .transform(), we need to: regenerate embeddings so they match the new length of docs? or not? update the model with the new topics? anything else?

I already tried subsetting the docs and embeddings using indexing (you told me in another issue that the indexes match). However, it still gave me a KeyError/IndexError when trying to run any visualization or method from BERTopic.

I am struggling with this specific task, but I believe understanding this issue better will definitely help with many future tasks. So thank you very much for your patience in replying, it has been a massive help!

lila-97 commented 1 year ago

To be even clearer: I know how to access doc/topic info (code pasted below), but when using the methods included in the library there seem to be matching problems with lengths all the time, and I'm not quite sure how to avoid constantly running into them...

# visualize results

pop_df_topic = pd.DataFrame()
pop_df_topic['tweet'] = pop_df['text_bertopic']
pop_df_topic['topic'] = pd.Series(topics)
pop_df_topic['prob'] = pd.Series(probs)

pop_df_topic.head(30)

MaartenGr commented 1 year ago

Hi Maarten, I slightly changed my strategy in the meantime. I still don't understand what the correspondence is between docs and topics when using .transform() instead of .fit_transform() on my data.

You use .fit_transform on all of your data to train the model and get the corresponding topics for each of your data points. .transform is used to get the topics for documents that the model was not trained on.
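The index correspondence is per call: the i-th topic returned belongs to the i-th document passed to that call. A minimal sketch (with hypothetical sampled_docs and unseen_docs):

# Train on the sampled documents; topics_train[i] belongs to sampled_docs[i]
topics_train, probs_train = topic_model.fit_transform(sampled_docs)

# Predict topics for unseen documents; topics_new[i] belongs to unseen_docs[i]
topics_new, probs_new = topic_model.transform(unseen_docs)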

Unfortunately, I can't seem to understand how to do step 3, for the same reasons I wasn't able to do what I asked before; I don't understand if, when using .transform(), we need to: regenerate embeddings so they match the new length of docs? or not? update the model with the new topics? anything else?

If you want to create a .visualize_documents per class, you will need to do the following:

  1. Encode, using the embedding model, the documents per class
  2. For each class, use .transform to get the topics for all documents in each class
  3. Adjust the code here such that a custom list of documents and topics can be passed

To be even clearer: I know how to access doc/topic info (code pasted below), but when using the methods included in the library there seem to be matching problems with lengths all the time, and I'm not quite sure how to avoid constantly running into them...

It is difficult to say without seeing your full and complete pipeline, but in general the visualizations are meant for the documents the model was trained on. If you want to visualize unseen documents, then you would have to adjust the visualizations so that those are taken into account.

In other words, change the following line:

https://github.com/MaartenGr/BERTopic/blob/d665d3f8d8c7c1736dc82b1df8839ced56a2adb6/bertopic/plotting/_documents.py#L90

so that it can be passed as a parameter instead:

import numpy as np
import pandas as pd
import plotly.graph_objects as go

from umap import UMAP
from typing import List

def visualize_documents(topic_model,
                        docs: List[str],
                        topic_per_doc,
                        topics: List[int] = None,
                        embeddings: np.ndarray = None,
                        reduced_embeddings: np.ndarray = None,
                        sample: float = None,
                        hide_annotations: bool = False,
                        hide_document_hover: bool = False,
                        custom_labels: bool = False,
                        title: str = "<b>Documents and Topics</b>",
                        width: int = 1200,
                        height: int = 750):

    # Sample the data to optimize for visualization and dimensionality reduction
    if sample is None or sample > 1:
        sample = 1

    indices = []
    for topic in set(topic_per_doc):
        s = np.where(np.array(topic_per_doc) == topic)[0]
        size = len(s) if len(s) < 100 else int(len(s) * sample)
        indices.extend(np.random.choice(s, size=size, replace=False))
    indices = np.array(indices)

    df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
    df["doc"] = [docs[index] for index in indices]
    df["topic"] = [topic_per_doc[index] for index in indices]

    # Extract embeddings if not already done
    if sample is None:
        if embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
        else:
            embeddings_to_reduce = embeddings
    else:
        if embeddings is not None:
            embeddings_to_reduce = embeddings[indices]
        elif embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")

    # Reduce input embeddings
    if reduced_embeddings is None:
        umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit(embeddings_to_reduce)
        embeddings_2d = umap_model.embedding_
    elif sample is not None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings[indices]
    elif sample is None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings

    unique_topics = set(topic_per_doc)
    if topics is None:
        topics = unique_topics

    # Combine data
    df["x"] = embeddings_2d[:, 0]
    df["y"] = embeddings_2d[:, 1]

    # Prepare text and names
    if topic_model.custom_labels_ is not None and custom_labels:
        names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
    else:
        names = [f"{topic}_" + "_".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]

    # Visualize
    fig = go.Figure()

    # Outliers and non-selected topics
    non_selected_topics = set(unique_topics).difference(topics)
    if len(non_selected_topics) == 0:
        non_selected_topics = [-1]

    selection = df.loc[df.topic.isin(non_selected_topics), :]
    selection["text"] = ""
    selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), "Other documents"]

    fig.add_trace(
        go.Scattergl(
            x=selection.x,
            y=selection.y,
            hovertext=selection.doc if not hide_document_hover else None,
            hoverinfo="text",
            mode='markers+text',
            name="other",
            showlegend=False,
            marker=dict(color='#CFD8DC', size=5, opacity=0.5)
        )
    )

    # Selected topics
    for name, topic in zip(names, unique_topics):
        if topic in topics and topic != -1:
            selection = df.loc[df.topic == topic, :]
            selection["text"] = ""

            if not hide_annotations:
                selection.loc[len(selection), :] = [None, None, selection.x.mean(), selection.y.mean(), name]

            fig.add_trace(
                go.Scattergl(
                    x=selection.x,
                    y=selection.y,
                    hovertext=selection.doc if not hide_document_hover else None,
                    hoverinfo="text",
                    text=selection.text,
                    mode='markers+text',
                    name=name,
                    textfont=dict(
                        size=12,
                    ),
                    marker=dict(size=5, opacity=0.5)
                )
            )

    # Add grid in a 'plus' shape
    x_range = (df.x.min() - abs((df.x.min()) * .15), df.x.max() + abs((df.x.max()) * .15))
    y_range = (df.y.min() - abs((df.y.min()) * .15), df.y.max() + abs((df.y.max()) * .15))
    fig.add_shape(type="line",
                  x0=sum(x_range) / 2, y0=y_range[0], x1=sum(x_range) / 2, y1=y_range[1],
                  line=dict(color="#CFD8DC", width=2))
    fig.add_shape(type="line",
                  x0=x_range[0], y0=sum(y_range) / 2, x1=x_range[1], y1=sum(y_range) / 2,
                  line=dict(color="#9E9E9E", width=2))
    fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
    fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)

    # Stylize layout
    fig.update_layout(
        template="simple_white",
        title={
            'text': f"{title}",
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        width=width,
        height=height
    )

    return fig

Then, it would look something like this:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Data
subset_docs = ...   # Your 300k representative subset
docs_class_a = ...  # All documents in class A
docs_class_b = ...  # All documents in class B

# Embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_a = sentence_model.encode(docs_class_a)
embeddings_b = sentence_model.encode(docs_class_b)
embeddings_subset = sentence_model.encode(subset_docs)

# Training
topic_model = BERTopic().fit(subset_docs, embeddings_subset)
topics_a, _ = topic_model.transform(docs_class_a, embeddings_a)
topics_b, _ = topic_model.transform(docs_class_b, embeddings_b)

# Visualize results
visualize_documents(topic_model, docs_class_a, topics_a)
visualize_documents(topic_model, docs_class_b, topics_b)
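Since the adjusted visualize_documents also accepts an embeddings parameter, you could additionally pass embeddings_a and embeddings_b in these calls to avoid re-computing the embeddings inside the function.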

lila-97 commented 1 year ago

Thank you so much! The solution is a perfect workaround for what I had in mind.

I initially thought of having both classes in the same graph, but since I set n_components=5 in UMAP, I guess the distances in 2D would not have been particularly helpful anyway... From what I understood, lowering this parameter to 2 would improve the visualization of distances between data points, but it would hinder the clustering, as not all information can be maintained when pushing everything to 2D?

MaartenGr commented 1 year ago

The 2D distances do give some information about the distances in 5D, but it is indeed still an approximation. Personally, I would keep n_components at 5 and use the 2D visualizations for just that, visualization. It is a rough approximation of higher dimensionality, and I have seen many applications where the 2D is seen as a perfect representation of 5D or higher, which is seldom the case.
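For completeness, a sketch of that setup (docs and embeddings are hypothetical placeholders): keep n_components=5 inside the model and compute a separate 2D reduction only for plotting:

from bertopic import BERTopic
from umap import UMAP

# 5D reduction used inside BERTopic for clustering
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model).fit(docs, embeddings)

# Separate 2D reduction used only for visualization
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)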

lila-97 commented 1 year ago

Understood. Thanks again for your help, it's been crucial!