MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

HowTo / Feature #347

Closed aph61 closed 2 years ago

aph61 commented 2 years ago

You have two versions of BERTopic: the fully integrated class with a lot of additional methods, and the step-by-step approach in the colab "Topic with BERT.ipynb". I use the latter code to make an interactive 3D color image alongside the standard BERTopic().visualize_topics. You can't run the two separately, as the topic IDs will not match, and if you have very small topic sizes the topic contents won't match either. I don't want to lose the BERTopic methods (very handy) either. Is there a way to get the data (embeddings) after the intermediate steps SentenceEmbedding, UMAP, and HDBSCAN in BERTopic()? I couldn't find a method for it; where would I start implementing something like dump_umap, dump_word, dump_hdbscan?
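
For reference, the step-by-step pipeline in that notebook is roughly the following (a sketch, not the exact notebook code; the model name and parameters are only illustrative):

```python
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Step-by-step: embed the documents, reduce dimensionality, cluster
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, show_progress_bar=True)
umap_embeddings = umap.UMAP(n_neighbors=15, n_components=5,
                            min_dist=0.0, metric='cosine').fit_transform(embeddings)
clusters = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean',
                           cluster_selection_method='eom').fit(umap_embeddings)
```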

thanks

Andreas

"Topic with BERT.ipynb" : https://colab.research.google.com/drive/1-SOw0WHZ_ZXfNE36KUe3Z-UpAO3vdhGg?usp=sharing#scrollTo=b8MBLqisBezg)

MaartenGr commented 2 years ago

That is currently not possible within BERTopic. Those are intermediate results that are not saved as BERTopic would become a bit too big. Having said that, you can try to expose them as follows:

```python
from bertopic import BERTopic


class BERTopicNew(BERTopic):
    def fit_transform(self,
                      documents: List[str],
                      embeddings: np.ndarray = None,
                      y: Union[List[int], np.ndarray] = None) -> Tuple[List[int],
                                                                       Union[np.ndarray, None]]:
        """ Fit the models on a collection of documents, generate topics, and return the docs with topics

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Returns:
            predictions: Topic predictions for each documents
            probabilities: The probability of the assigned topic per document.
                           If `calculate_probabilities` in BERTopic is set to True, then
                           it calculates the probabilities of all topics across all documents
                           instead of only the assigned topic. This, however, slows down
                           computation and may increase memory usage.

        Usage:

            from bertopic import BERTopic
            from sklearn.datasets import fetch_20newsgroups

            docs = fetch_20newsgroups(subset='all')['data']
            topic_model = BERTopic()
            topics, probs = topic_model.fit_transform(docs)

        If you want to use your own embeddings, use it as follows:

            from bertopic import BERTopic
            from sklearn.datasets import fetch_20newsgroups
            from sentence_transformers import SentenceTransformer

            # Create embeddings
            docs = fetch_20newsgroups(subset='all')['data']
            sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
            embeddings = sentence_model.encode(docs, show_progress_bar=True)

            # Create topic model
            topic_model = BERTopic()
            topics, probs = topic_model.fit_transform(docs, embeddings)
        """
        check_documents_type(documents)
        check_embeddings_shape(embeddings, documents)

        documents = pd.DataFrame({"Document": documents,
                                  "ID": range(len(documents)),
                                  "Topic": None})

        # Extract embeddings
        if embeddings is None:
            self.embedding_model = select_backend(self.embedding_model,
                                                  language=self.language)
            embeddings = self._extract_embeddings(documents.Document,
                                                  method="document",
                                                  verbose=self.verbose)
            logger.info("Transformed documents to Embeddings")
        else:
            if self.embedding_model is not None:
                self.embedding_model = select_backend(self.embedding_model,
                                                      language=self.language)

        # Reduce dimensionality with UMAP
        if self.seed_topic_list is not None and self.embedding_model is not None:
            y, embeddings = self._guided_topic_modeling(embeddings)
        umap_embeddings = self._reduce_dimensionality(embeddings, y)

        # Cluster UMAP embeddings with HDBSCAN
        documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)

        # Sort and Map Topic IDs by their frequency
        if not self.nr_topics:
            documents = self._sort_mappings_by_frequency(documents)

        # Extract topics by calculating c-TF-IDF
        self._extract_topics(documents)

        # Reduce topics
        if self.nr_topics:
            documents = self._reduce_topics(documents)

        self._map_representative_docs(original_topics=True)
        probabilities = self._map_probabilities(probabilities, original_topics=True)
        predictions = documents.Topic.to_list()

        ###################################
        ### CHANGES MADE HERE FOR aph61 ###
        ###################################
        self.sentence_embeddings = embeddings
        self.umap_embeddings = umap_embeddings

        return predictions, probabilities
```

In the code above I created a new BERTopic class named `BERTopicNew` that you can use instead of BERTopic. At the bottom of the code, you can see that I assigned the embeddings you are looking for to `self.` variables as a way to expose them.

In practice, you can run it like this:

```python
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopicNew()
topics, probs = topic_model.fit_transform(docs)
```

and then access the embeddings through:

```python
topic_model.sentence_embeddings
```

and

```python
topic_model.umap_embeddings
```

The steps above take the BERTopic package and expose the sentence embeddings as well as the embeddings after they have been reduced in dimensionality through UMAP. Does this answer your question?

aph61 commented 2 years ago

Hi Maarten,

Thanks, I'll give it a shot (with a small dataset)

I'll keep you posted on its success

best,

Andreas


MaartenGr commented 2 years ago

Due to inactivity, I am going to close this issue. If it did not work with the small dataset, let me know! I'll reopen it and we can discuss it further.

aph61 commented 2 years ago

Hi Maarten

It took a while before I had time to work on it, but last November I asked if there is a way to obtain the sentence and UMAP embeddings from BERTopic (use case: show the results you get from the standard BERTopic in a 3D image).

You modified the code (see your comment above), but when running it I got the following error:


```
NameError                                 Traceback (most recent call last)
in
      5 from bertopic import BERTopic
      6
----> 7 class BERTopicNew(BERTopic):
      8     def fit_transform(self,
      9                       documents: List[str],

in BERTopicNew()
      7 class BERTopicNew(BERTopic):
      8     def fit_transform(self,
----> 9                       documents: List[str],
     10                       embeddings: np.ndarray = None,
     11                       y: Union[List[int], np.ndarray] = None) -> Tuple[List[int],

NameError: name 'List' is not defined
```

I guess it will also give an error for "Union". When I change "documents: List[str]," into "documents", things do not improve. I implemented the class in a separate cell in a Jupyter notebook.

The way I wish to use the code is as follows

```python
sentence_model = SentenceTransformer(language_model)
topic_model = BERTopicNew(embedding_model=sentence_model,
                          n_gram_range=(1, 1),
                          min_topic_size=min_topic_size,
                          calculate_probabilities=True,
                          verbose=True)
topic_model.sentence_embeddings()
topic_model.umap_embeddings()
```

How do I tackle this?

Thanks,

Andreas

MaartenGr commented 2 years ago

There are a bunch of imports that you also need to do in order to get the class working:

```python
import numpy as np
import pandas as pd
from typing import List, Union, Tuple
from bertopic.backend._utils import select_backend
from bertopic._utils import MyLogger, check_documents_type, check_embeddings_shape, check_is_fitted
from bertopic import BERTopic

logger = MyLogger("WARNING")  # needed because fit_transform calls logger.info(...)


class BERTopicNew(BERTopic):
    def fit_transform(self,
                      documents: List[str],
                      embeddings: np.ndarray = None,
                      y: Union[List[int], np.ndarray] = None) -> Tuple[List[int],
                                                                       Union[np.ndarray, None]]:
        """ Fit the models on a collection of documents, generate topics, and return the docs with topics

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Returns:
            predictions: Topic predictions for each documents
            probabilities: The probability of the assigned topic per document.
                           If `calculate_probabilities` in BERTopic is set to True, then
                           it calculates the probabilities of all topics across all documents
                           instead of only the assigned topic. This, however, slows down
                           computation and may increase memory usage.

        Usage:

            from bertopic import BERTopic
            from sklearn.datasets import fetch_20newsgroups

            docs = fetch_20newsgroups(subset='all')['data']
            topic_model = BERTopic()
            topics, probs = topic_model.fit_transform(docs)

        If you want to use your own embeddings, use it as follows:

            from bertopic import BERTopic
            from sklearn.datasets import fetch_20newsgroups
            from sentence_transformers import SentenceTransformer

            # Create embeddings
            docs = fetch_20newsgroups(subset='all')['data']
            sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
            embeddings = sentence_model.encode(docs, show_progress_bar=True)

            # Create topic model
            topic_model = BERTopic()
            topics, probs = topic_model.fit_transform(docs, embeddings)
        """
        check_documents_type(documents)
        check_embeddings_shape(embeddings, documents)

        documents = pd.DataFrame({"Document": documents,
                                  "ID": range(len(documents)),
                                  "Topic": None})

        # Extract embeddings
        if embeddings is None:
            self.embedding_model = select_backend(self.embedding_model,
                                                  language=self.language)
            embeddings = self._extract_embeddings(documents.Document,
                                                  method="document",
                                                  verbose=self.verbose)
            logger.info("Transformed documents to Embeddings")
        else:
            if self.embedding_model is not None:
                self.embedding_model = select_backend(self.embedding_model,
                                                      language=self.language)

        # Reduce dimensionality with UMAP
        if self.seed_topic_list is not None and self.embedding_model is not None:
            y, embeddings = self._guided_topic_modeling(embeddings)
        umap_embeddings = self._reduce_dimensionality(embeddings, y)

        # Cluster UMAP embeddings with HDBSCAN
        documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)

        # Sort and Map Topic IDs by their frequency
        if not self.nr_topics:
            documents = self._sort_mappings_by_frequency(documents)

        # Extract topics by calculating c-TF-IDF
        self._extract_topics(documents)

        # Reduce topics
        if self.nr_topics:
            documents = self._reduce_topics(documents)

        self._map_representative_docs(original_topics=True)
        probabilities = self._map_probabilities(probabilities, original_topics=True)
        predictions = documents.Topic.to_list()

        ###################################
        ### CHANGES MADE HERE FOR aph61 ###
        ###################################
        self.sentence_embeddings = embeddings
        self.umap_embeddings = umap_embeddings

        return predictions, probabilities
```

Then, you should use the code as follows:

```python
topic_model = BERTopicNew()
topics, probs = topic_model.fit_transform(docs)
sentence_embeddings = topic_model.sentence_embeddings
umap_embeddings = topic_model.umap_embeddings
```
aph61 commented 2 years ago

I got the code working, thanks. There was some initial hiccup that I could track down to an earlier release of BERTopic, but that's solved. Two minor issues remained:

```python
# --- AHe, changes
self._map_representative_docs()
# self._map_representative_docs(original_topics=True)
# --- AHe, changes
probabilities = self._map_probabilities(probabilities)
# probabilities = self._map_probabilities(probabilities, original_topics=True)
```

The `_map_representative_docs` and `_map_probabilities` methods do not have the parameter `original_topics=` in my BERTopic version. I "solved" that by using the default value.

I use the extra code to generate a 3D map. That is easier to interpret (it's interactive), and also shows the cluster shape.

```python
#
# --- Calculate the 3D projection of the clusters
#
import umap

umap_projection = umap.UMAP(n_neighbors=15,
                            n_components=3,
                            min_dist=0.0,
                            metric='cosine').fit_transform(sentence_embeddings)
```

though I think that making the 3D projection from the 5D projection computed in BERTopic.fit_transform would be more correct.
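
A minimal sketch of that alternative (using the `umap_embeddings` attribute exposed by `BERTopicNew` above; parameters simply copied from the projection above):

```python
import umap

# Project the 5D UMAP embeddings from fit_transform down to 3D,
# instead of re-reducing the raw sentence embeddings
umap_projection = umap.UMAP(n_neighbors=15,
                            n_components=3,
                            min_dist=0.0,
                            metric='cosine').fit_transform(topic_model.umap_embeddings)
```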

Ultimately I create a DataFrame `df_results[["keyword", "size", "weights", "documents", "text", "doc_info", "probabilities"]]` where `["doc_info"]` contains topic, size and document_id.
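
Roughly, the assembly looks like this (a simplified, hypothetical sketch rather than the exact code; it uses `get_topic_info()` for the topic sizes and names):

```python
import pandas as pd

# Simplified sketch: combine the 3D projection with the fitted topic model
topic_info = topic_model.get_topic_info().set_index("Topic")

df_results = pd.DataFrame({"text": docs,
                           "topic_id": topics,
                           "x": umap_projection[:, 0],
                           "y": umap_projection[:, 1],
                           "z": umap_projection[:, 2]})
df_results["size"] = df_results["topic_id"].map(topic_info["Count"])
df_results["keyword"] = df_results["topic_id"].map(topic_info["Name"])
df_results["doc_info"] = [f"topic {t}, size {s}, doc {i}"
                          for i, (t, s) in enumerate(zip(topics, df_results["size"]))]
```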

I use this to create a df_plot

```python
min_topicsize = 500
max_topicsize = 10000
df_plot = df_results[(df_results["topic_id"] != -1)
                     & ((df_results["size"] > min_topicsize)
                        & (df_results["size"] < max_topicsize))].copy(deep=True)
```

that I plot using plotly (see code below). The final result is in the attached html (content removed)

I'm still working on a method for better keywords using spaCy (right now 1-grams are used), but so far this works nicely. The plot code:

```python
import os
import plotly.express as px

ranges = []
filename = maarten + "_3D-doc-topic.html"

fig = px.scatter_3d(df_plot, x='x', y='y', z='z',
                    color='topic_id',
                    size_max=0.05,
                    hover_data={'x': False,
                                'y': False,
                                'z': False,
                                'topic_id': False,
                                "doc info ": [" " + doct for doct in df_plot["doc_info"].tolist()],
                                'keywords ': [" " + keyw for keyw in df_plot["keyword"].tolist()],
                                'text ': [" " + text for text in df_plot["text"].tolist()]
                                })

fig.update_traces(marker=dict(size=1),
                  selector=dict(mode='markers'))
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
title = "Topics (> " + str(min_topicsize) + ", < " + str(max_topicsize) + " documents)"
fig.update_layout(title_text=title)

fig.write_html(os.path.join(results_dir, filename))
fig.show()
```
MaartenGr commented 2 years ago

Glad to hear that it worked out! Let me know if you have any other issues.

aph61 commented 2 years ago

Hi Maarten,

I don't know if this belongs in the GitHub issues, but below is a piece of code I use in conjunction with BERTopic to improve the topic characterization. If you find it useful I can put it on GitHub as a "suggestion".

Below are the rationale and the code (I hope it's understandable).

best,

Andreas


I've worked with BERTopic for a bit now, but for my use case a minus was the topic characterization: BERTopic generates words, not lemmas. As a result "hotels" and "hotel" are two different things, and n-grams like "San_Francisco" do not exist. I wrote (hacked) some code that generates the bigram "autonomous_car" for "autonomous cars", and "Donald_J_Trump_Jr" for "Donald J. Trump Jr." (the latter only as a difficult example). I've tested the lemmatization in a production environment. The algorithm is based on the fact that sequential PROPN-compound-PROPN tokens are nearly always bigrams, and ADJ-amod-NOUN pairs often are (it requires some counting for relevance: "autonomous_car" is a useful bigram, "yellow_car" isn't).

Below is the "code". As said, it works, but it's ugly:

```python
import re
import spacy

sent = \
"""
Paris Hilton  and autonomous cars in the World Championship Football played
on  Sunday
in New York; Donald J. Trump Jr. was also attending
"""

nlp = spacy.load("en_core_web_sm")

def generate_ngram(line, nlp=None, allowed_pos=["NOUN", "PROPN"]):
    #
    # --- generates (all) potential ngrams in a sentence. Using "allowed_pos" gives
    #     better content selection
    #
    if nlp == None:
        return line

    max_loop = 4
    line = line + " EOL"  # prevents PROPN-loop if sentence ends at two PROPNs
    lst_sent = [re.sub(r"[ ]{2,}", " ", re.sub(r"[\-\.]+", " ", line))]
    ind_sent = 0
    end_gram = False
    while ind_sent < max_loop and not end_gram:
        doc = nlp(lst_sent[ind_sent])
        print([(token.lemma_, token.pos_) for token in nlp(lst_sent[ind_sent])])
        lst_word = []
        ind_word = 0
        while ind_word < len(doc) - 1:
            if doc[ind_word].pos_ == "PROPN" and doc[ind_word+1].pos_ == "PROPN":
                # print(doc[ind_word].text + "_" + doc[ind_word+1].text)
                lst_word.append(doc[ind_word].text + "_" + doc[ind_word+1].text)
                ind_word += 2
            else:
                if doc[ind_word].text.count("_") == 0:
                    if doc[ind_word].pos_ == "ADJ"\
                    and doc[ind_word+1].pos_ == "NOUN"\
                    and doc[ind_word].dep_ == "amod":
                        lst_word.append("amod_" + doc[ind_word].text + "_" + doc[ind_word+1].text)
                        ind_word += 2
                    else:
                        lst_word.append(doc[ind_word].text)
                        ind_word += 1
                else:
                    lst_word.append(doc[ind_word].text)
                    ind_word += 1

        lst_word.append(doc[-1].text)

        lst_sent.append(re.sub("EOL", ".", " ".join(lst_word)))
        ind_sent += 1
        end_gram = lst_sent[ind_sent] == lst_sent[ind_sent-1]
    #
    # --- give potential bigrams (ADJ - amod - NOUN) the label "amod_". This way you always
    #     keep PROPN-ngrams (Hilary Rodham-Clinton, World Championship Football etc.)
    #
    sent = []
    for segm in lst_sent[-1].split(" "):
        if segm.count("amod_") > 0:
            line = " ".join(re.sub("amod_", "", segm).split("_"))
            sent.append("amod_" + "_".join([token.lemma_ for token in nlp(line)]))
        else:
            sent.append(segm)

    sentence = " ".join([token.lemma_ for token in nlp(" ".join(sent)) if token.pos_ in allowed_pos])

    return sentence
```

The test sentence gives the result

```python
text = generate_ngram(sent, nlp=nlp)
print(sent)
print(text)
```

```
Paris Hilton  and autonomous cars in the World Championship Football played
on  Sunday
in New York; Donald J. Trump Jr. was also attending

Paris_Hilton amod_autonomous_car world_championship_football Sunday New_York Donald_J_Trump_Jr
```

The prefix "amod_" indicate the grammatical structure the bigram comes from. As said, it works for me, but feedback (especially cases that don't work) is welcome.

Andreas

PS: Some people may find it useful to clean up the code


MaartenGr commented 2 years ago

@aph61 Thank you for the suggestion! There are indeed quite a number of methods for improving the generated topic representations, depending on the use case. Interesting to see that you attempted to preprocess the documents before passing them to BERTopic. A quick suggestion: it might be interesting to also look at the CountVectorizer, as it allows you to create your own custom tokenizer and preprocessor. Doing so makes it possible to keep the text as-is while performing the preprocessing only in the topic representation step, thereby keeping the two processes independent.
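
A minimal sketch of that idea (the spaCy-based lemma tokenizer here is just an illustration and not part of BERTopic):

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemma_tokenizer(text):
    # Lemmatize only for the topic representation; the raw text is still
    # used for embedding and clustering
    return [token.lemma_ for token in nlp(text) if token.is_alpha and not token.is_stop]

vectorizer_model = CountVectorizer(tokenizer=lemma_tokenizer, ngram_range=(1, 2))
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
```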

aph61 commented 2 years ago

I don't process the sentences before I do the clustering. My philosophy is that, because the language model you use is based on regular text, your documents should be regular text as well; pre-processing will (I think) make your clusters worse. The separate post-processing is done per cluster, as an alternative to characterizing the topics with straightforward c-TF-IDF. The code solves the plural/singular and bigram problems in the topic representation (not in the topic content, i.e. which sentence belongs to which topic). In your method you use MMR for better diversity; that is not included in mine, I still have to do that.

And using the CountVectorizer for speed: thanks for the pointer, the code needs serious improvement (I had quite a few coffees while waiting for the results ;). But I first wanted some feedback on whether this approach makes sense or not.

MaartenGr commented 2 years ago

Ah, right. In that case it might be worthwhile to look at the recently released KeyphraseVectorizer. It is very similar to what you have been doing. It allows for flexible POS patterns to extract the terms. Since you can integrate it directly into BERTopic, it maintains all the features of BERTopic including the usage of MMR.
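
A quick sketch of how that integration could look (assuming the `keyphrase_vectorizers` package; parameters left at their defaults):

```python
from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic

# Keyphrases (noun phrases extracted via POS patterns) are used only for the
# topic representation; embedding and clustering still run on the raw text
vectorizer_model = KeyphraseCountVectorizer()
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
```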