MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

chatgpt topic labels are brittle for large neighborhoods. #1066

Closed danielpatrickhug closed 1 year ago

danielpatrickhug commented 1 year ago

Large topic clusters summarized by ChatGPT get brittle topic labels: the labels overfit to the top three docs. I would like to add a new summarizer that summarizes the topic summaries generated from different samples of representative docs. Related to #1065.

MaartenGr commented 1 year ago

Could you explain a bit more in detail what that implementation would look like? Would you suggest creating an OpenAI summarizer? And how would the resulting labels be created?

danielpatrickhug commented 1 year ago

Hi, sure. To start, this is the current implementation:

    def extract_topics(self,
                       topic_model,
                       documents: pd.DataFrame,
                       c_tf_idf: csr_matrix,
                       topics: Mapping[str, List[Tuple[str, float]]]
                       ) -> Mapping[str, List[Tuple[str, float]]]:
        """ Extract topics
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top 4 representative documents per topic
        repr_docs_mappings, _, _ = topic_model._extract_representative_docs(c_tf_idf, documents, topics, 500, 4)

        # Generate using OpenAI's Language Model
        updated_topics = {}
        for topic, docs in repr_docs_mappings.items():
            prompt = self._create_prompt(docs, topic, topics)

            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)

            if self.chat:
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ]
                kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
                response = openai.ChatCompletion.create(**kwargs)
                label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
            else:
                response = openai.Completion.create(model=self.model, prompt=prompt, **self.generator_kwargs)
                label = response["choices"][0]["text"].strip()

            updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]

        return updated_topics

For each topic cluster, it gets the top 4 representative docs and the c-TF-IDF keywords, and generates a label using the prompt:

prompt = """
I have topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
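
To make the placeholders concrete, here is a minimal sketch of how [DOCUMENTS] and [KEYWORDS] might be substituted before the prompt is sent. The helper name `fill_prompt` and the bullet formatting are assumptions for illustration; this is not BERTopic's own `_create_prompt`.

```python
# Hypothetical helper for illustration only; NOT BERTopic's `_create_prompt`.
PROMPT = """
I have topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""


def fill_prompt(template, docs, keywords):
    """Replace the [DOCUMENTS] and [KEYWORDS] placeholders in the template."""
    # Assumed formatting: one bullet per representative document.
    doc_block = "".join(f"- {doc}\n" for doc in docs)
    filled = template.replace("[DOCUMENTS]", doc_block)
    # Assumed formatting: keywords joined by commas.
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


example = fill_prompt(
    PROMPT,
    docs=["Cats purr when content.", "Dogs bark at strangers."],
    keywords=["cat", "dog", "pet"],
)
print(example)
```

With only a handful of documents in the prompt, the label can only reflect those documents, which is the brittleness described above.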

And here's the function used to sample the docs, _extract_representative_docs():

    def _extract_representative_docs(self,
                                     c_tf_idf: csr_matrix,
                                     documents: pd.DataFrame,
                                     topics: Mapping[str, List[Tuple[str, float]]],
                                     nr_samples: int = 500,
                                     nr_repr_docs: int = 5,
                                     ) -> Union[List[str], List[List[int]]]:
        """ Approximate most representative documents per topic by sampling
        a subset of the documents in each topic and calculating which are
        most representative to their topic based on the cosine similarity between
        c-TF-IDF representations.
        Arguments:
            c_tf_idf: The topic c-TF-IDF representation
            documents: All input documents
            topics: The candidate topics as calculated with c-TF-IDF
            nr_samples: The number of candidate documents to extract per topic
            nr_repr_docs: The number of representative documents to extract per topic
        Returns:
            repr_docs_mappings: A dictionary from topic to representative documents
            representative_docs: A flat list of representative documents
            repr_doc_indices: The indices of representative documents
                              that belong to each topic
        """
        # Sample documents per topic
        documents_per_topic = (
            documents.groupby('Topic')
                     .sample(n=nr_samples, replace=True, random_state=42)
                     .drop_duplicates()
        )

        # Find and extract documents that are most similar to the topic
        repr_docs = []
        repr_docs_indices = []
        repr_docs_mappings = {}
        labels = sorted(list(topics.keys()))
        for index, topic in enumerate(labels):

            # Calculate similarity
            selected_docs = documents_per_topic.loc[documents_per_topic.Topic == topic, "Document"].values
            bow = self.vectorizer_model.transform(selected_docs)
            ctfidf = self.ctfidf_model.transform(bow)
            sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])

            # Extract top n most representative documents
            nr_docs = nr_repr_docs if len(selected_docs) > nr_repr_docs else len(selected_docs)
            indices = np.argpartition(sim_matrix.reshape(1, -1)[0],
                                      -nr_docs)[-nr_docs:]
            repr_docs.extend([selected_docs[index] for index in indices])
            repr_docs_indices.append([repr_docs_indices[-1][-1] + i + 1 if index != 0 else i for i in range(nr_docs)])
        repr_docs_mappings = {topic: repr_docs[i[0]:i[-1]+1] for topic, i in zip(topics.keys(), repr_docs_indices)}

        return repr_docs_mappings, repr_docs, repr_docs_indices
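
The selection step in the function above (cosine similarity to the topic vector, then np.argpartition) can be illustrated on toy data. This is a NumPy-only sketch with dense vectors instead of sparse c-TF-IDF matrices; `top_n_similar` is a hypothetical name, and all vectors are assumed to be non-zero.

```python
import numpy as np


def top_n_similar(doc_vectors, topic_vector, n):
    """Return indices of the n documents most similar to the topic vector."""
    # Cosine similarity between each document row and the topic vector.
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    topic_norm = np.linalg.norm(topic_vector)
    sims = doc_vectors @ topic_vector / (doc_norms * topic_norm)

    # argpartition finds the top n without fully sorting, so the
    # n returned indices are in no particular order.
    n = min(n, len(sims))
    top = np.argpartition(sims, -n)[-n:]

    # Optional extra step: order the selected documents by similarity.
    return top[np.argsort(sims[top])[::-1]]


docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
topic = np.array([1.0, 0.0])
best = top_n_similar(docs, topic, n=2)
print(best)
```

Because np.argpartition only guarantees which elements land in the top n, not their order, the representative docs returned by the original function are not necessarily ranked by similarity.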

Implementation 1: A new function could be created to sample points around the mean (or max, mode, etc.) of the local cluster distribution, within maybe 2 or 3 standard deviations. Then, for a large topic, you could run it, for example, 5 times with different random samples, get the topic labels, and summarize them with the prompt below. (I've tested this with Flan-T5 and got good results.)
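
A rough sketch of what such a sampling function might look like, under one possible reading of "within 2 or 3 standard deviations": keep documents whose distance to the cluster mean is at most the mean distance plus n_std standard deviations, then draw several random samples from them. The function name, parameters, and the use of document embeddings are all assumptions, not existing BERTopic API.

```python
import numpy as np


def sample_near_center(embeddings, n_docs=4, n_rounds=5, n_std=2.0, seed=42):
    """Draw n_rounds random samples of documents close to the cluster mean.

    embeddings: (n_documents, dim) array for ONE topic cluster (assumed input).
    Returns a list of index arrays, one per round.
    """
    rng = np.random.default_rng(seed)

    # Distance of every document to the cluster mean.
    center = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)

    # Keep documents within n_std standard deviations of the mean distance.
    cutoff = dists.mean() + n_std * dists.std()
    candidates = np.flatnonzero(dists <= cutoff)

    # Each round samples without replacement from the candidate pool.
    size = min(n_docs, len(candidates))
    return [rng.choice(candidates, size=size, replace=False)
            for _ in range(n_rounds)]


# Toy cluster: 200 nearby points plus one far-away outlier at index 200.
toy_rng = np.random.default_rng(0)
toy = np.vstack([toy_rng.normal(size=(200, 8)), np.full((1, 8), 100.0)])
rounds = sample_near_center(toy)
print([r.tolist() for r in rounds])
```

Each round would then be labeled separately, giving one candidate label per sample.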

Implementation 2: Get the top 20-30 (instead of 4) representative docs for each cluster (larger than n), randomly sample 3 docs m times from the list repr_docs_mappings[topic_id] in extract_topics, and run the ChatGPT pipeline m times in a nested for loop.

After you have the 5 topic variations you can use a prompt like:

f"""You have analyzed multiple topics and their corresponding documents and keywords. Your findings are summarized as follows:

{SUMMARY_OF_TOPICS}

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
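
Implementation 2 combined with the summary prompt above could be sketched roughly as follows. The model call is abstracted behind a `label_fn` callable (prompt text in, label out), which is a stand-in and not a real OpenAI or BERTopic function; `DOC_PROMPT`, `label_large_topic`, and the parameter names are likewise hypothetical.

```python
import random

# Simplified stand-in for the per-sample prompt (assumption).
DOC_PROMPT = """I have topic that contains the following documents:
{documents}

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""

SUMMARY_PROMPT = """You have analyzed multiple topics and their corresponding documents and keywords. Your findings are summarized as follows:

{summary_of_topics}

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""


def label_large_topic(repr_docs, label_fn, n_rounds=5, docs_per_round=3, seed=42):
    """Label several random samples of docs, then summarize the labels."""
    rng = random.Random(seed)
    k = min(docs_per_round, len(repr_docs))

    # One model call per random sample of representative documents.
    candidate_labels = []
    for _ in range(n_rounds):
        sample = rng.sample(repr_docs, k)
        candidate_labels.append(label_fn(DOC_PROMPT.format(documents="\n".join(sample))))

    # One final call that condenses the candidate labels into a single label.
    summary = "\n".join(f"- {label}" for label in candidate_labels)
    final_label = label_fn(SUMMARY_PROMPT.format(summary_of_topics=summary))
    return final_label, candidate_labels
```

Note that this costs n_rounds + 1 model calls per topic instead of one, so it would probably only be worth applying to topics above some size threshold.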

Another implementation I thought of yesterday, Implementation 3: extract all the docs for a given neighborhood (if large enough), rerun the pipeline on that subset to get local subclusters, then summarize the subcluster topic labels and update the main topic_model's topic labels. I haven't tested it, but I'm pretty sure this can be done without a modification.

Let me know what you think or if you have any suggestions :)

danielpatrickhug commented 1 year ago

In some cases I've also had success passing in a hierarchical topic tree of a cluster. For example, I wrote a notebook that recursively walks through a GitHub repo, converts the Python files into an AST-like data structure, runs multiple summary system prompts over the code, and then topic-models the outputs.

system_prompts = {
        "summary": f"""
                    Summarize the code the GitHub repository: {git_repo_path} you're currently in the file {file_name}
                    ChatGPT will use its advanced natural language processing capabilities to analyze the code and generate a concise summary that captures
                    the main functionality and purpose of the codebase. Additionally, ChatGPT can provide insights into the programming languages and libraries used in the repository,
                    as well as any notable features or functionalities that are present. Simply provide the necessary information, and let ChatGPT do the rest! 
                    """,
        "bug_finder": f"""
                    Help identify bugs in the codebase of the GitHub repository: {git_repo_path} you're currently in the file {file_name}.
                    ChatGPT will analyze the codebase to identify potential bugs and provide suggestions on how to fix them.
                    Let ChatGPT assist you in your debugging efforts and make your code more robust and reliable.
                    """,
        "todo_labeler": f"""
                    Automatically label and generate TODO comments in the codebase of the GitHub repository: {git_repo_path} you're currently in the file {file_name}.
                    ChatGPT will scan the codebase for any potential TODO tasks and categorize them based on their priority and complexity.
                    Let ChatGPT help you stay organized and on top of your development tasks.
                    """,
        "code_suggestions": f"""
                    Get suggestions for improving the codebase of the GitHub repository: {git_repo_path} you're currently in the file {file_name}.
                    ChatGPT will analyze the codebase and provide suggestions for improving the code quality, optimizing performance, and enhancing functionality.
                    Let ChatGPT help you take your codebase to the next level.
                    """,
        "question_asking": f"""
                    Ask ChatGPT questions about the codebase of the GitHub repository: {git_repo_path} you're currently in the file {file_name}.
                    ChatGPT asks questions that a new developer may ask about the codebase used in the repository, 
                    as well as answer the question with step by step reasoning as a senior dev would. All responses should first ask a question and then answer with reasoning.
                    """,
        "complement_code": f"""
                    Give compliments on the codebase of the GitHub repository: {git_repo_path} you're currently in the file {file_name}.
                    ChatGPT will analyze the codebase and provide positive feedback on the structure, organization, readability, and other aspects of the code.
                    Let ChatGPT help you feel good about your code and celebrate your accomplishments!
                    """,
}

f"""You have analyzed multiple topics and their corresponding documents and keywords. Your findings are summarized as follows:

.
├─■──Refactoring and improving function documentation and error handling____ ── Topic: 3
└─Code documents and embeddings in topic modeling pipeline____
     ├─■──Improving BERTopic Code and Visualization through Functionality Updates and Customization____ ── Topic: 1
     └─Code organization and improvement suggestions for embedding models____
          ├─■──Improving Logging Class, Dimensionality Reduction Class and Zero-Shot Classification Exception Handl ── Topic: 0
          └─■──Suggestions to improve Scikit-Learn based embedding model in Python codebase____ ── Topic: 2

Based on the information above, generate a short topic label summarizing the entire tree in the following format:
topic: <topic label>
"""

topic: Code organization and improvement suggestions for embedding models, BERTopic code and visualization improvements, and refactoring function documentation and error handling.

danielpatrickhug commented 1 year ago

There's also the option to generate image prompts from the clusters.

danielpatrickhug commented 1 year ago

Here is a topic tree (with added code) built from the graph Laplacian and message-passing features of the message_passing aggregated-features adjacency matrix. I also added singular value plots for the different adjacency matrices and graph Laplacians: https://colab.research.google.com/drive/17Ke5VkBRFkM1RIX9Vi1LDxLTGZGGN_vS?usp=sharing

.
├─Embedding models for text similarity____
│    ├─■──Class-based TF-IDF procedure and c-TF-IDF formula.____ ── Topic: 4
│    └─Hugging Face transformers model with feature generation pipeline for extracting embeddings____
│         ├─Visualization of Topics and Hierarchy using BERTopic and Spacy____
│         │    ├─Topic visualization with embedding models____
│         │    │    ├─■──Topic Mapping and Management in BERTopic____ ── Topic: 2
│         │    │    └─■──Visualization of hierarchical topics using spaCy embeddings with BERTopic____ ── Topic: 0
│         │    └─■──Text Generation with Transformers - Adding Documentation and Improving User Options____ ── Topic: 1
│         └─Scikit-Learn based Pipeline for Embedding Words and Documents____
│              ├─■──Scikit-Learn based document and word embedding with flexible API and verbosity control using Sklearn ── Topic: 3
│              └─■──Online CountVectorizer with Updating Vocabulary in NLP Pipeline with Hugging Face's Transformers____ ── Topic: 6
└─HDBSCAN for generating predictions and probabilities in a custom tool____
     ├─■──Notification of missing dependencies for string matching model.____ ── Topic: 7
     └─■──HDBSCAN implementation for generating predictions and probabilities with cluster models and approxim ── Topic: 5

MaartenGr commented 1 year ago

> Implementation 1: A new function could be created to sample points around the mean (or max, mode, etc.) of the local cluster distribution, within maybe 2 or 3 standard deviations. Then, for a large topic, you could run it, for example, 5 times with different random samples, get the topic labels, and summarize them with the prompt below. (I've tested this with Flan-T5 and got good results.)

To an extent, this functionality is found within ._extract_representative_docs, as it samples around the topic vector, which I prefer over taking the centroid, as that tends to focus on local representations. The resulting representative documents can be embedded and diversified with MMR to generate a more diverse representation, but controlling the diversity is a bit difficult as you do not know beforehand what is acceptable.
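
For reference, a minimal NumPy sketch of maximal marginal relevance (MMR) over document embeddings. It is an illustrative standalone version, not BERTopic's internal implementation; the function name and the `diversity` weighting are assumptions, and all vectors are assumed to be non-zero.

```python
import numpy as np


def mmr_select(topic_vec, doc_vecs, top_n=4, diversity=0.3):
    """Greedily pick documents that are relevant to the topic yet dissimilar
    to the documents already picked. diversity=0 means pure relevance."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    docs = normalize(np.asarray(doc_vecs, dtype=float))
    topic = normalize(np.asarray(topic_vec, dtype=float))

    relevance = docs @ topic   # cosine similarity of each doc to the topic
    doc_sim = docs @ docs.T    # pairwise cosine similarity between docs

    # Start with the single most relevant document.
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(docs)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity to anything already selected.
        redundancy = doc_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[remaining] - diversity * redundancy
        pick = remaining[int(np.argmax(scores))]
        selected.append(pick)
        remaining.remove(pick)
    return selected


docs = np.array([[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]])
topic = np.array([1.0, 0.0])
print(mmr_select(topic, docs, top_n=2, diversity=0.0))  # relevance only
print(mmr_select(topic, docs, top_n=2, diversity=0.7))  # penalizes near-duplicates
```

The `diversity` weight is exactly the knob that is hard to set in advance: too low and the near-duplicate documents win, too high and weakly related documents get selected.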

> Implementation 2: Get the top 20-30 (instead of 4) representative docs for each cluster (larger than n), randomly sample 3 docs m times from the list repr_docs_mappings[topic_id] in extract_topics, and run the ChatGPT pipeline m times in a nested for loop.

I think when ChatGPT needs to be run multiple times per topic, it will need additional documentation/warnings as one major advantage of using ChatGPT within BERTopic is that only a single API call is needed per topic. This is helpful to trial users of OpenAI/Cohere and reduces computation when using flan-like models.

Now that I think about it, when you run the pipeline multiple times, perhaps something like this is much better suited for the LangChain component than the ChatGPT one.

> Implementation 3: extract all the docs for a given neighborhood (if large enough), rerun the pipeline on that subset to get local subclusters, then summarize the subcluster topic labels and update the main topic_model's topic labels. I haven't tested it, but I'm pretty sure this can be done without a modification.

Just to be sure I understand correctly, you mean rerunning the entire BERTopic pipeline on the local neighborhood, right? If so, then I would not underestimate the complexity with respect to parameter tuning.

All in all, I think a minimal solution would be to simply increase the number of documents passed to ChatGPT, although that would indeed not guarantee a sufficiently diverse representation of documents. Moreover, a single solution would be needed, as this implementation would also be added to all text generation methods.

danielpatrickhug commented 1 year ago

> Just to be sure I understand correctly, you mean rerunning the entire BERTopic pipeline on the local neighborhood, right? If so, then I would not underestimate the complexity with respect to parameter tuning.

I could be wrong, but the node embeddings seem to be a lot more robust to changes in hyperparameters, since they've been preprocessed with message passing and aggregation. So similar docs are already grouped into subgraphs.

MaartenGr commented 1 year ago

> I could be wrong, but the node embeddings seem to be a lot more robust to changes in hyperparameters, since they've been preprocessed with message passing and aggregation. So similar docs are already grouped into subgraphs.

There are a number of things that can affect cluster generation. For example, if a cluster does not have a sufficient number of documents, then HDBSCAN will not cluster them. So using a default pipeline for local documents might not work and would require adjusting some hyperparameters. I'm not saying it should not be done, but just be aware that it might be difficult to do this in a model-agnostic way. It is possible to use HDBSCAN, create many micro-clusters, and build them up using its internal hierarchy. This approach is also used in thisnotthat.

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!