composable-models / llm_multiagent_debate

ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate

How to do summarization to reduce the context length? #11

Closed gauss5930 closed 8 months ago

gauss5930 commented 10 months ago

Hello, I read your paper with great interest and have a question, which is why I'm posting it as an issue.

In Section 3.3 (Analysis) of the paper, it is mentioned that performance improved by summarizing the LM agents' responses into one, due to the difficulty of managing excessively long context lengths when concatenating agent responses. Could you please provide more details about the setup for this summarization? I have organized my questions as follows:

  1. The paper mentions that summarization was applied when using 5 or more agents. Does this mean that with fewer agents, context length management was not as critical and summarization wasn't necessary?
  2. Were the answers summarized using a Language Model (LM)? If so, could you specify which model was utilized for this purpose?
  3. If you used ChatGPT as a summarization model, could you share the specific prompts that were employed?

I want to express my gratitude once again for conducting such outstanding research and for crafting this paper. I'm currently involved in a project to implement multi-agent debate using various open-source LLMs, and your answers to these questions would be very helpful!

yilundu commented 10 months ago

Hi,

Thanks for the interest! Feel free to ask any additional follow-up questions.

In general, we found summarization to help -- it represents the results from the agents more concisely. I've attached the code below -- we use ChatGPT as the summarization model.

import openai
import json
import numpy as np
import time
import pickle
from tqdm import tqdm

def parse_bullets(sentence):
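    # Split a response on newlines and strip leading list markers/numbering from each line.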
    bullets_preprocess = sentence.split("\n")
    bullets = []

    for bullet in bullets_preprocess:
        try:
            # Find the index of the first alphabetic character; next() raises StopIteration if there is none.
            idx = bullet.find(next(filter(str.isalpha, bullet)))
        except StopIteration:
            continue

        bullet = bullet[idx:]

        if len(bullet) != 0:
            bullets.append(bullet)

    return bullets

def filter_people(person):
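    # Keep only the text before any parenthetical, e.g. "Name (role)" -> "Name ". (Unused in this script.)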
    people = person.split("(")[0]
    return people

def generate_answer(answer_context):
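    # Query the chat model; on any API error, wait 20 seconds and retry.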
    try:
        completion = openai.ChatCompletion.create(
                  model="gpt-3.5-turbo-0301",
                  messages=answer_context,
                  n=1)
    except Exception:
        print("retrying due to an error......")
        time.sleep(20)
        return generate_answer(answer_context)

    return completion

def summarize_message(agent_contexts):
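    # Collect each agent's most recent reply and ask the model to summarize the set of opinions.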
    prefix_string = "Here are a list of opinions from different agents: "

    for agent in agent_contexts:
        agent_response = agent[-1]["content"]
        response = "\n\n One agent response: ```{}```".format(agent_response)

        prefix_string = prefix_string + response

    prefix_string = prefix_string + "\n\n Write a summary of the different opinions from each of the individual agent."
    agent_context = [{"role": "user", "content": prefix_string}]
    completion = generate_answer(agent_context)
    content = completion["choices"][0]["message"]["content"]

    return content

def construct_message(summary, question):
    # Build the debate prompt from the summary. Note that `question` is currently unused,
    # since the original question already sits at the start of each agent's context.
    prefix_string = "Here is a summary of responses from other agents: {}".format(summary)

    prefix_string = prefix_string + "\n\n Use these opinions carefully as additional advice, can you provide an updated answer? Make sure to state your answer at the end of the response."
    # Alternative phrasing:
    # prefix_string = prefix_string + "\n\n Using these opinions, can you provide an updated answer? Make sure to state your answer at the end of the response."
    return {"role": "user", "content": prefix_string}

def construct_assistant_message(completion):
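    # Wrap the model's reply as an assistant turn for the next round's context.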
    content = completion["choices"][0]["message"]["content"]
    return {"role": "assistant", "content": content}

def parse_answer(sentence):
    parts = sentence.split(" ")
    # Sequentially parse for the last number in the sentence

    for part in parts[::-1]:
        try:
            answer = float(part)
            return answer
        except ValueError:
            continue

def most_frequent(List):
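    # Majority vote: return the element that appears most often in the list.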
    counter = 0
    num = List[0]

    for i in List:
        current_frequency = List.count(i)
        if current_frequency > counter:
            counter = current_frequency
            num = i

    return num

if __name__ == "__main__":
    # Quick sanity check of parse_answer (the result is overwritten below).
    answer = parse_answer("My answer is the same as the other agents and AI language model: the result of 12+28*19+6 is 550.")

    agents = 4
    rounds = 2
    np.random.seed(0)

    evaluation_round = 100
    scores = []

    generated_description = {}

    for eval_round in tqdm(range(evaluation_round)):
        a, b, c, d, e, f = np.random.randint(0, 30, size=6)

        answer = a + b * c + d - e * f
        agent_contexts = [[{"role": "user", "content": """What is the result of {}+{}*{}+{}-{}*{}? Make sure to state your answer at the end of the response.""".format(a, b, c, d, e, f)}] for agent in range(agents)]

        content = agent_contexts[0][0]['content']
        question_prompt = "We seek to find the result of {}+{}*{}+{}-{}*{}?".format(a, b, c, d, e, f)

        for round in range(rounds):
            for i, agent_context in enumerate(agent_contexts):

                if round != 0:
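                    # From round 1 on, summarize every agent's previous answer and feed it back as advice.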
                    # agent_contexts_other = agent_contexts[:i] + agent_contexts[i+1:]
                    # message = construct_message(agent_contexts_other, question_prompt)
                    summary = summarize_message(agent_contexts)
                    message = construct_message(summary, question_prompt)
                    agent_context.append(message)

                    print("message: ", message)
                completion = generate_answer(agent_context)

                assistant_message = construct_assistant_message(completion)
                agent_context.append(assistant_message)
                print(completion)

        text_answers = []

        for agent_context in agent_contexts:
            text_answer = agent_context[-1]["content"]
            # Swap commas for periods so tokens like "550," still parse as floats.
            text_answer = text_answer.replace(",", ".")
            text_answer = parse_answer(text_answer)

            if text_answer is None:
                continue

            text_answers.append(text_answer)

            # print("text_answer: ", text_answer, answer)

            # if text_answer == answer:
            #     scores.append(1)
            # else:
            #     scores.append(0)

        # Stringify the tuple key so the dict can be serialized with json.dump below.
        generated_description[str((a, b, c, d, e, f))] = (agent_contexts, answer)

        try:
            text_answer = most_frequent(text_answers)
            if text_answer == answer:
                scores.append(1)
            else:
                scores.append(0)
        except IndexError:
            # No agent produced a parseable answer this round.
            continue

        print("performance:", np.mean(scores), np.std(scores) / (len(scores) ** 0.5))

    json.dump(generated_description, open("summarize_math_{}_{}.json".format(agents, rounds), "w"))
    # pickle.dump(generated_description, open("math_short_agents{}_rounds{}.p".format(agents, rounds), "wb"))
    # Debug hook left in to inspect the final episode interactively.
    import pdb
    pdb.set_trace()
    print(answer)
    print(agent_context)
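
A note for readers arriving later: the snippet above targets the legacy openai<1.0 Python SDK, where openai.ChatCompletion.create and dict-style responses existed. A minimal sketch of the same generate_answer call against the v1 client -- assuming openai>=1.0, an OPENAI_API_KEY environment variable, and a currently available model name -- might look like:

import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(answer_context):
    # Same retry-on-error behavior as above, ported to the v1 chat.completions API.
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumption: gpt-3.5-turbo-0301 has since been retired
            messages=answer_context,
            n=1)
    except Exception:
        print("retrying due to an error......")
        time.sleep(20)
        return generate_answer(answer_context)

    return completion

v1 responses are objects rather than dicts, so the accessors elsewhere in the script would change accordingly, e.g. completion["choices"][0]["message"]["content"] becomes completion.choices[0].message.content.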
gauss5930 commented 9 months ago

Hello yilundu! I'm reaching out because I have good news about the multi-agent debate implementation with open-source LLMs that I mentioned earlier.

The project was inspired by the question "Is multi-agent debate effective with open-source LLMs?" To answer it, I ran 'LLM Agora', a project that applies multi-agent debate to several open-source LLMs and examines whether it is actually effective. I am reaching out now that the project is complete.

A brief summary of the results is as follows:

  1. Multi-agent debate was also effective with open-source LLMs, not just proprietary ones.
  2. The benefit of multi-agent debate is small when the open-source LLMs' responses are of low quality, but when responses of sufficient quality were provided, the open-source LLMs did benefit from debate.

Through the LLM Agora project, I was able to confirm that the multi-agent debate proposed in "Improving Factuality and Reasoning in Language Models through Multiagent Debate" is effective not only with proprietary models but also with open-source ones. The experimental results are documented extensively in my GitHub repository, and LLM Agora is implemented as a HuggingFace Space, so the effectiveness of multi-agent debate with various open-source LLMs can be verified there.

I was inspired by the multi-agent debate introduced in your paper, which led to the initiation of the LLM Agora project. I am extremely grateful to you for writing such an outstanding paper. It would be a tremendous honor if you and your co-authors could take a look at my project!

yilundu commented 9 months ago

Great, thanks!! Looks super cool -- I added a reference to your repo in this repo's README 🙂