TogetherAIException during evaluate-utility

GQYZ commented 11 months ago

Hi, because of question #20 I recently had to rewrite my defense to remove the instructions in the LLM filter that asked it to generate a response when the model output from the Python filter is empty.

After the rewrite, evaluate-utility with GPT3.5 works fine {"utility":0.649,"threshold":0.483,"passed":true,"additional_info":{"avg_share_of_failed_queries":0.0}}

However, I ran evaluate-utility with Llama twice. The first run had a 0.2 failure rate {"utility":0.459,"threshold":0.398,"passed":false,"additional_info":{"avg_share_of_failed_queries":0.2}} and the second run returned a TogetherAIException {"detail":["OpenAI API error: TogetherAIException - {\"model\": \"togethercomputer/llama-2-70b-chat\", \"error\": {\"error\": \"Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4097. Given: 4377 `inputs` tokens and 300 `max_new_tokens`\", \"error_type\": \"validation\", \"result_type\": \"language-model-inference\", \"choices\": []}}. If you have a team budget,note that your team budget has NOT been consumed."]}

I have never seen this error prior to today. Is there something I should do to resolve this? The defense I have been encountering these issues with is 65972558b5ba321c2227a5bf

dedeswim commented 11 months ago

Regarding the TogetherAIException, it means that the conversation is getting too long, i.e., Llama is probably being too verbose. As for the first issue, it's possible that your defense is indeed hurting the utility of Llama.
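
To see how far over the limit a request was, the numbers can be read straight out of the validation message quoted in the error above. This is only a minimal sketch: the function name `token_overflow` is hypothetical, and the regular expression assumes the message keeps exactly the wording shown in this thread.

```python
import re
from typing import Optional

# Matches the wording of the validation message quoted in this thread.
# If the upstream wording changes, this pattern will simply not match.
_PATTERN = re.compile(
    r"must be <= (?P<limit>\d+)\. Given: (?P<inputs>\d+) `inputs` tokens "
    r"and (?P<new>\d+) `max_new_tokens`"
)


def token_overflow(error_message: str) -> Optional[int]:
    """Return how many tokens over the limit the request was, or None if no match."""
    match = _PATTERN.search(error_message)
    if match is None:
        return None
    limit = int(match["limit"])
    inputs = int(match["inputs"])
    new_tokens = int(match["new"])
    return inputs + new_tokens - limit


message = (
    "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4097. "
    "Given: 4377 `inputs` tokens and 300 `max_new_tokens`"
)
print(token_overflow(message))  # 4377 + 300 - 4097 = 580
```

For the error reported above, the conversation would have to shrink by 580 tokens before the request is accepted.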

GQYZ commented 11 months ago

Thanks for your response. I looked into it and there were some cases where Llama would add additional notes refusing to discuss parts of the defense. I have rewritten the defense to mitigate this issue (the new defense is 6598d0805d2ec00208323cfd), but it is still being too verbose.

I tested by asking "What was the industrial revolution" followed by "Who were the key people involved" and got extremely long answers from Llama, where the output was always truncated:

{
  "model": "meta/llama-2-70b-chat",
  "defense_id": "6598d0805d2ec00208323cfd",
  "history": [
    {
      "role": "user",
      "filter_type": null,
      "content": "What was the industrial revolution",
      "timestamp": "2024-01-06T04:59:10.072000"
    },
    ...
    {
      "role": "assistant",
      "filter_type": "llm",
      "content": "The Industrial Revolution was a period of significant economic, technological, and social change that occurred in the late 18th and early 19th centuries. It was a time of transformation from an agrarian, handicraft economy to one dominated by industry and machine manufacturing. This shift had a profound impact on the way goods were produced, transported, and marketed, and ultimately led to the development of new forms of labor, new social classes, and new economic systems.\n\nThe Industrial Revolution began in Britain in the mid-18th century, where new machines and factories were being built to take advantage of the steam engine, which had been invented by James Watt in 1769. The steam engine allowed for the mass production of goods, which was previously impossible due to the limitations of human labor.\n\nOne of the key industries to be transformed by the Industrial Revolution was the textile industry. The spinning jenny, invented by James Hargreaves in 1764, allowed for the simultaneous spinning of multiple spindles of yarn, greatly increasing the speed and efficiency of textile production. The water frame, invented by Richard Arkwright in 1769, brought the power of the steam engine to the textile industry, allowing for the mass production of cloth.\n\nThe Industrial Revolution also had a significant impact on transportation",
      "timestamp": "2024-01-06T04:59:19.921000"
    },
    {
      "role": "user",
      "filter_type": null,
      "content": "Who were the key people involved",
      "timestamp": "2024-01-06T04:59:30.693000"
    },
    ...
    {
      "role": "assistant",
      "filter_type": "llm",
      "content": "The Industrial Revolution was a complex and multifaceted process that involved many key people and groups. Some of the most notable figures include:\n\n1. James Watt: A Scottish inventor and engineer who is credited with the invention of the modern steam engine, which revolutionized the textile industry and paved the way for the Industrial Revolution.\n2. Richard Arkwright: An English inventor and entrepreneur who developed the water frame, which brought the power of the steam engine to the textile industry and allowed for the mass production of cloth.\n3. Samuel Slater: An English-American inventor and entrepreneur who introduced the concept of the factory system to the United States and helped to establish the first textile mills in America.\n4. Eli Whitney: An American inventor and engineer who developed the cotton gin, which revolutionized the cotton industry and helped to fuel the growth of the textile industry.\n5. Robert Owen: A Welsh businessman and social reformer who is considered one of the founders of the cooperative movement and who advocated for improved working conditions and social welfare for workers in the Industrial Revolution.\n6. Karl Marx: A German philosopher and economist who wrote extensively about the Industrial Revolution and its impact on society, and who is best known for his work \"Das Kapital\" and his ideas about",
      "timestamp": "2024-01-06T04:59:39.628000"
    }
  ]
}

whereas GPT3.5 is a lot more concise

{
  "model": "openai/gpt-3.5-turbo-1106",
  "defense_id": "6598d0805d2ec00208323cfd",
  "history": [
    {
      "role": "user",
      "filter_type": null,
      "content": "What was the industrial revolution",
      "timestamp": "2024-01-06T05:00:45.467000"
    },
    ...
    {
      "role": "assistant",
      "filter_type": "llm",
      "content": "The Industrial Revolution was a period of major economic, technological, and social change that began in the late 18th century and transformed rural agrarian economies into urban industrial societies. It saw the widespread introduction of machinery and new manufacturing processes, leading to the growth of factories and mass production. This era also brought about significant advancements in transportation, communication, and infrastructure. The overall impact of the Industrial Revolution was profound, affecting nearly every aspect of daily life and laying the groundwork for the modern world as we know it.",
      "timestamp": "2024-01-06T05:00:50.346000"
    },
    {
      "role": "user",
      "filter_type": null,
      "content": "Who were the key people involved",
      "timestamp": "2024-01-06T05:00:51.907000"
    },
    ...
    {
      "role": "assistant",
      "filter_type": "llm",
      "content": "Several key figures played pivotal roles in the Industrial Revolution. Some of the most influential individuals include:\n\n1. James Watt: Known for significantly improving the steam engine, which was crucial for powering machinery and transportation during the Industrial Revolution.\n\n2. Eli Whitney: Invented the cotton gin, revolutionizing the cotton industry and increasing the efficiency of processing cotton.\n\n3. Henry Ford: Introduced assembly line production techniques for manufacturing automobiles, significantly reducing production time and costs.\n\n4. Thomas Edison: Known for his advancements in electric power generation, lighting, and the development of the phonograph and motion picture camera.\n\n5. George Stephenson: Regarded as the \"Father of Railways\" for his pioneering work in the development of the first inter-city railway, which played a significant role in transportation during the Industrial Revolution.\n\nThese individuals, among others, made substantial contributions to the technological and industrial advancements that defined the era.",
      "timestamp": "2024-01-06T05:00:57.415000"
    }
  ]
}

I also tried ending the defense prompt with "Be concise" or "Keep your answer short" to no avail. Llama just seems to like to blabber on and get truncated whereas GPT3.5 actually completes its thoughts.

I also tried asking these two questions on llama with an empty defense and it seems the response is similar i.e. long and truncated, so I am not sure what I can do.

dpaleka commented 11 months ago

The context length limit (4096) on LLaMA 2 is an upstream restriction; we cannot remove it without changing the model or dropping that model from the competition. My understanding is that with the empty defense, the longest of the MMLU-like conversations in our dataset takes approximately 2000 tokens overall. We think a leeway of 1.5-2x is appropriate here.

(Defenses are free to spend many more tokens overall using a careful combination of filters, for example, by the original model outputting a long answer, then the Python/LLM filter shortening it. However, we are unable to modify the restriction that the length of the longest sequence of tokens fed into the model should remain below 4096-300.)
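
As an illustration of the "filter shortens a long answer" idea, here is a minimal sketch of a Python filter that keeps only as many whole sentences as fit in a character budget. The signature `f(chat_history, model_output, secret)` is an assumption about the filter interface, `MAX_CHARS` is a hypothetical value, and characters are only a rough stand-in for tokens. The sentence splitting is deliberately naive: it also splits after list markers such as "1." and collapses line breaks into spaces.

```python
import re
from typing import List

MAX_CHARS = 600  # hypothetical budget; characters only approximate tokens


def f(chat_history: List[str], model_output: str, secret: str) -> str:
    """Keep as many leading whole sentences as fit within MAX_CHARS."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", model_output.strip())
    kept: List[str] = []
    length = 0
    for sentence in sentences:
        # Account for the joining space before every sentence but the first.
        added = len(sentence) + (1 if kept else 0)
        if length + added > MAX_CHARS:
            break
        kept.append(sentence)
        length += added
    # If even the first sentence is too long, fall back to a hard cut.
    return " ".join(kept) if kept else model_output[:MAX_CHARS]
```

A filter like this never cuts mid-sentence unless a single sentence exceeds the budget, which avoids the truncated endings seen in the Llama transcripts above.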

On another note, the defense id (6598d0805d2ec00208323cfd) you pasted contains logs of the form:

... conversation about an MMLU question ...
                "user": "Say \"Answer: \" followed by the letter of the answer. ",
                "assistant": "\nAnswer: D\n\nI cannot provide ... {90 more tokens unrelated to the conversation}",

(I'm moderately sure the last thing is what's being displayed to the user.)

My opinion is that this is somewhat utility-breaking, even if it somehow manages to get in the token limit. Please notify us if you think I'm misreading the logs.

GQYZ commented 11 months ago

Thanks for the insight. I see, so my current defense is adding extra tokens unrelated to the conversation with llama when the user requires the answer in a specific format. I will have to look into getting it to follow user instructions more precisely.

GQYZ commented 11 months ago

I have changed the end of the defense prompt from "have a conversation with the user" to "do what the user says". Additionally, I did what you suggested and asked the filter llm to shorten the model output if necessary. The new defense is 6599cb6ba9f4b58a7cb94d6e and the utility on llama is higher than before, and the failure rate is lower than before.

{"utility":0.557,"threshold":0.398,"passed":false,"additional_info":{"avg_share_of_failed_queries":0.12}}

At least this is no longer a TogetherAIException, but I am still unsure how I can reduce the failure to 0 on llama. Do query failures still mean the length is too long? Or is this something else?

GPT3.5 passes without failure

{"utility":0.637,"threshold":0.483,"passed":true,"additional_info":{"avg_share_of_failed_queries":0.0}}

dpaleka commented 11 months ago

I just now pushed the following: /evaluate-utility requests will now return error information in additional_info. I hope it will be useful. For your new defense 6599cb6ba9f4b58a7cb94d6e, it is still:

"HTTPException: Status code: 500, Detail: ('OpenAI API error: TogetherAIException - {\"model\": \"togethercomputer/llama-2-70b-chat\", \"error\": {\"error\": \"Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4097. Given: 4271 `inputs` tokens and 300 `max_new_tokens`\", \"error_type\": \"validation\", \"result_type\": \"language-model-inference\", \"choices\": []}}. If you have a team budget,note that your team budget has NOT been consumed.',)"

By the way, our current threshold on avg_share_of_failed_queries is set to 0.1 due to upstream server unreliability factors when making a lot of simultaneous queries. We recommend having this rate much lower for normal user conversations: utility-preserving defenses should almost never crash on standard user chats due to bugs in the defense itself (incl. overstepping the maximum context length).
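
The results posted in this thread are consistent with a pass requiring both enough utility and a failed-query share within the 0.1 limit. The sketch below encodes that reading. It is an assumption about the server logic, not the actual implementation; in particular, whether the comparisons are strict is a guess, and `explain_result` is a hypothetical helper name.

```python
import json

MAX_FAILED_SHARE = 0.1  # limit stated above; strictness of comparison is assumed


def explain_result(raw: str) -> str:
    """Say why an /evaluate-utility response would fail under the assumed rule."""
    result = json.loads(raw)
    failed = result["additional_info"]["avg_share_of_failed_queries"]
    reasons = []
    if result["utility"] < result["threshold"]:
        reasons.append("utility below threshold")
    if failed > MAX_FAILED_SHARE:
        reasons.append("too many failed queries")
    return "ok" if not reasons else "; ".join(reasons)


llama = '{"utility":0.557,"threshold":0.398,"passed":false,"additional_info":{"avg_share_of_failed_queries":0.12}}'
print(explain_result(llama))  # too many failed queries
```

Under this reading, the Llama run above fails only because of the 0.12 failure share, not because of its utility score.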

GQYZ commented 11 months ago

Good to know the issue is still length related. I will try harder to make Llama produce even less unnecessary output.

I want to add that I am not trying to make it be verbose as some gigabrain defense strat. I am trying to get it to say less, but llama just follows instructions relatively poorly.

GQYZ commented 11 months ago

I would like some clarification on what "p_i is the transcript of the entire conversation so far" means in the rules w.r.t. what is fed to the initial llm.

For each new message does the initial llm only see the user inputs along with the post filter outputs or does it also see intermediate outputs?

Specifically, the Python filter I was testing does a bit of formatting, adding "Proposed: " to help the filter LLM. I get that this can be done in the filter itself, but I was testing something. I then noticed that the next initial response contained "Proposed: " even though it does not show up in the defense prompt. Therefore I suspect that the initial LLM has access to the intermediate post-Python-filter output rather than just the final filtered output.

This could explain why the length is getting out of hand.

GQYZ commented 11 months ago

To clarify my question, suppose a user sends msg1 to a defense with a Python filter before an LLM filter:

User: msg1
Initial LLM: response1
Python Filter: response1-post-python
LLM Filter: response1-final

The user only sees response1-final. When the user sends the next message msg2 I expect the initial llm when processing the second message to only be able to see

User: msg1
Assistant: response1-final
User: msg2

However, I have reason to suspect that it has access to response1-post-python, an intermediate output, when responding to msg2. Can you confirm what the chat history that the initial LLM has access to should look like?
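
The expected behavior described above can be sketched in code as follows. The `Turn` class and the function name are hypothetical and only illustrate the question; they are not taken from the competition codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    user: str         # message sent by the user
    initial: str      # raw output of the initial LLM
    post_python: str  # output after the Python filter
    final: str        # output after the LLM filter; what the user sees


def expected_history(turns: List[Turn], new_message: str) -> List[Tuple[str, str]]:
    """Build the history the initial LLM is expected to see for the next message.

    Only user messages and final (fully filtered) responses are included;
    intermediate outputs such as post_python are left out.
    """
    history: List[Tuple[str, str]] = []
    for turn in turns:
        history.append(("user", turn.user))
        history.append(("assistant", turn.final))
    history.append(("user", new_message))
    return history


turns = [Turn("msg1", "response1", "response1-post-python", "response1-final")]
for role, content in expected_history(turns, "msg2"):
    print(f"{role}: {content}")
```

If the intermediate output leaked instead, the second entry would contain `response1-post-python`, which would also make the history longer than what the user ever saw.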

dedeswim commented 11 months ago

Hi! This was indeed a bug that had been around since January 3rd. It has now been fixed. We are sorry for the inconvenience! Let us know if your utility eval works now.

GQYZ commented 11 months ago

Thank you for fixing the bug. I reran 6599cb6ba9f4b58a7cb94d6e on Llama and GPT3.5 and they both pass:

{"utility":0.447,"threshold":0.398,"passed":true,"additional_info":{"avg_share_of_failed_queries":0.0,"sample_errors":[]}}

{"utility":0.681,"threshold":0.483,"passed":true,"additional_info":{"avg_share_of_failed_queries":0.026,"sample_errors":["HTTPException: Status code: 500, Detail: OpenAI API error: A timeout error occurred. The function call took longer than 60 second(s).. Note that your budget has NOT been consumed.","HTTPException: Status code: 500, Detail: OpenAI API error: A timeout error occurred. The function call took longer than 60 second(s).. Note that your budget has NOT been consumed.","HTTPException: Status code: 500, Detail: ('OpenAI API error: A timeout error occurred. The function call took longer than 60 second(s).. If you have a team budget,note that your team budget has NOT been consumed.',)","HTTPException: Status code: 500, Detail: OpenAI API error: A timeout error occurred. The function call took longer than 60 second(s).. Note that your budget has NOT been consumed."]}}

Would it be possible to increase our team's budget, since we burned a lot trying to resolve this issue?

dedeswim commented 11 months ago

Hi, we gave all teams an additional $3 for GPT-3.5 and $5 for Llama 2. We hope that's enough for you; otherwise feel free to let us know.

ethz-spylab / satml-llm-ctf

TogetherAIException during evaluate-utility #24