ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Raises InternalServerError for long text generation #9215

Closed · ujjawal-ti closed this issue 2 months ago

ujjawal-ti commented 2 months ago

What happened?

I'm trying to generate longer texts using a Llama3-70B 8-bit quantized model hosted on an A100 server (80 GB GPU). Token generation is so slow that I'm getting an InternalServerError caused by a 504: Gateway time-out. Generation works fine for shorter outputs (< 1k tokens). I observed the following after checking the server logs.

I'm using the default settings with the llama.cpp:full-cuda Docker image. Any idea how to resolve the issue?
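A common mitigation for gateway time-outs on long generations is to stream the response, so the connection carries traffic continuously instead of idling until the full completion is ready (this helps when the proxy enforces an idle/read timeout rather than a hard total-request limit). Below is a minimal sketch of a streaming variant of the query_llama3_70b call from the traceback; the api_key and base_url values are placeholders for the actual deployment:

import openai

# Sketch: stream the completion so tokens arrive as they are decoded,
# keeping the proxied connection active during slow 70B Q8 generation.
# api_key and base_url are placeholders for the actual deployment.
client = openai.OpenAI(api_key="no-key", base_url="https://llama.mycompany.io/v1")

stream = client.chat.completions.create(
    model="models/Meta-LLama-3-70B-Instruct-Q8_0.gguf",
    messages=[{"role": "user", "content": "Tell me something about Delhi"}],
    temperature=0.05,
    max_tokens=2048,
    stream=True,  # receive ChatCompletionChunk objects incrementally
    timeout=600,  # also raise the client-side HTTP timeout (seconds)
)

# Concatenate the streamed deltas into the full response text.
response = "".join(chunk.choices[0].delta.content or "" for chunk in stream)
print(response)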

Name and Version

I'm using the default settings with the llama.cpp:full-cuda Docker image.

What operating system are you seeing the problem on?

Linux

Relevant log output

---------------------------------------------------------------------------
InternalServerError                       Traceback (most recent call last)
Cell In[42], line 1
----> 1 response=query_llama3_70b('Tell me something about Delhi')
      2 response

Cell In[30], line 13, in query_llama3_70b(query_str, sys_prompt)
     11 system_content = sys_prompt
     12 client = openai.OpenAI(api_key = openai.api_key, base_url=openai.api_base)
---> 13 chat_completion = client.chat.completions.create(
     14     model="models/Meta-LLama-3-70B-Instruct-Q8_0.gguf",
     15     messages=[
     16         {"role": "system", "content": system_content},
     17         {"role": "user", "content": query_str},
     18     ],
     19     temperature=0.05,
     20     max_tokens=2048,
     21 )
     22 response = chat_completion.choices[0].message.content
     23 return response

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_utils/_utils.py:275, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
    273             msg = f"Missing required argument: {quote(missing[0])}"
    274     raise TypeError(msg)
--> 275 return func(*args, **kwargs)

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/resources/chat/completions.py:667, in Completions.create(self, messages, model, frequency_penalty, function_call, functions, logit_bias, logprobs, max_tokens, n, presence_penalty, response_format, seed, stop, stream, temperature, tool_choice, tools, top_logprobs, top_p, user, extra_headers, extra_query, extra_body, timeout)
    615 @required_args(["messages", "model"], ["messages", "model", "stream"])
    616 def create(
    617     self,
   (...)
    665     timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
    666 ) -> ChatCompletion | Stream[ChatCompletionChunk]:
--> 667     return self._post(
    668         "/chat/completions",
    669         body=maybe_transform(
    670             {
    671                 "messages": messages,
    672                 "model": model,
    673                 "frequency_penalty": frequency_penalty,
    674                 "function_call": function_call,
    675                 "functions": functions,
    676                 "logit_bias": logit_bias,
    677                 "logprobs": logprobs,
    678                 "max_tokens": max_tokens,
    679                 "n": n,
    680                 "presence_penalty": presence_penalty,
    681                 "response_format": response_format,
    682                 "seed": seed,
    683                 "stop": stop,
    684                 "stream": stream,
    685                 "temperature": temperature,
    686                 "tool_choice": tool_choice,
    687                 "tools": tools,
    688                 "top_logprobs": top_logprobs,
    689                 "top_p": top_p,
    690                 "user": user,
    691             },
    692             completion_create_params.CompletionCreateParams,
    693         ),
    694         options=make_request_options(
    695             extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
    696         ),
    697         cast_to=ChatCompletion,
    698         stream=stream or False,
    699         stream_cls=Stream[ChatCompletionChunk],
    700     )

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:1208, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1194 def post(
   1195     self,
   1196     path: str,
   (...)
   1203     stream_cls: type[_StreamT] | None = None,
   1204 ) -> ResponseT | _StreamT:
   1205     opts = FinalRequestOptions.construct(
   1206         method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1207     )
-> 1208     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:897, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
    888 def request(
    889     self,
    890     cast_to: Type[ResponseT],
   (...)
    895     stream_cls: type[_StreamT] | None = None,
    896 ) -> ResponseT | _StreamT:
--> 897     return self._request(
    898         cast_to=cast_to,
    899         options=options,
    900         stream=stream,
    901         stream_cls=stream_cls,
    902         remaining_retries=remaining_retries,
    903     )

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:973, in SyncAPIClient._request(self, cast_to, options, remaining_retries, stream, stream_cls)
    971 if retries > 0 and self._should_retry(err.response):
    972     err.response.close()
--> 973     return self._retry_request(
    974         options,
    975         cast_to,
    976         retries,
    977         err.response.headers,
    978         stream=stream,
    979         stream_cls=stream_cls,
    980     )
    982 # If the response is streamed then we need to explicitly read the response
    983 # to completion before attempting to access the response text.
    984 if not err.response.is_closed:

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:1021, in SyncAPIClient._retry_request(self, options, cast_to, remaining_retries, response_headers, stream, stream_cls)
   1017 # In a synchronous context we are blocking the entire thread. Up to the library user to run the client in a
   1018 # different thread if necessary.
   1019 time.sleep(timeout)
-> 1021 return self._request(
   1022     options=options,
   1023     cast_to=cast_to,
   1024     remaining_retries=remaining,
   1025     stream=stream,
   1026     stream_cls=stream_cls,
   1027 )

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:973, in SyncAPIClient._request(self, cast_to, options, remaining_retries, stream, stream_cls)
    971 if retries > 0 and self._should_retry(err.response):
    972     err.response.close()
--> 973     return self._retry_request(
    974         options,
    975         cast_to,
    976         retries,
    977         err.response.headers,
    978         stream=stream,
    979         stream_cls=stream_cls,
    980     )
    982 # If the response is streamed then we need to explicitly read the response
    983 # to completion before attempting to access the response text.
    984 if not err.response.is_closed:

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:1021, in SyncAPIClient._retry_request(self, options, cast_to, remaining_retries, response_headers, stream, stream_cls)
   1017 # In a synchronous context we are blocking the entire thread. Up to the library user to run the client in a
   1018 # different thread if necessary.
   1019 time.sleep(timeout)
-> 1021 return self._request(
   1022     options=options,
   1023     cast_to=cast_to,
   1024     remaining_retries=remaining,
   1025     stream=stream,
   1026     stream_cls=stream_cls,
   1027 )

File /home/shared/.miniconda3/envs/py10cuda117/lib/python3.10/site-packages/openai/_base_client.py:988, in SyncAPIClient._request(self, cast_to, options, remaining_retries, stream, stream_cls)
    985         err.response.read()
    987     log.debug("Re-raising status error")
--> 988     raise self._make_status_error_from_response(err.response) from None
    990 return self._process_response(
    991     cast_to=cast_to,
    992     options=options,
   (...)
    995     stream_cls=stream_cls,
    996 )

InternalServerError: <!DOCTYPE html> ... (Cloudflare 504 error page, trimmed to the relevant details)

<title>llama.mycompany.io | 504: Gateway time-out</title>
Gateway time-out (Error code 504)
2024-06-21 10:40:21 UTC
Browser: Working | Cloudflare (Mumbai): Working | Host (llama.playment.io): Error
What happened? The web server reported a gateway time-out error.
What can I do? Please try again in a few minutes.
Cloudflare Ray ID: 89735fcf8fc63c22
ngxson commented 2 months ago

> It's too slow in token generation that I'm getting InternalServerError due to 504: Gateway time-out.

A gateway time-out is not a problem with llama.cpp; it's a problem with your reverse proxy.

You can try another runtime such as vLLM to see whether you get the same problem.
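To confirm where the time-out originates, one quick diagnostic is to point the client directly at the llama.cpp server, bypassing Cloudflare and the reverse proxy. A sketch, assuming the server's OpenAI-compatible endpoint on its default port 8080 (host and key here are placeholders):

import openai

# Hypothetical direct connection to the llama.cpp server; if long
# generations succeed here, the 504 comes from the proxy's timeout,
# not from llama.cpp itself.
direct = openai.OpenAI(api_key="no-key", base_url="http://localhost:8080/v1")

chat = direct.chat.completions.create(
    model="models/Meta-LLama-3-70B-Instruct-Q8_0.gguf",
    messages=[{"role": "user", "content": "Tell me something about Delhi"}],
    max_tokens=2048,
    timeout=600,  # generous client-side timeout for slow 70B Q8 decoding
)
print(chat.choices[0].message.content)

If long generations complete this way, the fix is to raise the proxy's upstream read timeout (for nginx, the proxy_read_timeout directive) rather than changing anything in llama.cpp.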

ujjawal-ti commented 2 months ago

Thanks @ngxson for pointing it out. It is resolved now.