PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Usage]: Higher Context Length. #486

Open Abulhanan opened 1 month ago

Abulhanan commented 1 month ago

I am using a GGUF model on Aphrodite Engine. The issue is that I want a context length of 8192, but I can only get it to load at about 4096 because I'm short on VRAM; I have only 30 GB. Is there any way I can offload it into CPU RAM?
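
For scale: KV-cache memory grows linearly with context length, so halving the per-element size (FP8 instead of FP16) roughly doubles the context that fits in the same budget. A rough sizing sketch, using assumed Llama-2-70B-style dimensions rather than numbers taken from this model:

# Back-of-the-envelope KV-cache sizing. Layer/head counts below are
# ASSUMPTIONS (Llama-2-70B-style GQA), not read from LogicLumina.
num_layers = 80      # assumption
num_kv_heads = 8     # assumption (grouped-query attention)
head_dim = 128       # assumption
bytes_per_elem = 2   # fp16 KV cache; 1 with an fp8 KV cache

def kv_cache_gib(context_len):
    # K and V tensors, per layer, per token
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

print(kv_cache_gib(4096))  # 1.25 GiB under these assumptions
print(kv_cache_gib(8192))  # 2.50 GiB under these assumptions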

The script I'm using:

Your current environment

Model = "Abdulhanan2006/LogicLumina-IQ2_M"
Revision = "main"
Quantization = "gguf"
GPU_Memory_Utilization = 1
Context_Length = 4096
launch_kobold_api = False
OpenAI_API_Key = ""
FP8_KV_Cache = False

Aphrodite Engine

%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y

!echo "Installing/Updating the Aphrodite Engine, this may take a while..."

%pip install aphrodite-engine==0.5.1 > /dev/null 2>&1

!echo "Installation successful! Starting the engine now."

RAY

!pip install -U "ray[all]"
!pip install grpcio==1.62.1

Ngrok

!pip3 install pyngrok
!echo "Creating a Ngrok URL..."
from pyngrok import ngrok
!ngrok authtoken 2gAo4R6fC0ND3YnulbgncrYrvLx_6E28TkGZZJTeT58MJ6GQY
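
Since the cell already imports pyngrok, the same tunnel can also be opened from Python instead of shelling out; a minimal sketch using the token and port from this notebook:

from pyngrok import ngrok

# Same token the `!ngrok authtoken ...` line above registers.
ngrok.set_auth_token("2gAo4R6fC0ND3YnulbgncrYrvLx_6E28TkGZZJTeT58MJ6GQY")

# Tunnel the API server port; public_url is what clients connect to.
tunnel = ngrok.connect(2242)
print(tunnel.public_url)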

Aphrodite

model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
kobold = launch_kobold_api
revision = Revision
fp8_kv = FP8_KV_Cache

command = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--dtype", "float16",
    "--model", model,
    "--host", "127.0.0.1",
    "--max-log-len", "0",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--tensor-parallel-size", "2",
    "--enable-chunked-prefill",
]

if kobold: command.append("--launch-kobold-api")

if quant != "None": command.extend(["-q", quant])

if fp8_kv: command.extend(["--kv-cache-dtype", "fp8"])  # two argv tokens, not one string

if api_key != "": command.extend(["--api-keys", api_key])

!ngrok http --domain=vertically-amazed-spider.ngrok-free.app 2242 & {" ".join(command)}
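
For the settings at the top of this notebook (kobold off, no API key, GGUF quantization), the " ".join(command) interpolation expands to roughly:

python -m aphrodite.endpoints.openai.api_server --dtype float16 --model Abdulhanan2006/LogicLumina-IQ2_M --host 127.0.0.1 --max-log-len 0 --gpu-memory-utilization 1 --max-model-len 4096 --tensor-parallel-size 2 --enable-chunked-prefill -q gguf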

AlpinDale commented 1 month ago

You can try FP8 KV cache or use chunked prefill (mutually exclusive for now).

--kv-cache-dtype fp8 | --enable-chunked-prefill
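
Applied to the notebook above: the base command already hardcodes --enable-chunked-prefill, so switching to the FP8 KV cache means setting FP8_KV_Cache = True (the if fp8_kv: branch then adds the flag) and dropping the chunked-prefill flag. A sketch of the adjustment, using the names from the script above:

# Pick ONE of the two options; they are mutually exclusive for now.
FP8_KV_Cache = True

if FP8_KV_Cache and "--enable-chunked-prefill" in command:
    command.remove("--enable-chunked-prefill")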

Abulhanan commented 1 month ago

Okay, but will they slow anything down or cause any accuracy loss?