I am using a GGUF model on Aphrodite Engine. I want a context length of 8192, but I can only get it to load with a context length of about 4096 because I'm short on VRAM: I only have 30 GB of VRAM. Is there any way I can offload it to CPU RAM?
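For a rough sense of why 8192 is harder to fit than 4096, here is a back-of-the-envelope sketch of KV-cache size per sequence. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the real values for this model (those would come from its config), and the engine's actual allocation also depends on batching and block size, so treat the numbers as illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for ONE sequence.

    The leading factor of 2 accounts for one key tensor and one value
    tensor per layer. bytes_per_elem is 2 for float16 and 1 for an
    8-bit (FP8) cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128

for ctx in (4096, 8192):
    fp16_gib = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, ctx, 2) / 2**30
    fp8_gib = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, ctx, 1) / 2**30
    print(f"ctx={ctx}: fp16 ~ {fp16_gib:.2f} GiB, fp8 ~ {fp8_gib:.2f} GiB")
```

Under these assumed dimensions, the cache grows linearly with context length, so doubling the context doubles the cache, and an 8-bit cache at 8192 tokens costs about the same as a float16 cache at 4096 tokens.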
Script I'm using
Your current environment
Model = "Abdulhanan2006/LogicLumina-IQ2_M"
Revision = "main"
Quantization = "gguf"
GPU_Memory_Utilization = 1
Context_Length = 4096
launch_kobold_api = False
OpenAI_API_Key = ""
FP8_KV_Cache = False
Aphrodite Engine
%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."
%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine==0.5.1 > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."
RAY
!pip install -U "ray[all]"
!pip install grpcio==1.62.1
Ngrok
!pip3 install pyngrok
!echo "Creating a Ngrok URL..."
from pyngrok import ngrok
!ngrok authtoken 2gAo4R6fC0ND3YnulbgncrYrvLx_6E28TkGZZJTeT58MJ6GQY
Aphrodite
model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
kobold = launch_kobold_api
revision = Revision
fp8_kv = FP8_KV_Cache
command = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--dtype", "float16",
    "--model", model,
    "--host", "127.0.0.1",
    "--max-log-len", "0",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--tensor-parallel-size", "2",
    "--enable-chunked-prefill",
]
if kobold: command.append("--launch-kobold-api")
if quant != "None": command.extend(["-q", quant])
if fp8_kv: command.extend(["--kv-cache-dtype", "fp8"])
if api_key != "": command.extend(["--api-keys", api_key])
!ngrok http --domain=vertically-amazed-spider.ngrok-free.app 2242 & {" ".join(command)}
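The command assembly above could also be wrapped in a small helper, which makes it easier to retry the launch with different context lengths or with the FP8 KV cache enabled. This is only a sketch: `build_command` is a hypothetical helper name, it uses only the flags already present in my script, and `shlex.join` is swapped in for `" ".join` so that arguments are quoted safely for the shell.

```python
import shlex


def build_command(model, context_length, gpu_memory_utilization=1,
                  quant="gguf", fp8_kv=False, api_key="", kobold=False,
                  tensor_parallel_size=2):
    """Assemble the api_server launch arguments as a list of strings.

    Hypothetical helper; every flag is copied from the script above.
    """
    command = [
        "python", "-m", "aphrodite.endpoints.openai.api_server",
        "--dtype", "float16",
        "--model", model,
        "--host", "127.0.0.1",
        "--max-log-len", "0",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(context_length),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-chunked-prefill",
    ]
    if kobold:
        command.append("--launch-kobold-api")
    if quant != "None":
        command.extend(["-q", quant])
    if fp8_kv:
        # Flag and value kept separate so each is its own argument.
        command.extend(["--kv-cache-dtype", "fp8"])
    if api_key != "":
        command.extend(["--api-keys", api_key])
    return command


# Example: the 8192-token launch I am trying to get working.
cmd = build_command("Abdulhanan2006/LogicLumina-IQ2_M", 8192, fp8_kv=True)
print(shlex.join(cmd))
```

The printed string can then be substituted for `{" ".join(command)}` in the last notebook cell.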