PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Installation]: ValueError: 17 is not a valid GGMLQuantizationType #448

Closed. Abulhanan closed this issue 2 weeks ago.

Abulhanan commented 2 weeks ago

Your current environment

The output of `python env.py`

How did you install Aphrodite?

When using pre-conversion, I ran into the following error: `ValueError: 17 is not a valid GGMLQuantizationType`.

But when using the automatic conversion in the engine itself, the error doesn't show up and it converts; I'm just low on memory with auto conversion.
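For context on the error itself: `GGMLQuantizationType` is an `IntEnum` in the `gguf` Python package, and in recent llama.cpp releases value 17 corresponds to the `IQ2_XS` quant type. Here is a minimal sketch of the failure mode, assuming (as the discussion below suggests) that the code parsing the file ships an older copy of the enum that predates the IQ-series types; the truncated enum is illustrative, not the actual package source:

```python
from enum import IntEnum

# Illustrative, truncated stand-in for an older GGMLQuantizationType enum
# that stops before the IQ-series quant types were added.
class GGMLQuantizationType(IntEnum):
    F32 = 0
    F16 = 1
    Q4_0 = 2
    # ... values 3 through 14 elided ...
    Q8_K = 15

GGMLQuantizationType(17)  # ValueError: 17 is not a valid GGMLQuantizationType
```

So a GGUF file that uses a newer quant type fails the enum lookup when read by older code.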

Abulhanan commented 2 weeks ago

Aphrodite version: v0.5.1

sgsdxzy commented 2 weeks ago

Can you provide a link to your GGUF file? Also, 0.5.1 is very old; please try 0.5.2 (download the .whl and install via pip), or build from main/dev.

Abulhanan commented 2 weeks ago

wait

Abulhanan commented 2 weeks ago

```python
#@title v-- Run this cell to start the engine.
#@markdown The free plan on Google Colab only supports up to 13B (quantized).
#@markdown You can enter a custom model as well, in addition to the default ones. Supported model types are:
#@markdown ****
Model = "/kaggle/" #@param ["Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2", "TheBloke/UNA-TheBeagle-7B-v1-GPTQ", "LoneStriker/Fimbulvetr-11B-v2-GPTQ", "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ", "TheBloke/MythoMax-L2-13B-GPTQ", "TheBloke/wizard-mega-13B-GPTQ", "Austism/chronos-hermes-13b-v2-GPTQ", "KoboldAI/OPT-6B-nerys-v2", "NousResearch/Nous-Hermes-Llama2-13b"] {allow-input: true}

#@markdown The specific model branch to download. Useful for exl2 models, where every bpw is on a separate branch.
Revision = "main" #@param [] {allow-input: true}

#@markdown Should be auto-recognized for most models. If you receive a KeyError, or unexpectedly run out of memory for small models, use this to specify the correct quant format. Most exl2 models have this issue, so configure this for exl2 models.
Quantization = "gguf" #@param ["None", "exl2", "gptq", "awq", "aqlm", "quip", "marlin"]

#@markdown Adjust this and the Context Length slider if you're running into OOM (CUDA Out Of Memory) issues!
GPU_Memory_Utilization = 1 #@param {type:"slider", min:0, max:1, step:0.01}

#@markdown The free Colab GPU may not have enough memory to accommodate more than 8192 Context Length for most models.
Context_Length = 16000 #@param {type:"slider", min:1024, max:32768, step:1024}

#@markdown Disable CUDA graphs. This will reduce memory usage. Uncheck if your model is small; keep it checked for anything above 11B.
enforce_eager_mode = True #@param {type:"boolean"}

#@markdown Check this to launch a Kobold-compatible API in addition to the OpenAI one. Keep in mind that the API key does not protect Kobold routes.
launch_kobold_api = False #@param {type:"boolean"}

#@markdown [OPTIONAL] Enter an API key to secure your API.
OpenAI_API_Key = "" #@param [] {allow-input: true}

FP8_KV_Cache = True #@param {type:"boolean"}
```

RAY

```python
!pip install -U "ray[all]"
!pip install grpcio==1.62.1
```

Aphrodite Engine

```python
%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine==0.5.1 > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."
```

Ngrok

```python
!pip3 install pyngrok
!echo "Creating a Ngrok URL..."
from pyngrok import ngrok
!ngrok authtoken 2Xek0NdHusUxivPazybUushIkyx_6gf88UA2EDx34b2RKw8r1
tunnel = ngrok.connect(2242)
!echo "============================================================"
!echo "Please copy this URL:"
print(tunnel.public_url)
!echo "============================================================"
```

```python
model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
enforce_eager = enforce_eager_mode
kobold = launch_kobold_api
revision = Revision
fp8_kv = FP8_KV_Cache

command = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--dtype", "float16",
    "--model", model,
    "--host", "127.0.0.1",
    "--max-log-len", "0",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--tensor-parallel-size", "2",
    # flag and value kept as separate argv entries
    "--tokenizer", "philschmid/meta-llama-3-tokenizer",
]

if kobold:
    command.append("--launch-kobold-api")

if quant != "None":
    command.extend(["-q", quant])

if enforce_eager:
    command.append("--enforce-eager")

if fp8_kv:
    command.extend(["--kv-cache-dtype", "fp8_e5m2"])

if api_key != "":
    command.extend(["--api-keys", api_key])

!{" ".join(command)}
```
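One caveat about launching with `!{" ".join(command)}`: joining argv with spaces breaks if any value (such as the API key) contains a space or shell metacharacter. A hedged alternative, not part of the original notebook, is to launch via `subprocess` so no shell quoting is needed:

```python
import subprocess

# Pass the argument list directly; each element is handed to the process
# as-is, so values containing spaces need no shell escaping.
subprocess.run(command, check=True)
```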

Abulhanan commented 2 weeks ago

That's the code.

Abulhanan commented 2 weeks ago

```python
!git clone https://github.com/PygmalionAI/aphrodite-engine.git
%cd /kaggle/working/aphrodite-engine/examples/
!python gguf_to_torch.py --input /kaggle/working/Meta-Llama-3-70B-Instruct.IQ1_S.gguf --output /kaggle/
```

The pre-conversion code.

Abulhanan commented 2 weeks ago

```python
%cd /kaggle/
!wget -N https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF/resolve/main/Meta-Llama-3-70B-Instruct.IQ2_XS.gguf
```

The code I use to download models.

Abulhanan commented 2 weeks ago

It's a modified version of the Colab code, but I use Kaggle, which has 2 T4 GPUs and 30 GB of RAM.

Abulhanan commented 2 weeks ago

If I install it via pip, how do I access the pre-conversion script?

sgsdxzy commented 2 weeks ago

You are using a mismatched Aphrodite package and conversion script. You installed v0.5.1 but used the conversion script from main. You need to either git checkout the specific tag v0.5.1, or build from source. And I am not sure v0.5.1 supports Llama 3; you probably need a newer version to support the new tokenizer.
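Concretely, the tag you check out needs to match what pip reports as installed. Here is a minimal sketch of verifying that from a notebook shell, assuming the repo was cloned to /kaggle/working as in the earlier comment:

```python
# Show the installed package version, e.g. "Version: 0.5.1".
!pip show aphrodite-engine | grep Version

# Check out the matching tag in the clone before running the conversion script.
%cd /kaggle/working/aphrodite-engine
!git checkout v0.5.1
```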

Abulhanan commented 2 weeks ago

Okay, I'm a bit new, but can you help me a little with installing it?

sgsdxzy commented 2 weeks ago

Since you are using Kaggle, I suppose you can't build from source, so you would have to wait for the 0.5.3 release to run Llama 3 models.

Abulhanan commented 2 weeks ago

I can run Llama 3 easily.

Abulhanan commented 2 weeks ago

The 8B model, easily.

Abulhanan commented 2 weeks ago

But with quantization, I get stuck on any model.

Abulhanan commented 2 weeks ago

```python
%cd /kaggle/
!wget -N https://github.com/PygmalionAI/aphrodite-engine/releases/download/v0.5.2/aphrodite_engine-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl
!pip install /kaggle/aphrodite_engine-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl
```

I used this script and it's now installing via pip, but how do I access the conversion script?

Abulhanan commented 2 weeks ago

I can also build from source by git cloning.

sgsdxzy commented 2 weeks ago

`git checkout <tag>`. For example, `git checkout v0.5.2` to get the conversion script for 0.5.2.
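Applied to this thread, a hedged sketch that pairs the checkout with the conversion run (clone path and GGUF file reused from the earlier comments):

```python
%cd /kaggle/working/aphrodite-engine
!git checkout v0.5.2  # match the installed 0.5.2 wheel

%cd examples
!python gguf_to_torch.py --input /kaggle/working/Meta-Llama-3-70B-Instruct.IQ2_XS.gguf --output /kaggle/
```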

Abulhanan commented 2 weeks ago

Okay.

Abulhanan commented 2 weeks ago

After using the pip installation?

sgsdxzy commented 2 weeks ago

First, please keep this GitHub issue clean and precise; GitHub issues are not chat rooms, so please group your responses into a single block if convenient. The source code and the installed pip package are two separate things. If you have installed the v0.5.2 pip package, clone the Aphrodite git repo, check out the corresponding code with `git checkout v0.5.2`, then run the conversion script. That assumes you have a shell environment. I don't know how Kaggle works; if you have questions about Kaggle or about using git checkout/pip in Kaggle, please ask in the corresponding support channels.
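Putting the whole thread together, a hedged end-to-end sketch for a Kaggle-style notebook (the wheel URL is taken from the comment above; paths are illustrative):

```python
# 1. Install the 0.5.2 wheel.
%cd /kaggle/
!wget -N https://github.com/PygmalionAI/aphrodite-engine/releases/download/v0.5.2/aphrodite_engine-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl
!pip install aphrodite_engine-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl

# 2. Clone the repo and check out the tag matching the installed wheel.
!git clone https://github.com/PygmalionAI/aphrodite-engine.git
%cd aphrodite-engine
!git checkout v0.5.2

# 3. Run the conversion script from that matching checkout.
%cd examples
!python gguf_to_torch.py --input /kaggle/working/Meta-Llama-3-70B-Instruct.IQ2_XS.gguf --output /kaggle/
```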