Josh-XT / AGiXT

AGiXT is a dynamic AI Agent Automation Platform that seamlessly orchestrates instruction management and complex task execution across diverse AI providers. Combining adaptive memory, smart features, and a versatile plugin system, AGiXT delivers efficient and comprehensive AI solutions.
https://AGiXT.com
MIT License

AGiXT - llama.cpp - is not supporting ggmlv2 models or q5_1 (5 bit)... I think #405

Closed mirek190 closed 1 year ago

mirek190 commented 1 year ago

Description

The AGiXT llama.cpp provider does not support ggmlv2 models or q5_1 (5-bit) quantization. Those ggmlv2 models are already obsolete anyway, because llama.cpp has moved on to ggmlv3 models...

Another question: can I pass llama.cpp parameters somehow? For instance -ngl (GPU offload) or cuBLAS (GPU-accelerated prompt processing as well)?

llama.cpp: loading model from models/wizardLM-7B-uncensored-ggmlv2-q5_1.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000002; is this really a GGML file?
llama_init_from_file: failed to load model
INFO: 127.0.0.1:4770 - "GET /api/agent/Wizard/command HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\uvicorn\protocols\http\httptools_impl.py", line 435, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\uvicorn\middleware\proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\fastapi\applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\cors.py", line 91, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\cors.py", line 146, in simple_response
    await self.app(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
    raise e
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\fastapi\routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\fastapi\routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "F:\LLAMA\AGIXT\AGiXT\src\agixt\app.py", line 221, in get_commands
    commands = Commands(agent_name)
  File "F:\LLAMA\AGIXT\AGiXT\src\agixt\Commands.py", line 13, in __init__
    self.CFG = Agent(self.agent_name)
  File "F:\LLAMA\AGIXT\AGiXT\src\agixt\Config\Agent.py", line 28, in __init__
    self.PROVIDER = Provider(self.AI_PROVIDER, self.PROVIDER_SETTINGS)
  File "F:\LLAMA\AGIXT\AGiXT\src\agixt\provider\__init__.py", line 24, in __init__
    self.instance = provider_class(**kwargs)
  File "F:\LLAMA\AGIXT\AGiXT\src\agixt\provider\llamacpp.py", line 30, in __init__
    self.model = Llama(model_path=MODEL_PATH, n_ctx=self.MAX_TOKENS * 2)
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 159, in __init__
    assert self.ctx is not None
AssertionError

Steps to Reproduce the Bug

Load a ggmlv2 (q5_1) model with the llamacpp provider.

Expected Behavior

The model loads and works.

Actual Behavior

An error (see traceback above).

Additional Context / Screenshots

No response

Operating System

Python Version

Environment Type - Connection

Environment Type - Container

Acknowledgements

localagi commented 1 year ago

@Josh-XT I expect this is low priority - I don't know if we want to support all external libs. I'm thinking about splitting these out into an extra package (a subpackage of agixt) to separate responsibilities.

@mirek190 As a quick fix, try running text-generation-webui (oobabooga), load the model there, and connect via the oobabooga provider in AGiXT.

mirek190 commented 1 year ago

Oobabooga uses the GPU for models, so you will not be able to use big models. I want to use my CPU for this (llama.cpp is the most advanced and really fast, especially with ggmlv3 models), because I can run much bigger models like 30B 5-bit or even 65B 5-bit, which are far more capable at understanding and reasoning than any 7B or 13B model. For instance, with an RTX 3080 and llama.cpp you can run 65B ggmlv3 q4 models with more than half the layers on the GPU and the rest on the CPU and get 6 tokens/s!

A 65B q4 model beats any of them... 7B, 13B or 30B are not even close. It is very close to ChatGPT 3.5 in reasoning, and sometimes even beats it, especially gpt4-alpaca-lora_mlp-65B.ggmlv3.q5_1.bin, which is closer to GPT-4.

Do you understand how big llama.cpp's progress is compared to other projects? :) That is why I want it supported in this project.

Josh-XT commented 1 year ago

Not a low priority - just trying to get through bug fixes currently. I just removed the version cap for llama-cpp-python so that the latest can be used.

pip install llama-cpp-python --upgrade

Please let me know if there are additional flags or features that I should make available for this provider.

mirek190 commented 1 year ago

--mlock
--threads
--batch_size
--n_predict
--top_k
--top_p
--temp
--repeat_penalty
--ctx_size
--n-gpu-layers

Those are the most important ones (a rough mapping to llama-cpp-python is sketched below).

And it should have cuBLAS support; that speeds up prompt processing 3-4x for me.
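
Most of those flags already have counterparts in llama-cpp-python's constructor and completion call. A rough sketch of the mapping (illustrative only, not the AGiXT provider code; parameter names should be checked against the installed llama_cpp version):

    # Rough mapping of the llama.cpp CLI flags listed above onto llama-cpp-python.
    # Illustrative only; verify names against the installed llama_cpp version.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/wizardLM-7B-uncensored-ggmlv2-q5_1.bin",
        n_ctx=2048,        # --ctx_size
        n_threads=8,       # --threads
        n_batch=512,       # --batch_size
        n_gpu_layers=40,   # --n-gpu-layers (only effective with a cuBLAS/CLBlast build)
        use_mlock=True,    # --mlock
    )

    output = llm(
        "Q: Name the planets in the solar system. A:",
        max_tokens=256,        # --n_predict
        temperature=0.7,       # --temp
        top_k=40,              # --top_k
        top_p=0.95,            # --top_p
        repeat_penalty=1.1,    # --repeat_penalty
    )
    print(output["choices"][0]["text"])

For the cuBLAS speedup, llama-cpp-python generally has to be reinstalled with its cuBLAS flag enabled (at the time, roughly CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir); with a CPU-only build, n_gpu_layers has no effect.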

Josh-XT commented 1 year ago

Sorry, I haven't been keeping up with the llama.cpp changes, but I have heard they're amazing! I mostly use OpenAI for all of my testing currently, just for its speed and reliability. I fully intend to switch to local models once I can run the 8k+ context models locally (which I should be able to do now with llama.cpp; I've just been busy).

Here is the module we use:

https://github.com/abetlen/llama-cpp-python

If you can confirm the features are available there, I can add anything necessary. I think the GPU layers option is new and is in the llama.cpp Python module; I can add that as an agent setting.

mirek190 commented 1 year ago

I will check that module later and let you know 👍

Thanks for your hard work.

Josh-XT commented 1 year ago

Merging #431 to hopefully resolve this. Please try it out and let me know how it goes!

mirek190 commented 1 year ago
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'
Exception ignored in: <function Llama.__del__ at 0x00000183015FB250>
Traceback (most recent call last):
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 1219, in __del__
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'
Exception ignored in: <function Llama.__del__ at 0x00000183015FB250>
Traceback (most recent call last):
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 1219, in __del__
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'
2023-05-21 18:07:20.519 Uncaught app exception
Traceback (most recent call last):
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "F:\LLAMA\AGiXT\agixt\pages\Chat.py", line 82, in <module>
    agent = AGiXT(agent_name)
  File "F:\LLAMA\AGiXT\agixt\AGiXT.py", line 16, in __init__
    self.agent = Agent(self.agent_name)
  File "F:\LLAMA\AGiXT\agixt\Agent.py", line 28, in __init__
    self.PROVIDER = Provider(self.AI_PROVIDER, self.PROVIDER_SETTINGS)
  File "F:\LLAMA\AGiXT\agixt\provider\__init__.py", line 24, in __init__
    self.instance = provider_class(**kwargs)
  File "F:\LLAMA\AGiXT\agixt\provider\llamacpp.py", line 38, in __init__
    self.model = Llama(
  File "C:\Users\mirek190\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 131, in __init__
    self.params.n_gpu_layers = n_gpu_layers
TypeError: 'str' object cannot be interpreted as an integer


self.params.n_gpu_layers = n_gpu_layers           <-- must be integer 

AttributeError: 'Llama' object has no attribute 'ctx' <-- should be "n_ctx"?
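
The TypeError indicates the agent setting reaches the Llama constructor as a string. A minimal sketch of the coercion the provider would need, assuming settings are stored as strings the way they appear in the agent config later in this thread (illustrative only, not the actual AGiXT code):

    # Illustrative sketch only -- not the actual AGiXT provider code. Agent
    # settings arrive as strings, so numeric ones must be cast before being
    # passed to llama-cpp-python.
    from llama_cpp import Llama

    def build_llama(settings: dict) -> Llama:
        return Llama(
            model_path=settings["MODEL_PATH"],
            n_ctx=int(settings.get("MAX_TOKENS", "2000")),
            n_threads=int(settings.get("THREADS", "4")),
            n_batch=int(settings.get("BATCH_SIZE", "512")),
            n_gpu_layers=int(settings.get("GPU_LAYERS", "0")),
        )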

mirek190 commented 1 year ago

From llama.py

    """Load a llama.cpp model from `model_path`.

    Args:
        model_path: Path to the model.
        n_ctx: Maximum context size.
        n_parts: Number of parts to split the model into. If -1, the number of parts is automatically determined.
        seed: Random seed. 0 for random.
        f16_kv: Use half-precision for key/value cache.
        logits_all: Return logits for all tokens, not just the last token.
        vocab_only: Only load the vocabulary no weights.
        use_mmap: Use mmap if possible.
        use_mlock: Force the system to keep the model in RAM.
        embedding: Embedding mode only.
        n_threads: Number of threads to use. If None, the number of threads is automatically determined.
        n_batch: Maximum number of prompt tokens to batch together when calling llama_eval.
        last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
        lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
        lora_path: Path to a LoRA file to apply to the model.
        verbose: Print verbose output to stderr.

    """

    self.verbose = verbose
    self.model_path = model_path

    self.params = llama_cpp.llama_context_default_params()
    self.params.n_ctx = n_ctx
    self.params.n_parts = n_parts
    self.params.n_gpu_layers = n_gpu_layers
    self.params.seed = seed
    self.params.f16_kv = f16_kv
    self.params.logits_all = logits_all
    self.params.vocab_only = vocab_only
    self.params.use_mmap = use_mmap if lora_path is None else False
    self.params.use_mlock = use_mlock
    self.params.embedding = embedding

mirek190 commented 1 year ago

The newest llama.cpp executable build now has an API server.

https://github.com/ggerganov/llama.cpp/releases/tag/master-7e4ea5b

https://github.com/ggerganov/llama.cpp/commit/7e4ea5beff567f53be92f75f9089e6f11fa5dabd?short_path=42ce586#diff-42ce5869652f266b01a5b5bc95f4d945db304ce54545e2d0c017886a7f1cee1a

https://github.com/ggerganov/llama.cpp/tree/master/examples/server
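
Once that server is running (started with something like ./server -m model.bin -ngl 40), it can be driven over HTTP. A hedged client sketch based on the examples/server README; endpoint and field names may differ between llama.cpp builds:

    # Minimal client sketch for llama.cpp's example server. The /completion
    # endpoint and the "prompt"/"n_predict"/"content" fields follow the
    # examples/server README and may change between versions.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": "Building a website can be done in 10 simple steps:",
            "n_predict": 128,
        },
    )
    print(resp.json()["content"])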

Josh-XT commented 1 year ago

Working on this in #446. If you have the API server running, you're welcome to try it.

Josh-XT commented 1 year ago

This was fixed.

mirek190 commented 1 year ago

I tried to use the llama.cpp server but without success... Is there any guide on how to use it here?

Josh-XT commented 1 year ago

I tried to use the llama.cpp server but without success... Is there any guide on how to use it here?

Don't use the llamacppapi one, just use the llamacpp one. The API one is still in progress; I haven't been able to run the llama.cpp server myself yet to test that one fully.

If it helps to know, these are my settings for my working Vicuna 13B agent using the llamacpp provider.

{
    "commands": {},
    "settings": {
        "provider": "llamacpp",
        "AI_MODEL": "vicuna",
        "AI_TEMPERATURE": "0.4",
        "MAX_TOKENS": "2000",
        "embedder": "default",
        "MODEL_PATH": "/home/josh/josh/Repos/ggml-vicuna-13b-1.1/ggml-vic13b-uncensored-q5_1.bin",
        "GPU_LAYERS": "40",
        "BATCH_SIZE": "512",
        "THREADS": "24",
        "STOP_SEQUENCE": "</s>",
        "SEARXNG_INSTANCE_URL": "https://searx.work",
        "HUGGINGFACE_AUDIO_TO_TEXT_MODEL": "facebook/wav2vec2-large-960h-lv60-self",
        "USE_BRIAN_TTS": "True",
        "ELEVENLABS_VOICE": "Josh",
        "SELENIUM_WEB_BROWSER": "chrome",
        "DISCORD_COMMAND_PREFIX": "/AGiXT",
        "WORKING_DIRECTORY": "./WORKSPACE",
        "WORKING_DIRECTORY_RESTRICTED": "True",
        "": ""
    }
}