abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

How to improve GPU utilization #1674

Open xiangxinhello opened 3 months ago

xiangxinhello commented 3 months ago

I've noticed that the GPU utilization is very low during model inference, peaking at only about 80%, but I want to increase it to 99%. How can I adjust the parameters?

    +-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:8A:00.0 Off |                    0 |
    | N/A   66C    P0   205W / 250W | 14807MiB /  40960MiB |      78%     Default |
    +-------------------------------+----------------------+----------------------+

    import multiprocessing

    N_THREADS = multiprocessing.cpu_count()
    self.runner = Llama(
        model_path=self.model_name,
        n_gpu_layers=-1,            # offload all layers to the GPU
        chat_format=self.generating_args["chat_format"],
        tokenizer=self.llama_tokenizer,
        flash_attn=True,
        verbose=False,
        n_ctx=1024,
        n_threads=N_THREADS // 2,   # threads for single-token generation
        n_threads_batch=N_THREADS,  # threads for prompt/batch processing
    )
    x = self.runner.create_chat_completion(  # was `runner`, which is undefined here
        messages=messages,
        top_p=0.0,
        top_k=1,                    # top_k=1 makes decoding effectively greedy
        temperature=1,
        max_tokens=512,
        seed=1337,
    )

Originally posted by @xiangxinhello in https://github.com/abetlen/llama-cpp-python/issues/1669#issuecomment-2277577719

ayttop commented 3 months ago

How do I run llama-cpp-python on an Intel GPU?