abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

How to improve GPU utilization #1674

Open xiangxinhello opened 3 months ago

xiangxinhello commented 3 months ago

I've noticed that the GPU utilization is very low during model inference, peaking at only about 80%, but I want to increase it to 99%. How can I adjust the parameters?

    +-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:8A:00.0 Off |                    0 |
    | N/A   66C    P0   205W / 250W | 14807MiB /  40960MiB |      78%     Default |
    +-------------------------------+----------------------+----------------------+

    import multiprocessing

    N_THREADS = multiprocessing.cpu_count()
    self.runner = Llama(
        model_path=self.model_name,
        n_gpu_layers=-1,            # offload all layers to the GPU
        chat_format=self.generating_args["chat_format"],
        tokenizer=self.llama_tokenizer,
        flash_attn=True,
        verbose=False,
        n_ctx=1024,
        n_threads=N_THREADS // 2,   # threads for single-token generation
        n_threads_batch=N_THREADS,  # threads for prompt/batch processing
    )
    x = self.runner.create_chat_completion(  # was `runner`, which is undefined here
        messages=messages,
        top_p=0.0,
        top_k=1,                    # top_k=1 makes decoding effectively greedy
        temperature=1,
        max_tokens=512,
        seed=1337,
    )

Originally posted by @xiangxinhello in https://github.com/abetlen/llama-cpp-python/issues/1669#issuecomment-2277577719

ayttop commented 3 months ago

How do I run llama-cpp-python on an Intel GPU?