c0sogi / llama-api

An OpenAI-like LLaMA inference API
MIT License

Set number of cores being used on cpu? #16

Closed Dougie777 closed 12 months ago

Dougie777 commented 12 months ago

I am on a box with 19 physical cores, but it looks like only 9 or 10 are being used. Is there a way to specify the number of cores to use?

c0sogi commented 12 months ago

I assume that you are running a llama.cpp model. In that case, passing n_threads=19 to LlamaCppModel should enable all of your cores.

Check out the LlamaCppModel definitions.

from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class LlamaCppModel(BaseLLMModel):  # BaseLLMModel is defined elsewhere in this repo
    """Llama.cpp model that can be loaded from a local path."""

    n_parts: int = field(
        default=-1,
        metadata={
            "description": "Number of parts to split the model into. If -1, "
            "the number of parts is automatically determined."
        },
    )
    n_gpu_layers: int = field(
        default=30,
        metadata={
            "description": "Number of layers to keep on the GPU. "
            "If 0, no layers are offloaded and all layers run on the CPU."
        },
    )
    seed: int = field(
        default=-1,
        metadata={"description": "Seed. If -1, a random seed is used."},
    )
    f16_kv: bool = field(
        default=True,
        metadata={"description": "Use half-precision for key/value cache."},
    )
    logits_all: bool = field(
        default=False,
        metadata={
            "description": "Return logits for all tokens, "
            "not just the last token."
        },
    )
    vocab_only: bool = field(
        default=False,
        metadata={"description": "Only load the vocabulary, no weights."},
    )
    use_mlock: bool = field(
        default=True,
        metadata={"description": "Force system to keep model in RAM."},
    )
    n_batch: int = field(
        default=512,
        metadata={
            "description": "Number of tokens to process in parallel. "
            "Should be a number between 1 and n_ctx."
        },
    )
    last_n_tokens_size: int = field(
        default=64,
        metadata={
            "description": "The number of tokens to look back "
            "when applying the repeat_penalty."
        },
    )
    use_mmap: bool = True  # Whether to use memory mapping for the model.
    cache: bool = False  # Whether to enable the cache (see cache_type and cache_size).
    verbose: bool = True  # Whether to echo the prompt.
    echo: bool = True  # Alias of verbose, kept for compatibility.
    lora_base: Optional[str] = None  # The path to the Llama LoRA base model.
    lora_path: Optional[str] = None  # The path to the Llama LoRA. If None, no LoRA is loaded.
    cache_type: Optional[Literal["disk", "ram"]] = "ram"
    cache_size: Optional[int] = 2 << 30  # Cache size in bytes (2 GiB). Only used if cache is True.
    n_threads: Optional[int] = field(
        default=None,
        metadata={
            "description": "Number of threads to use. "
            "If None, the number of threads is automatically determined."
        },
    )
    low_vram: bool = False  # Whether to use less VRAM.
    embedding: bool = False  # Whether to use the embedding layer.

    # Refer: https://github.com/ggerganov/llama.cpp/pull/2054
    rope_freq_base: float = 10000.0  # I use 26000 for n_ctx=4096.
    rope_freq_scale: float = 1.0  # Generally, 2048 / n_ctx.
    n_gqa: Optional[int] = None  # TEMPORARY: Set to 8 for Llama2 70B
    rms_norm_eps: Optional[float] = None  # TEMPORARY
    mul_mat_q: Optional[bool] = None  # TEMPORARY
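
For reference, here is a minimal sketch of a model definition with an explicit thread count. The import path, the variable name my_cpu_model, and the model_path value are assumptions for illustration, so adapt them to however models are declared in your setup.

# Sketch: pin llama.cpp to all 19 physical cores.
# Import path and model_path are assumptions; adjust them to your install.
from llama_api.schemas.models import LlamaCppModel

my_cpu_model = LlamaCppModel(
    model_path="ggml-model-q4_0.bin",  # hypothetical local model file
    n_threads=19,  # match your physical core count, e.g. psutil.cpu_count(logical=False)
)

If n_threads is left as None, llama.cpp falls back to an automatic choice (typically around half of the detected CPUs), which would explain the 9 or 10 busy cores you saw.
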
Dougie777 commented 12 months ago

Thanks that worked :)