Open zhiyixu opened 1 year ago
Try setting the exact same parameters (temp, rand_seed, n_threads, etc.) in Python that you're setting for main.exe. The defaults for Python are slightly different.
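As a quick reference, here are the defaults that actually differ between the two front-ends, copied from the two `--help` listings quoted in this thread (the grouping into dicts is just for illustration; the keys mirror each tool's own option names):

```python
# Defaults copied from the `--help` output quoted in this thread.
MAIN_EXE_DEFAULTS = {
    "ctx_size": 512,       # -c, --ctx-size
    "threads": 8,          # -t, --threads
    "temp": 0.8,           # --temp
    "repeat_last_n": 64,   # --repeat-last-n
}
PYTHON_SERVER_DEFAULTS = {
    "n_ctx": 2048,             # --n_ctx
    "n_threads": 16,           # --n_threads (machine-dependent)
    "last_n_tokens_size": 64,  # --last_n_tokens_size
}

# The context size and thread count do NOT match out of the box:
assert MAIN_EXE_DEFAULTS["ctx_size"] != PYTHON_SERVER_DEFAULTS["n_ctx"]
assert MAIN_EXE_DEFAULTS["threads"] != PYTHON_SERVER_DEFAULTS["n_threads"]
```

So unless you pass these explicitly on both sides, the two runs are not comparable.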
$ python -m llama_cpp.server --help
usage: __main__.py [-h] [--model MODEL] [--model_alias MODEL_ALIAS] [--n_ctx N_CTX]
[--n_gpu_layers N_GPU_LAYERS] [--n_batch N_BATCH]
[--n_threads N_THREADS] [--f16_kv F16_KV] [--use_mlock USE_MLOCK]
[--use_mmap USE_MMAP] [--embedding EMBEDDING]
[--last_n_tokens_size LAST_N_TOKENS_SIZE] [--logits_all LOGITS_ALL]
[--cache CACHE] [--cache_size CACHE_SIZE] [--vocab_only VOCAB_ONLY]
[--verbose VERBOSE]
options:
-h, --help show this help message and exit
--model MODEL The path to the model to use for generating completions.
--model_alias MODEL_ALIAS
The alias of the model to use for generating completions.
--n_ctx N_CTX The context size. (default: 2048)
--n_gpu_layers N_GPU_LAYERS
The number of layers to put on the GPU. The rest will be on the
CPU. (default: 0)
--n_batch N_BATCH The batch size to use per eval. (default: 512)
--n_threads N_THREADS
The number of threads to use. (default: 16)
--f16_kv F16_KV Whether to use f16 key/value. (default: True)
--use_mlock USE_MLOCK
Use mlock. (default: True)
--use_mmap USE_MMAP Use mmap. (default: True)
--embedding EMBEDDING
Whether to use embeddings. (default: True)
--last_n_tokens_size LAST_N_TOKENS_SIZE
Last n tokens to keep for repeat penalty calculation. (default:
64)
--logits_all LOGITS_ALL
Whether to return logits. (default: True)
--cache CACHE Use a cache to reduce processing times for evaluated prompts.
(default: False)
--cache_size CACHE_SIZE
The size of the cache in bytes. Only used if cache is True.
(default: 2147483648)
--vocab_only VOCAB_ONLY
Whether to only return the vocabulary. (default: False)
--verbose VERBOSE Whether to print debug information. (default: True)
Still the same output; the main code is shown below:
llm = Llama(model_path=model_path, n_threads=self._n_thread, n_ctx=2048)
user_ctx = "Q:" + promote + " A: "
output = llm(user_ctx, max_tokens=256, stop=["Q:"], echo=True, temperature=0.2)
But I can't find the other two parameters, -c and -ins. Here is the output of main.exe -h:
usage: ./main [options]
options:
-h, --help show this help message and exit
-i, --interactive run in interactive mode
--interactive-first run in interactive mode and wait for input right away
-ins, --instruct run in instruction mode (use with Alpaca models)
--multiline-input allows you to write or paste multiple lines without ending each in '\'
-r PROMPT, --reverse-prompt PROMPT
halt generation at PROMPT, return control in interactive mode
(can be specified more than once for multiple prompts).
--color colorise output to distinguish prompt and user input from generations
-s SEED, --seed SEED RNG seed (default: -1, use random seed for < 0)
-t N, --threads N number of threads to use during computation (default: 8)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: empty)
-e process prompt escapes sequences (\n, \r, \t, \', \", \\)
--prompt-cache FNAME file to cache prompt state for faster startup (default: none)
--prompt-cache-all if specified, saves user input and generations to cache as well.
not supported with --interactive or other interactive options
--prompt-cache-ro if specified, uses the prompt cache but does not update it.
--random-prompt start with a randomized prompt.
--in-prefix STRING string to prefix user inputs with (default: empty)
--in-suffix STRING string to suffix after user inputs with (default: empty)
-f FNAME, --file FNAME
prompt file to start generation.
-n N, --n-predict N number of tokens to predict (default: -1, -1 = infinity)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--tfs N tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.1, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--mirostat N use Mirostat sampling.
Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
-c N, --ctx-size N size of the prompt context (default: 512)
--ignore-eos ignore end of stream token and continue generating (implies --logit-bias 2-inf)
--no-penalize-nl do not penalize newline token
--memory-f32 use f32 instead of f16 for memory key+value (default: disabled)
not recommended: doubles context memory required and no measurable increase in quality
--temp N temperature (default: 0.8)
-b N, --batch-size N batch size for prompt processing (default: 512)
--perplexity compute perplexity over the prompt
--keep number of tokens to keep from the initial prompt (default: 0, -1 = all)
--mlock force system to keep model in RAM rather than swapping or compressing
--no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)
-ngl N, --n-gpu-layers N
number of layers to store in VRAM
-ts SPLIT --tensor-split SPLIT
how to split tensors across multiple GPUs, comma-separated list of proportions, e.g. 3,1
-mg i, --main-gpu i the GPU to use for scratch and small tensors
--mtest compute maximum memory usage
--export export the computation graph to 'llama.ggml'
--verbose-prompt print prompt before generation
--lora FNAME apply LoRA adapter (implies --no-mmap)
--lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter
-m FNAME, --model FNAME
model path (default: models/7B/ggml-model.bin)
I don't think it makes sense to compare anything when using "temperature=0.2". Try temperature 0. The "c" parameter is "n_ctx". There are also many other parameters that are in play, even if you don't specify them either in llama.cpp or llama-cpp-python.
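To make the flag-to-kwarg correspondence concrete, here is a small helper (the function name and the table are mine, based only on the two `--help` listings above, so double-check the mapping before relying on it) that translates the handful of overlapping main.exe flags into `Llama()` keyword arguments:

```python
def cli_to_llama_kwargs(argv):
    """Translate a few overlapping main.exe flags into Llama() kwargs.

    Only covers the flags discussed in this thread; the mapping is my
    own reading of the two --help listings, not an official table.
    """
    flag_map = {
        "-c": ("n_ctx", int), "--ctx-size": ("n_ctx", int),
        "-t": ("n_threads", int), "--threads": ("n_threads", int),
        "-s": ("seed", int), "--seed": ("seed", int),
        "-b": ("n_batch", int), "--batch-size": ("n_batch", int),
    }
    kwargs = {}
    it = iter(argv)
    for flag in it:
        if flag in flag_map:
            name, cast = flag_map[flag]
            kwargs[name] = cast(next(it))  # flag's value is the next token
    return kwargs

print(cli_to_llama_kwargs(["-c", "2048", "-t", "8", "-s", "42"]))
# → {'n_ctx': 2048, 'n_threads': 8, 'seed': 42}
```

Call-time options like --temp and -n map to the `temperature` and `max_tokens` arguments of `llm(...)` rather than the constructor, and -r (reverse prompt) roughly corresponds to `stop=[...]`.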
I have the same question with my fine-tuned model from llama. The -ins parameter may be the important one, but even after reading the source code of llama.cpp it is hard for me to figure out how to work around it.
I solved this problem using the example in llama-cpp-python/examples/high_level_api/langchain_custom_llm.py, and the result is OK.
I'm having the same problem; I don't know whether it's possible to change the mode from interactive to instruction in llama-cpp-python.
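Since llama-cpp-python exposes no --instruct flag, one workaround is to reproduce on the Python side what -ins does in llama.cpp: wrap each user input in the Alpaca instruction template before calling the model. A minimal sketch (the header text is the commonly used Alpaca template; verify it matches the prompt file your model was fine-tuned with):

```python
# Emulate main.exe's -ins (instruction mode) by wrapping the user's
# text in the Alpaca prompt template before calling Llama.
ALPACA_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)

def build_instruct_prompt(instruction: str) -> str:
    # -ins effectively prefixes "### Instruction:" and stops at the
    # next "### Instruction:"; we do the same textually here.
    return (
        ALPACA_HEADER
        + "### Instruction:\n"
        + instruction.strip()
        + "\n\n### Response:\n"
    )

prompt = build_instruct_prompt("Translate this sentence to English: ...")
# Then: output = llm(prompt, stop=["### Instruction:"], max_tokens=256)
```

The `stop` list plays the role of the reverse prompt, so generation halts where interactive mode would hand control back to the user.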
@gpxin Could you specify where you found this folder? I only have llama-cpp and llama_cpp_python-0.2.11.dist-info, and neither of them has an "examples" folder.
Thanks!
@AndreCarasas It's not in the Python package but in this project; the folder path is: https://github.com/abetlen/llama-cpp-python/tree/main/examples/high_level_api
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected llama-cpp-python to do.
What I am trying to do: I want the model to translate a sentence from Chinese to English for me. When I call the model with the original llama.cpp from the command line, the model works fine and gives the right output. Notice that the yellow line
Below is an ......
is the content of a prompt file; the file has been passed to the model with -f prompts/alpaca.txt, and I can't find this parameter in this project, so I can't tell whether it is the reason for this issue.
Current Behavior
When I run the same thing with llama-cpp-python like this:
the output was:
You can see that, this way, the model just returns the content to me instead of translating it.
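The -f flag in main.exe simply reads a file and uses its contents as the prompt prefix, and that part is easy to reproduce in Python by reading the file yourself before calling the model. A sketch (the helper name is mine; the path is the one from the original command line):

```python
from pathlib import Path

def prompt_from_file(path: str, user_input: str) -> str:
    """Emulate main.exe's -f flag: use the file contents as the
    prompt prefix, then append the user's actual request."""
    template = Path(path).read_text(encoding="utf-8")
    return template + "\n" + user_input

# Usage (untested sketch; adjust paths and sampling to your setup):
# full_prompt = prompt_from_file("prompts/alpaca.txt", user_ctx)
# output = llm(full_prompt, max_tokens=256, temperature=0)
```

Without that prefix, the Python call sees only the bare "Q: ... A:" text, which may explain why it echoes the sentence instead of translating it.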
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Linux xxxxx 5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
It worked, but not in the way I want, so I don't think the remaining template questions will help; I have removed them.
I can totally understand that models are built on probabilities, so they may give answers with small differences, but I would still like to get some help here.
Thanks in advance.