Closed: apcameron closed this issue 1 year ago.
It should work by using the parameter --rope-freq-scale 8.0
@klosax Have you tried it? If so, what exactly did you do? It does not work for me.
No, I have not tried it; I was just looking at the model's config.json. What does not work? Have you tried without quantization, using F32 or F16?
Here is what it does:
```
./main --rope-freq-scale 8.0 -m models/ggml-model-f16.bin -p "What is a Llama?"
main: warning: scaling RoPE frequency by 8 (default 1.0)
main: build = 963 (93356bd)
main: seed = 1691415624
llama.cpp: loading model from models/ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 5504
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 8
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 12853.10 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.84 MB
system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

What is a Llama?!?!
```
@apcameron Actually, it isn't `--rope-freq-scale 8.0`, it should be `--rope-freq-scale 0.125` (i.e. 1/8).
> Actually, it isn't `--rope-freq-scale 8.0`, it should be `--rope-freq-scale 0.125` (i.e. 1/8).
You are right, looking at the PR https://github.com/ggerganov/llama.cpp/pull/2054 it sure looks like I missed something.
So extending the context length from 4k to 32k is a ctx_scale of 8.0. According to the PR, there are now two parameters to set for that to work:

- `--rope-freq-scale` = 1/ctx_scale = 1/8.0 = 0.125
- `--rope-freq-base` = 10000 x ctx_scale = 80000

If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the `rope-freq-base` parameter is needed, but please report your findings.
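A minimal sketch of that arithmetic in Python (the `rope_params` helper below is just for illustration, not part of llama.cpp):

```python
# Illustrative helper (not part of llama.cpp): derive the two RoPE flags
# from the ratio between the fine-tuned and the original context length.
def rope_params(finetuned_ctx: int, original_ctx: int = 4096, base: float = 10000.0):
    ctx_scale = finetuned_ctx / original_ctx
    return {
        "--rope-freq-scale": 1.0 / ctx_scale,  # linear (position interpolation) scaling
        "--rope-freq-base": base * ctx_scale,  # NTK-aware alternative
    }

print(rope_params(32768))
# {'--rope-freq-scale': 0.125, '--rope-freq-base': 80000.0}
```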
Thank you, `--rope-freq-scale 0.125` works.
> If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the `rope-freq-base` parameter is needed, but please report your findings.
`rope-freq-base` shouldn't be used together with `rope-freq-scale`: `rope-freq-base` is used for NTK-aware scaling and `rope-freq-scale` is used for linear scaling, so if you use the two together you're basically applying a 64x scaling.
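A rough Python sketch of why the two compound (it uses the standard RoPE angle formula and assumes a 128-dimensional rotary embedding, matching `n_rot` in the log above; the numbers are illustrative only):

```python
DIM = 128  # rotary dimension per head (n_rot in the log above)

def rope_angle(pos, pair, base=10000.0, scale=1.0):
    """Rotation angle of rotary pair `pair` at position `pos` (standard RoPE)."""
    return (pos * scale) * base ** (-2.0 * pair / DIM)

pos, pair = 4096, DIM // 2 - 1                       # lowest-frequency pair
plain  = rope_angle(pos, pair)                       # stock RoPE
linear = rope_angle(pos, pair, scale=0.125)          # --rope-freq-scale 0.125
ntk    = rope_angle(pos, pair, base=80000.0)         # --rope-freq-base 80000
both   = rope_angle(pos, pair, base=80000.0, scale=0.125)

print(plain / linear)  # 8.0  (linear scaling)
print(plain / ntk)     # ~7.7 (base change; close to 8x for the low-frequency pairs)
print(plain / both)    # ~62  (the two compound to roughly 8 * 8 = 64x)
```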
> `rope-freq-base` shouldn't be used together with `rope-freq-scale`
Ok. Thank you.
> `--rope-freq-scale 0.125` works
Great. I think we should have a parameter that is the inverse of this, since it would make more sense and be in line with the parameters in the HF config.json:
"rope_scaling": {
"factor": 8.0,
"type": "linear"
}
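A small illustrative helper (hypothetical, not an existing llama.cpp or transformers tool) that reads the `rope_scaling` block from a model's config.json and prints the equivalent llama.cpp flag; the scale is just the inverse of the HF factor:

```python
import json

# Hypothetical helper: map the HF rope_scaling entry to the llama.cpp flag.
with open("config.json") as f:
    cfg = json.load(f)

rope = cfg.get("rope_scaling") or {}
if rope.get("type") == "linear":
    # --rope-freq-scale is the inverse of the HF "factor"
    print(f"--rope-freq-scale {1.0 / rope['factor']}")  # 0.125 for factor 8.0
```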
Hi, can anyone share sample code for how to use these scaling parameters while loading the Llama 2 13B chat model from Hugging Face?
Thanks for your response, but I want this without llama.cpp, like in this code:
```python
import torch
from transformers import BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
```
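For that Transformers path, a minimal sketch of applying the linear scaling while loading (this assumes a transformers release that supports the `rope_scaling` config field for Llama models, roughly 4.31 and later; the factor 8.0 mirrors the config.json above):

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, LlamaForCausalLM

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Keyword arguments that match config fields (such as rope_scaling) are
# forwarded to LlamaConfig, so the 8x linear interpolation can be set here.
model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=bnb_config,
    rope_scaling={"type": "linear", "factor": 8.0},
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
```

Alternatively, you can load the `LlamaConfig` first, set `rope_scaling` on it, and pass it via the `config=` argument. A model that was already fine-tuned for the longer context (such as LLaMA-2-7B-32K below) ships with this set in its config.json, so no override is needed there.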
LLaMA-2-7B-32K Model Description
LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. This model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. The model has been extended to a context length of 32K with position interpolation, allowing applications such as multi-document QA, long-text summarization, etc. The model is available here.