ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Request support for LLaMA-2-7B-32K #2530

Closed apcameron closed 1 year ago

apcameron commented 1 year ago

LLaMA-2-7B-32K Model Description

LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama 2 7B model. This model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. The model has been extended to a context length of 32K with position interpolation, enabling applications such as multi-document QA, long-text summarization, etc. The model is available here.

klosax commented 1 year ago

It should work by using the parameter --rope-freq-scale 8.0

apcameron commented 1 year ago

@klosax Have you tried it? If so, what exactly did you do? It does not work for me.

klosax commented 1 year ago

No, I have not tried it; I was just looking at the model's config.json. What does not work? Have you tried it without quantization, using F32 or F16?

apcameron commented 1 year ago

Here is what it does:

./main --rope-freq-scale 8.0 -m models/ggml-model-f16.bin -p "What is a Llama?"
main: warning: scaling RoPE frequency by 8 (default 1.0)
main: build = 963 (93356bd)
main: seed  = 1691415624
llama.cpp: loading model from models/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 8
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 12853.10 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.84 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 What is a Llama?!?!

Igoorx commented 1 year ago

@apcameron Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

klosax commented 1 year ago

Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

You are right; looking at PR https://github.com/ggerganov/llama.cpp/pull/2054, it sure looks like I missed something.

So extending the context length from 4k to 32k means a ctx_scale of 8.0. According to the PR, we now have two parameters to set for that to work:

--rope-freq-scale = 1 / ctx_scale = 1 / 8.0 = 0.125
--rope-freq-base = 10000 x ctx_scale = 80000

If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned context length. I don't know if the rope-freq-base parameter is needed, but please report your findings.
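
For example, an invocation along these lines should exercise the full 32K window (untested; I am leaving out --rope-freq-base until we know whether it is needed, and -c 32768 will need correspondingly more memory for the KV cache):

./main -m models/ggml-model-f16.bin -c 32768 --rope-freq-scale 0.125 -p "What is a Llama?"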

apcameron commented 1 year ago

Thank you, --rope-freq-scale 0.125 works.

Igoorx commented 1 year ago

If this work as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I dont know if the rope-freq-base parameter is needed but please report your findings.

rope-freq-base shouldn't be used together with rope-freq-scale: rope-freq-base is used for NTK-aware scaling and rope-freq-scale is used for linear scaling, so if you use the two together you're basically applying a 64x scaling.
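
Roughly, both knobs act on the same rotary frequencies, just in different ways. Here is a small illustrative sketch of the math as I understand it (not llama.cpp's actual code):

import numpy as np  # illustrative only; not llama.cpp's implementation

def rope_inv_freq(head_dim, base=10000.0, freq_scale=1.0):
    # Per-dimension rotation frequencies: theta_i = base^(-2i/d), then scaled.
    # Linear scaling multiplies every frequency by freq_scale, i.e. compresses positions.
    return freq_scale * base ** (-np.arange(0, head_dim, 2) / head_dim)

linear = rope_inv_freq(128, base=10000.0, freq_scale=1.0 / 8)  # --rope-freq-scale 0.125
ntk    = rope_inv_freq(128, base=80000.0, freq_scale=1.0)      # --rope-freq-base 80000
both   = rope_inv_freq(128, base=80000.0, freq_scale=1.0 / 8)  # stacking the two

# For the lowest-frequency dimensions, stacking gives roughly an 8 * 8 = 64x stretch.
print(linear[-1] / rope_inv_freq(128)[-1], both[-1] / rope_inv_freq(128)[-1])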

klosax commented 1 year ago

rope-freq-base shouldn't be used together with rope-freq-scale

Ok. Thank you.

--rope-freq-scale 0.125 works

Great. I think we should have a parameter that is the inverse of this, since it would make more sense and be in line with the parameters in the HF config.json:

"rope_scaling": {
    "factor": 8.0,
    "type": "linear"
  }
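
Just to make the mapping concrete, a throwaway sketch (assumes the model's config.json is in the current directory):

import json

# Derive the current llama.cpp parameter from the HF config:
# rope_scaling.factor 8.0 -> --rope-freq-scale 0.125
with open("config.json") as f:
    cfg = json.load(f)

factor = cfg["rope_scaling"]["factor"]
print(f"--rope-freq-scale {1.0 / factor}")
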
klosax commented 1 year ago

PR added https://github.com/ggerganov/llama.cpp/pull/2544

MUZAMMILPERVAIZ commented 1 year ago

Hi, can anyone share sample code showing how to use these scaling parameters while loading the Llama 2 13B chat model from Hugging Face?

klosax commented 1 year ago

https://github.com/ggerganov/llama.cpp/tree/master/examples/main#extended-context-size

MUZAMMILPERVAIZ commented 1 year ago

Thanks for your response, but I want this without llama.cpp, like in this code:

import torch
from transformers import BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
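
I was guessing it would be something along these lines, but I'm not sure (untested sketch; I believe rope_scaling on the Llama config needs a recent transformers release, around 4.31, and the factor below is only an example mirroring the config.json above):

from transformers import LlamaConfig, LlamaForCausalLM

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# Attach linear RoPE scaling via the config before loading the model.
# factor=8.0 is only an example value; it should match what the checkpoint was fine-tuned with.
config = LlamaConfig.from_pretrained(MODEL_NAME)
config.rope_scaling = {"type": "linear", "factor": 8.0}

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    config=config,
    device_map="auto",
)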