EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Unbelievably long time when hosting the GGUF model? #1971

Open hzgdeerHo opened 1 month ago

hzgdeerHo commented 1 month ago

lm_eval --model gguf --tasks arc_challenge --num_fewshot 25 --model_args model=codellama,base_url=http://127.0.0.1:8090 --batch_size 16 --log_samples --output_path ./hzg_llama3_arc_challenge_25shot_f16GGUF --show_config --cache_requests true --use_cache ./hzg_llama3_arc_challenge_25shot_f16GGUF --verbosity DEBUG

Checking cached requests: 100%|███████████████████████████| 4687/4687 [00:00<00:00, 5980.62it/s]
0%| | 1/4687 [02:32<199:02:30, 152.91s/it]
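For scale: 25-shot arc_challenge expands to 4687 loglikelihood requests, and at ~153 s/it the ETA is nearly 200 hours. A likely reason (an assumption about the gguf backend's internals, not something confirmed in this thread) is that each request becomes a separate HTTP completion call against the llama.cpp server, so --batch_size 16 does not batch anything on the server side. A minimal sketch of the kind of call involved, using the OpenAI-style /v1/completions endpoint that llama_cpp.server exposes (the echo/logprobs handling is assumed to follow the OpenAI completions API):

import requests

BASE_URL = "http://127.0.0.1:8090"  # the llama_cpp.server instance started below

def loglikelihood_of(context: str, continuation: str) -> float:
    """Score one (context, continuation) pair with a single completion call.
    Echoing the prompt with logprobs returns per-token logprobs; summing the
    continuation's tokens would give its loglikelihood. Here we just sum
    everything to illustrate the per-request cost."""
    resp = requests.post(
        f"{BASE_URL}/v1/completions",
        json={
            "prompt": context + continuation,
            "max_tokens": 1,   # no new text needed, only the echoed logprobs
            "echo": True,
            "logprobs": 1,
            "temperature": 0.0,
        },
        timeout=600,
    )
    resp.raise_for_status()
    token_logprobs = resp.json()["choices"][0]["logprobs"]["token_logprobs"]
    return sum(lp for lp in token_logprobs if lp is not None)

print(loglikelihood_of("Question: Which gas do plants absorb?\nAnswer:", " carbon dioxide"))

With 25 few-shot examples every prompt is several thousand tokens, so each call re-processes a long prompt from scratch, and 4687 such sequential calls easily add up to the ETA shown above.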

(base) ubuntu@ly-rq-214-23-49-1:~$ python -m llama_cpp.server --config_file llama_cpp_config.json

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/ubuntu/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/Llama-3-8B-Instruct-f16.GGUF (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.name str = c4a54320a52ed5f88b7a2f84496903ea4ff07b45
llama_model_loader: - kv   2: llama.vocab_size u32 = 128256
llama_model_loader: - kv   3: llama.context_length u32 = 8192
llama_model_loader: - kv   4: llama.embedding_length u32 = 4096
llama_model_loader: - kv   5: llama.block_count u32 = 32
llama_model_loader: - kv   6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv   7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv   8: llama.attention.head_count u32 = 32
llama_model_loader: - kv   9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv  10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv  12: general.file_type u32 = 1
llama_model_loader: - kv  13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv  19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv  20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ****
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ****
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 14.96 GiB (16.00 BPW)
llm_load_print_meta: general.name = c4a54320a52ed5f88b7a2f84496903ea4ff07b45
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 3744.28 MiB
llm_load_tensors: CUDA1 buffer size = 3328.25 MiB
llm_load_tensors: CUDA2 buffer size = 3328.25 MiB
llm_load_tensors: CUDA3 buffer size = 3914.23 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 16128
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 567.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 504.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 504.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 441.00 MiB
llama_new_context_with_model: KV self size = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 245.76 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 175.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 175.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 353.52 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 5
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'c4a54320a52ed5f88b7a2f84496903ea4ff07b45', 'llama.vocab_size': '128256', 'general.file_type': '1', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
Available chat formats from metadata: chat_template.default
INFO: Started server process [835603]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)

4X RTX 4090

llama.cpp server config file:

{ "host": "0.0.0.0", "port": 8090, "models": [ { "model": "/home/ubuntu/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/Llama-3-8B-Instruct-f16.GGUF", "model_alias": "codellama", "chat_format": "llama-3", "n_gpu_layers": -1, "offload_kqv": true, "n_threads": 12, "n_batch":512, "flash_attn":1, "n_ctx": 16000 } ] }

LSinev commented 1 month ago

Why do you think the problem is with lm-evaluation-harness and should be reported here? Did you search other GGUF related issues for the same problem and solutions? For example, https://github.com/EleutherAI/lm-evaluation-harness/issues/1472

hzgdeerHo commented 1 month ago

Thanks, I just cannot figure out the solution.