abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Long Context Generation Crashes Google Colab Instance #1792

Open kazunator opened 1 month ago


Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

I was testing KV cache quantization on long-context generation to see how much VRAM it saves. I expected the code to either run to completion or fail with an error trace that shows what went wrong.

Current Behavior

The code crashes the Colab instance without printing an error or traceback.
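One possible way to get more information out of a silent crash is to run the reproduction in a child process and inspect how it ended. The sketch below is only an illustration: `describe_exit` and `run_script` are hypothetical helper names, and it assumes a POSIX system (such as the Linux VM behind Colab), where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a child process return code into a short description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code is the number of the signal that killed the child.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_script(path: str) -> str:
    """Run a Python script in a child process and describe how it ended."""
    # -X faulthandler asks CPython to dump the Python traceback on fatal
    # signals such as SIGSEGV or SIGABRT.
    result = subprocess.run([sys.executable, "-X", "faulthandler", path])
    return describe_exit(result.returncode)
```

For example, `print(run_script("repro.py"))`, where `repro.py` is a hypothetical file holding the reproduction code. A `SIGKILL` often points to the process being killed for running out of memory, while `SIGSEGV` or `SIGABRT` usually point to a failure in native code. Neither is conclusive on its own.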

Environment and Context

The environment is Google Colab. The code that I tried to run is listed under Steps to Reproduce. The environment commands from the issue template (lscpu, uname -a, python3 --version, make --version, g++ --version) were left without output.

Failure Information (for bugs)

No failure information was printed. However, the model information was printed before the crash:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8b Instruct
llama_model_loader: - kv 3: general.organization str = Unsloth
llama_model_loader: - kv 4: general.finetune str = instruct
llama_model_loader: - kv 5: general.basename str = meta-llama-3.1
llama_model_loader: - kv 6: general.size_label str = 8B
llama_model_loader: - kv 7: llama.block_count u32 = 32
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: general.file_type u32 = 2
llama_model_loader: - kv 16: llama.vocab_size u32 = 128256
llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 18: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 19: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 22: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 26: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8b Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
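The metadata above (n_layer = 32, n_embd_k_gqa = 1024, n_embd_v_gqa = 1024) is enough for a rough estimate of how much memory KV cache quantization could save at the intended context length of 11000 tokens. The sketch below is a back-of-the-envelope calculation, not a measurement. `kv_cache_bytes` is a hypothetical helper, the block layouts are the standard ggml ones for f16, q8_0 and q4_0, and the estimate ignores any padding of the context size or other buffers that llama.cpp may allocate.

```python
# (bytes per block, elements per block) for a few ggml tensor types.
# f16 stores each element in 2 bytes; q8_0 and q4_0 pack 32 elements
# per block together with one 2-byte scale.
BLOCK_LAYOUT = {
    "f16": (2, 1),
    "q8_0": (34, 32),
    "q4_0": (18, 32),
}


def kv_cache_bytes(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, cache_type):
    """Estimate the combined size of the K and V caches in bytes."""
    block_bytes, block_size = BLOCK_LAYOUT[cache_type]
    n_elements = n_ctx * n_layer * (n_embd_k_gqa + n_embd_v_gqa)
    return n_elements * block_bytes // block_size


for cache_type in BLOCK_LAYOUT:
    size = kv_cache_bytes(11000, 32, 1024, 1024, cache_type)
    print(f"{cache_type}: {size / 2**20:.1f} MiB")
```

Under these assumptions the estimate is about 1375.0 MiB for f16, 730.5 MiB for q8_0 and 386.7 MiB for q4_0. Note that the log reports n_ctx = 512, so the run shown above would have allocated far less than any of these figures.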

Steps to Reproduce

Here are the pip installations:

!pip install -q git+https://github.com/abetlen/llama-cpp-python
!pip install huggingface-hub
!pip install -q flash-attn --no-build-isolation
!pip install datasets accelerate 
!pip install -q git+https://github.com/huggingface/transformers

Here is the code:

import json
import time
from llama_cpp import Llama

from datasets import load_dataset
from transformers import AutoTokenizer
import torch

repo_id = "Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2-GGUF"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("aifeifei798/DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored", padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load dataset
dataset = load_dataset('THUDM/LongBench', "samsum", split='test')
very_long_context = " ".join(dataset["context"])

# Tokenize and decode the context
inputs = tokenizer(very_long_context, max_length=10000, truncation="only_first", return_tensors="pt")
decoded_context = tokenizer.decode(inputs.input_ids[0], skip_special_tokens=True)

llm = Llama.from_pretrained(
    repo_id=repo_id,
    filename="*Llama-3.1-8B-Lexi-Uncensored_V2_Q4.gguf",
    flash_attn=True,
    type_k=4,
    type_v=4,
    max_tokens=11000,
    context_window=11000,
    n_gpu_layers=-1,
)
start_time = time.time()
output = llm(
    prompt = decoded_context
)
end_time = time.time()

time_taken = end_time - start_time

print(f"Time taken: {time_taken:.2f} seconds")

print(json.dumps(output, indent=2))
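The log reports n_ctx = 512 and a CPU buffer for the weights, which suggests that `context_window=11000` and possibly `n_gpu_layers=-1` did not take effect. The sketch below shows how the constructor arguments might be spelled instead, under three assumptions that should be checked against the installed llama-cpp-python version. The first is that the context size is passed as `n_ctx`. The second is that `max_tokens` belongs to the completion call, not the constructor. The third is that `type_k` and `type_v` take ggml type ids, where Q4_0 is 2 and 4 is not a valid cache type. `build_llama_kwargs` and `run` are hypothetical helpers, and none of this is confirmed to be the cause of the crash.

```python
# Assumed ggml type ids (from ggml's type enum); compare them with the
# constants exposed by the installed llama_cpp module before relying on them.
GGML_TYPE_F16 = 1
GGML_TYPE_Q4_0 = 2
GGML_TYPE_Q8_0 = 8


def build_llama_kwargs(n_ctx: int, kv_type: int) -> dict:
    """Constructor arguments for a long-context run with a quantized KV cache."""
    return {
        "n_ctx": n_ctx,        # context size in tokens
        "n_gpu_layers": -1,    # offload all layers, if the build has GPU support
        "flash_attn": True,    # kept from the original code
        "type_k": kv_type,     # ggml type id for the K cache
        "type_v": kv_type,     # ggml type id for the V cache
    }


def run(repo_id: str, filename: str, prompt: str, n_ctx: int = 11000, max_tokens: int = 256):
    # Imported lazily so build_llama_kwargs can be used without llama_cpp installed.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id=repo_id,
        filename=filename,
        **build_llama_kwargs(n_ctx, GGML_TYPE_Q4_0),
    )
    # max_tokens limits the number of generated tokens for this call.
    return llm(prompt, max_tokens=max_tokens)
```

If the model still loads into a CPU buffer with these arguments, the installed wheel may have been built without GPU support, which would be worth checking separately.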