PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

Llama 2 70B fails when running #583

Open Karobben opened 1 year ago

Karobben commented 1 year ago

Thanks for your tool.

I have a problem when I run the TheBloke/Llama-2-70b-Chat-GGUF model. It loads fine, but after I ask a question, it crashes. Is this normal? I have dual RTX 4090s.

The error output looks like this:

Running on: cuda
Display Source Documents set to: False
Use history set to: False
2023-10-13 02:08:34,335 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-10-13 02:08:35,847 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Loading Model: TheBloke/Llama-2-70b-Chat-GGUF, on: cuda
This action can take a few minutes!
Using Llamacpp for GGUF/GGML quantized models
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ./models/models--TheBloke--Llama-2-70b-Chat-GGUF/snapshots/6657b410331f892e2ea27eb435ef53aeefeedfd6/llama-2-70b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  8192, 32000,     1,     1 ]
...
llama_model_loader: - tensor  722:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_K:  441 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q4_K - Medium
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 4090) as main device
llm_load_tensors: mem required  =  140.86 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 40643 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size =  561.47 MB
llama_new_context_with_model: VRAM scratch buffer: 560.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

Enter a query: give me a random word 
CUDA error 2 at /tmp/pip-install-j96ozfwa/llama-cpp-python_5dd8e3b467184a538038a703b679c8e7/vendor/llama.cpp/ggml-cuda.cu:5031: out of memory
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2829:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

GPU load is:

[screenshot: GPU load]
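
If this is simply running out of memory at inference time (per the log above, the weights alone take ~40 GB of VRAM, and llama.cpp allocates the 1280 MB KV cache plus a 560 MB scratch buffer on the main device), one thing I could try is loading with fewer offloaded layers and a smaller context. Below is a rough sketch using llama-cpp-python's documented parameters; the path and numbers are illustrative guesses to tune, not localGPT's defaults:

from llama_cpp import Llama

# Rough sketch: reduce per-GPU memory by offloading fewer layers and
# shrinking the context/batch buffers. Path and values are illustrative.
llm = Llama(
    model_path="./models/llama-2-70b-chat.Q4_K_M.gguf",
    n_ctx=2048,        # half of the trained 4096 context -> smaller KV cache
    n_batch=256,       # smaller batch -> smaller compute buffer
    n_gpu_layers=70,   # fewer than the full 83 layers; the rest stay on CPU
)
out = llm("Give me a random word.", max_tokens=16)
print(out["choices"][0]["text"])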

Pranav-Kimbodo commented 1 year ago

Hey @Karobben

Error message:

CUDA error 100 at /var/tmp/pip-install-2walnqmv/llama-cpp-python_0209ea07c6904f1285137fa1276bae3d/vendor/llama.cpp/ggml-cuda.cu:5066: no CUDA-capable device is detected
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2829: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

Solution: try resolving the CUDA error by installing the matching CUDA toolkit and PyTorch version:

conda install -c pytorch torchvision cudatoolkit=10.1 pytorch
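
A quick way to check which failure mode applies: if PyTorch cannot see the GPU at all, a driver/toolkit mismatch is the likely cause; if it can, the issue is more likely memory. Standard PyTorch calls:

import torch

# If this prints False / 0, no CUDA device is visible to PyTorch,
# pointing at a driver/toolkit mismatch rather than a model problem.
print(torch.cuda.is_available())
print(torch.cuda.device_count())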

Karobben commented 1 year ago

Thanks for the quick response. So do you mean that CUDA 11.8 is not compatible with this model?