Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

CudaMalloc failed: out of memory with TinyLlama-1.1B #372

Open Lathanao opened 2 months ago

Lathanao commented 2 months ago

I am trying to get TinyLlama working on the GPU with:

./TinyLlama-1.1B-Chat-v1.0.F32.llamafile -ngl 9999

But it seems it is not possible to allocate 66.50 MiB of memory on my card, even when I have just booted the machine without any prior GPU use.
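A quick way to confirm how much VRAM is actually free before launching (a sketch, assuming the standard NVIDIA driver tools are installed):

# Show total/used/free VRAM; the full output also lists any processes holding the card
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
nvidia-smi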

Here is the error:

[...]
link_cuda_dso: note: dynamically linking /home/yo/.llamafile/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =   250.00 MiB
llm_load_tensors:      CUDA0 buffer size =  3946.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    66.50 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 66.50 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 69730304
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'TinyLlama-1.1B-Chat-v1.0.F32.gguf'
{"function":"load_model","level":"ERR","line":443,"model":"TinyLlama-1.1B-Chat-v1.0.F32.gguf","msg":"unable to load model","tid":"8545344","timestamp":1714117560}

I have this version of CUDA installed:

Version         : 12.3.2-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : LicenseRef-NVIDIA-CUDA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusolver.so=11-64
                  libcusparse.so=12-64  libcusparse.so=12-64

Here are the specs of my machine.

System:
  Kernel: 6.6.26-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
  Desktop: GNOME v: 45.4 tk: GTK v: 3.24.41 Distro: Manjaro
    base: Arch Linux
Machine:
  Type: Laptop System: HP product: HP Pavilion Gaming Laptop 15-cx0xxx
Memory:
  System RAM: total: 32 GiB available: 31.24 GiB used: 4.16 GiB (13.3%)
CPU:
  Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
    gen: core 8 level: v3 note: 
Graphics:
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile]
    vendor: Hewlett-Packard driver: nvidia v: 550.67
    alternate: nouveau,nvidia_drm non-free: 545.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 3
    speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c class-ID: 0300

Is there a way to solve that?

qkiel commented 2 months ago

Try a smaller version of TinyLlama, Q8_0 instead of F32: TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
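Rough size math (assuming the usual bytes per weight): 1.1B parameters at F32 is about 4.4 GB of weights, which already exceeds the 4 GiB on a GTX 1050 Ti, while Q8_0 is roughly 1.2 GB and leaves plenty of headroom. For example:

# Same invocation, smaller quantization of the same model
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -ngl 9999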

jart commented 2 months ago

Can you try llamafile-0.8.1, which was just released, and tell me if it works?
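For example (a sketch; the release asset name is an assumption based on how releases are usually packaged, and -m assumes you have the raw GGUF weights available to point the generic binary at):

curl -L -o llamafile-0.8.1 https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.1/llamafile-0.8.1
chmod +x llamafile-0.8.1
# Run the existing weights with the updated runtime
./llamafile-0.8.1 -m TinyLlama-1.1B-Chat-v1.0.F32.gguf -ngl 9999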

Lathanao commented 2 months ago

Works perfectly, and it is far faster than before! Thank you.

[screenshot]

Lathanao commented 2 months ago

Mea culpa: above, I got a model with a lower quantization format working. Now I am not able to run the file again without errors.

So I downloaded several models and tested them:

- Meta-Llama-3-8B-Instruct.F16.llamafile -> doesn't load
- Meta-Llama-3-8B-Instruct.Q2_K.llamafile -> SIGSEGV
- Model/Meta-Llama-3-8B-Instruct.Q8_0.llamafile -> doesn't load
- Model/Phi-3-mini-4k-instruct.Q8_0.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile -> SIGSEGV
- Model/TinyLlama-1.1B-Chat-v1.0.F32.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -> SIGSEGV
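To check whether the crashes are tied to the GPU path, a CPU-only run is one diagnostic I can try (a sketch; -ngl 0 keeps everything on the CPU, and --gpu disable should skip GPU initialization entirely, if this build supports that flag):

# Force a CPU-only run to see whether the SIGSEGV only happens on the CUDA path
./TinyLlama-1.1B-Chat-v1.0.F16.llamafile -ngl 0
./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --gpu disable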

I rebooted my machine and tested again. The model that was working for me this morning (Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile) now hits SIGSEGV every time. There is no way to get it working again.

The SIGSEGV issue has been reported in #378.