Checked while asking a question on faith.isunfa.com:
nvidia-smi -l 1
Fri Nov 1 12:52:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 31C P8 6W / 300W | 8936MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 12:53:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 31C P5 9W / 300W | 8936MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 12:53:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 33C P3 24W / 300W | 8936MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
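For a more compact view of the same numbers, the query form of nvidia-smi could also be used instead of the full table (a hedged sketch using standard nvidia-smi query fields; this command was not part of the original report):

nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total,power.draw --format=csv -l 1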
Checked while asking a question on faith.isunfa.com:
docker compose logs ollama
ollama-1 | 2024/10/25 08:29:38 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
ollama-1 | time=2024-10-25T08:29:38.111Z level=INFO source=images.go:782 msg="total blobs: 17"
ollama-1 | time=2024-10-25T08:29:38.111Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
ollama-1 | time=2024-10-25T08:29:38.111Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
ollama-1 | time=2024-10-25T08:29:38.111Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4199835322/runners
ollama-1 | time=2024-10-25T08:29:43.089Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60102]"
ollama-1 | time=2024-10-25T08:29:43.089Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
ollama-1 | Downloading model: llama3.1
ollama-1 | time=2024-10-25T08:29:43.305Z level=INFO source=types.go:105 msg="inference compute" id=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 library=cuda compute=8.6 driver=12.6 name="NVIDIA RTX A6000" total="47.4 GiB" available="42.9 GiB"
ollama-1 | [GIN] 2024/10/25 - 08:29:43 | 200 | 28.906µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/10/25 - 08:29:44 | 200 | 1.287542551s | 127.0.0.1 | POST "/api/pull"
pulling manifest
ollama-1 | pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
ollama-1 | pulling 948af2743fc7... 100% ▕████████████████▏ 1.5 KB
ollama-1 | pulling 0ba8f0e314b4... 100% ▕████████████████▏ 12 KB
ollama-1 | pulling 56bb8bd477a5... 100% ▕████████████████▏ 96 B
ollama-1 | pulling 1a4c3c319823... 100% ▕████████████████▏ 485 B
ollama-1 | verifying sha256 digest
ollama-1 | writing manifest
ollama-1 | removing any unused layers
ollama-1 | success
ollama-1 | Downloading model: nomic-embed-text
ollama-1 | [GIN] 2024/10/25 - 08:29:44 | 200 | 21.401µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/10/25 - 08:29:45 | 200 | 776.037668ms | 127.0.0.1 | POST "/api/pull"
pulling manifest
ollama-1 | pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
ollama-1 | pulling c71d239df917... 100% ▕████████████████▏ 11 KB
ollama-1 | pulling ce4a164fc046... 100% ▕████████████████▏ 17 B
ollama-1 | pulling 31df23ea7daa... 100% ▕████████████████▏ 420 B
ollama-1 | verifying sha256 digest
ollama-1 | writing manifest
ollama-1 | removing any unused layers
ollama-1 | success
ollama-1 | [GIN] 2024/10/25 - 08:29:54 | 200 | 565.375µs | 172.18.0.5 | GET "/api/tags"
ollama-1 | [GIN] 2024/10/25 - 08:29:54 | 200 | 16.239703ms | 172.18.0.5 | POST "/api/create"
ollama-1 | time=2024-10-25T09:15:26.833Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 parallel=4 available=46092255232 required="6.2 GiB"
ollama-1 | time=2024-10-25T09:15:26.833Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[42.9 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-10-25T09:15:26.835Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama4199835322/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 34593"
ollama-1 | time=2024-10-25T09:15:26.835Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1 | time=2024-10-25T09:15:26.835Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-10-25T09:15:26.835Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=1 commit="1e6f655" tid="125587013758976" timestamp=1729847726
ollama-1 | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="125587013758976" timestamp=1729847726 total_threads=28
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="34593" tid="125587013758976" timestamp=1729847726
ollama-1 | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
ollama-1 | llama_model_loader: - kv 2: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 3: llama.context_length u32 = 8192
ollama-1 | llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 10: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
ollama-1 | llama_model_loader: - kv 21: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 65 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | time=2024-10-25T09:15:27.086Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | llm_load_vocab: token to piece cache size = 0.8000 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 8192
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 8192
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1 | ggml_cuda_init: found 1 CUDA devices:
ollama-1 | Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB
ollama-1 | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1 | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1 | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1 | llm_load_tensors: CPU buffer size = 281.81 MiB
ollama-1 | llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
ollama-1 | llama_new_context_with_model: n_ctx = 8192
ollama-1 | llama_new_context_with_model: n_batch = 512
ollama-1 | llama_new_context_with_model: n_ubatch = 512
ollama-1 | llama_new_context_with_model: flash_attn = 0
ollama-1 | llama_new_context_with_model: freq_base = 500000.0
ollama-1 | llama_new_context_with_model: freq_scale = 1
ollama-1 | llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
ollama-1 | llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
ollama-1 | llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
ollama-1 | llama_new_context_with_model: graph nodes = 1030
ollama-1 | llama_new_context_with_model: graph splits = 2
ollama-1 | INFO [main] model loaded | tid="125587013758976" timestamp=1729847729
ollama-1 | time=2024-10-25T09:15:29.093Z level=INFO source=server.go:632 msg="llama runner started in 2.26 seconds"
ollama-1 | [GIN] 2024/10/25 - 09:15:30 | 200 | 3.74223969s | 172.18.0.5 | POST "/api/chat"
ollama-1 | time=2024-10-28T03:43:19.090Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 parallel=4 available=46092255232 required="6.2 GiB"
ollama-1 | time=2024-10-28T03:43:19.091Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[42.9 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-10-28T03:43:19.093Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama4199835322/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 41219"
ollama-1 | time=2024-10-28T03:43:19.093Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1 | time=2024-10-28T03:43:19.093Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-10-28T03:43:19.093Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=1 commit="1e6f655" tid="138797087776768" timestamp=1730086999
ollama-1 | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138797087776768" timestamp=1730086999 total_threads=28
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="41219" tid="138797087776768" timestamp=1730086999
ollama-1 | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
ollama-1 | llama_model_loader: - kv 2: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 3: llama.context_length u32 = 8192
ollama-1 | llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 10: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
ollama-1 | llama_model_loader: - kv 21: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 65 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | time=2024-10-28T03:43:19.344Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | llm_load_vocab: token to piece cache size = 0.8000 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 8192
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 8192
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1 | ggml_cuda_init: found 1 CUDA devices:
ollama-1 | Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB
ollama-1 | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1 | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1 | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1 | llm_load_tensors: CPU buffer size = 281.81 MiB
ollama-1 | llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
ollama-1 | llama_new_context_with_model: n_ctx = 8192
ollama-1 | llama_new_context_with_model: n_batch = 512
ollama-1 | llama_new_context_with_model: n_ubatch = 512
ollama-1 | llama_new_context_with_model: flash_attn = 0
ollama-1 | llama_new_context_with_model: freq_base = 500000.0
ollama-1 | llama_new_context_with_model: freq_scale = 1
ollama-1 | llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
ollama-1 | llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
ollama-1 | llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
ollama-1 | llama_new_context_with_model: graph nodes = 1030
ollama-1 | llama_new_context_with_model: graph splits = 2
ollama-1 | INFO [main] model loaded | tid="138797087776768" timestamp=1730087001
ollama-1 | time=2024-10-28T03:43:21.603Z level=INFO source=server.go:632 msg="llama runner started in 2.51 seconds"
ollama-1 | [GIN] 2024/10/28 - 03:43:23 | 200 | 4.168594346s | 172.18.0.5 | POST "/api/chat"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:18:46.814Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:18:46.823Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 parallel=4 available=46092255232 required="6.2 GiB"
ollama-1 | time=2024-11-01T04:18:46.824Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[42.9 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-11-01T04:18:46.826Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama4199835322/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 38787"
ollama-1 | time=2024-11-01T04:18:46.826Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1 | time=2024-11-01T04:18:46.826Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-11-01T04:18:46.826Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=1 commit="1e6f655" tid="131390944980992" timestamp=1730434726
ollama-1 | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="131390944980992" timestamp=1730434726 total_threads=28
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="38787" tid="131390944980992" timestamp=1730434726
ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct
ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B
ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1
ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072
ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 66 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | time=2024-11-01T04:18:47.077Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 131072
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
ollama-1 | llm_load_tensors: ggml ctx size = 0.14 MiB
ollama-1 | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1 | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1 | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1 | llm_load_tensors: CPU buffer size = 4437.80 MiB
ollama-1 | llama_new_context_with_model: n_ctx = 8192
ollama-1 | llama_new_context_with_model: n_batch = 512
ollama-1 | llama_new_context_with_model: n_ubatch = 512
ollama-1 | llama_new_context_with_model: flash_attn = 0
ollama-1 | llama_new_context_with_model: freq_base = 500000.0
ollama-1 | llama_new_context_with_model: freq_scale = 1
ollama-1 | ggml_cuda_host_malloc: failed to allocate 1024.00 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
ollama-1 | llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
ollama-1 | ggml_cuda_host_malloc: failed to allocate 2.02 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_new_context_with_model: CPU output buffer size = 2.02 MiB
ollama-1 | ggml_cuda_host_malloc: failed to allocate 560.01 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_new_context_with_model: CUDA_Host compute buffer size = 560.01 MiB
ollama-1 | llama_new_context_with_model: graph nodes = 1030
ollama-1 | llama_new_context_with_model: graph splits = 1
ollama-1 | INFO [main] model loaded | tid="131390944980992" timestamp=1730434728
ollama-1 | time=2024-11-01T04:18:48.586Z level=INFO source=server.go:632 msg="llama runner started in 1.76 seconds"
ollama-1 | [GIN] 2024/11/01 - 04:19:27 | 200 | 40.654990829s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:19:32 | 200 | 2.206164206s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:21:15 | 200 | 29.318784838s | 172.18.0.5 | POST "/api/chat"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:15.663Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:15.922Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:16.174Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:16.422Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:16.673Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:16.920Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:17.171Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:17.420Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:17.672Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:17.920Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:18.172Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:18.420Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:18.671Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:18.920Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:19.171Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:19.420Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:19.671Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:19.920Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:20.168Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:20.420Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:26:20.664Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.025260783 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:20.671Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:26:20.914Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.275549609 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:26:20.921Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:26:21.164Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.525302 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:42:15.038Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:42:15.047Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 parallel=4 available=46092255232 required="6.2 GiB"
ollama-1 | time=2024-11-01T04:42:15.047Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[42.9 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-11-01T04:42:15.048Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama4199835322/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 34011"
ollama-1 | time=2024-11-01T04:42:15.048Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1 | time=2024-11-01T04:42:15.048Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-11-01T04:42:15.048Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=1 commit="1e6f655" tid="135938866851840" timestamp=1730436135
ollama-1 | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="135938866851840" timestamp=1730436135 total_threads=28
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="34011" tid="135938866851840" timestamp=1730436135
ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct
ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B
ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1
ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072
ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 66 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | time=2024-11-01T04:42:15.300Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 131072
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
ollama-1 | llm_load_tensors: ggml ctx size = 0.14 MiB
ollama-1 | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1 | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1 | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1 | llm_load_tensors: CPU buffer size = 4437.80 MiB
ollama-1 | llama_new_context_with_model: n_ctx = 8192
ollama-1 | llama_new_context_with_model: n_batch = 512
ollama-1 | llama_new_context_with_model: n_ubatch = 512
ollama-1 | llama_new_context_with_model: flash_attn = 0
ollama-1 | llama_new_context_with_model: freq_base = 500000.0
ollama-1 | llama_new_context_with_model: freq_scale = 1
ollama-1 | ggml_cuda_host_malloc: failed to allocate 1024.00 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
ollama-1 | llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
ollama-1 | ggml_cuda_host_malloc: failed to allocate 2.02 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_new_context_with_model: CPU output buffer size = 2.02 MiB
ollama-1 | ggml_cuda_host_malloc: failed to allocate 560.01 MiB of pinned memory: no CUDA-capable device is detected
ollama-1 | llama_new_context_with_model: CUDA_Host compute buffer size = 560.01 MiB
ollama-1 | llama_new_context_with_model: graph nodes = 1030
ollama-1 | llama_new_context_with_model: graph splits = 1
ollama-1 | INFO [main] model loaded | tid="135938866851840" timestamp=1730436135
ollama-1 | time=2024-11-01T04:42:16.053Z level=INFO source=server.go:632 msg="llama runner started in 1.00 seconds"
ollama-1 | [GIN] 2024/11/01 - 04:42:17 | 200 | 2.49666101s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:45:35 | 200 | 1.189691689s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:47:53 | 200 | 34.196352536s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:48:19 | 200 | 4.118639201s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:48:32 | 200 | 6.068755787s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:48:43 | 200 | 2.668457542s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:50:12 | 200 | 4.483475919s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:50:53 | 200 | 35.139662822s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:52:58 | 200 | 47.667981312s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:53:13 | 200 | 3.950778269s | 172.18.0.5 | POST "/api/chat"
ollama-1 | [GIN] 2024/11/01 - 04:54:28 | 200 | 37.55µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/11/01 - 04:54:28 | 200 | 185.761µs | 127.0.0.1 | GET "/api/ps"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:13.164Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:13.423Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:13.673Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:13.923Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | [GIN] 2024/11/01 - 04:58:13 | 200 | 26.361µs | 127.0.0.1 | GET "/api/version"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:14.174Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:14.423Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:14.673Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:14.922Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:15.172Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:15.422Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:15.671Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:15.922Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:16.172Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:16.421Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:16.671Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:16.922Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:17.172Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:17.421Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:17.672Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:17.921Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:58:18.165Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.025339699 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:18.172Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:58:18.415Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.274813749 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | cuda driver library failed to get device context 800time=2024-11-01T04:58:18.421Z level=WARN source=gpu.go:403 msg="error looking up nvidia GPU memory"
ollama-1 | time=2024-11-01T04:58:18.665Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.525473279 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
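The CUDA-related failures above can be pulled out of the compose output directly, e.g. (an illustrative filter, not part of the original report):

docker compose logs ollama | grep -iE 'cuda|gpu|vram'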
GPU utilization is visible in the log:
nvidia-smi -l 1
Fri Nov 1 13:54:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 49C P0 38W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 49C P0 110W / 165W | 6242MiB / 16380MiB | 92% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 50C P0 111W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 51C P0 111W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 51C P0 112W / 165W | 6242MiB / 16380MiB | 94% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 52C P0 112W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:54:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 52C P0 113W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:55:01 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 52C P0 113W / 165W | 6242MiB / 16380MiB | 94% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:55:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 30% 52C P0 116W / 165W | 6242MiB / 16380MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:55:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 30% 45C P0 104W / 165W | 6242MiB / 16380MiB | 25% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:55:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 30% 43C P0 48W / 165W | 6242MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 3141142 C ...unners/cuda_v11/ollama_llama_server 6206MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 13:55:05 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 30% 43C P0 36W / 165W | 6242MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
docker compose logs ollama
ollama-1 | 2024/11/01 06:15:13 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
ollama-1 | time=2024-11-01T06:15:13.534Z level=INFO source=images.go:782 msg="total blobs: 17"
ollama-1 | time=2024-11-01T06:15:13.534Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
ollama-1 | time=2024-11-01T06:15:13.534Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
ollama-1 | time=2024-11-01T06:15:13.535Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4171455736/runners
ollama-1 | time=2024-11-01T06:15:15.639Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60102 cpu cpu_avx cpu_avx2]"
ollama-1 | time=2024-11-01T06:15:15.639Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
ollama-1 | time=2024-11-01T06:15:15.806Z level=INFO source=types.go:105 msg="inference compute" id=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 library=cuda compute=8.6 driver=12.6 name="NVIDIA RTX A6000" total="47.4 GiB" available="38.4 GiB"
ollama-1 | Downloading model: llama3.1
ollama-1 | [GIN] 2024/11/01 - 06:15:18 | 200 | 39.284µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/11/01 - 06:15:19 | 200 | 1.331986068s | 127.0.0.1 | POST "/api/pull"
pulling manifest
ollama-1 | pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
ollama-1 | pulling 948af2743fc7... 100% ▕████████████████▏ 1.5 KB
ollama-1 | pulling 0ba8f0e314b4... 100% ▕████████████████▏ 12 KB
ollama-1 | pulling 56bb8bd477a5... 100% ▕████████████████▏ 96 B
ollama-1 | pulling 1a4c3c319823... 100% ▕████████████████▏ 485 B
ollama-1 | verifying sha256 digest
ollama-1 | writing manifest
ollama-1 | removing any unused layers
ollama-1 | success
ollama-1 | Downloading model: nomic-embed-text
ollama-1 | [GIN] 2024/11/01 - 06:15:19 | 200 | 19.484µs | 127.0.0.1 | HEAD "/"
ollama-1 | [GIN] 2024/11/01 - 06:15:20 | 200 | 790.582381ms | 127.0.0.1 | POST "/api/pull"
pulling manifest
ollama-1 | pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
ollama-1 | pulling c71d239df917... 100% ▕████████████████▏ 11 KB
ollama-1 | pulling ce4a164fc046... 100% ▕████████████████▏ 17 B
ollama-1 | pulling 31df23ea7daa... 100% ▕████████████████▏ 420 B
ollama-1 | verifying sha256 digest
ollama-1 | writing manifest
ollama-1 | removing any unused layers
ollama-1 | success
ollama-1 | [GIN] 2024/11/01 - 06:15:23 | 200 | 631.376µs | 172.18.0.5 | GET "/api/tags"
ollama-1 | [GIN] 2024/11/01 - 06:15:23 | 200 | 11.933682ms | 172.18.0.5 | POST "/api/create"
ollama-1 | time=2024-11-01T06:15:59.546Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-cd886b4a-0f5b-6228-18e1-b7ad1f301043 parallel=4 available=41281388544 required="6.2 GiB"
ollama-1 | time=2024-11-01T06:15:59.546Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[38.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-11-01T06:15:59.547Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama4171455736/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 32847"
ollama-1 | time=2024-11-01T06:15:59.547Z level=INFO source=sched.go:445 msg="loaded runners" count=1
ollama-1 | time=2024-11-01T06:15:59.547Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-11-01T06:15:59.547Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=1 commit="1e6f655" tid="125381296467968" timestamp=1730441759
ollama-1 | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="125381296467968" timestamp=1730441759 total_threads=28
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="32847" tid="125381296467968" timestamp=1730441759
ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct
ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B
ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1
ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072
ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 66 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | time=2024-11-01T06:15:59.799Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 131072
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1 | ggml_cuda_init: found 1 CUDA devices:
ollama-1 | Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB
ollama-1 | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1 | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1 | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1 | llm_load_tensors: CPU buffer size = 281.81 MiB
ollama-1 | llm_load_tensors: CUDA0 buffer size = 4156.00 MiB
ollama-1 | llama_new_context_with_model: n_ctx = 8192
ollama-1 | llama_new_context_with_model: n_batch = 512
ollama-1 | llama_new_context_with_model: n_ubatch = 512
ollama-1 | llama_new_context_with_model: flash_attn = 0
ollama-1 | llama_new_context_with_model: freq_base = 500000.0
ollama-1 | llama_new_context_with_model: freq_scale = 1
ollama-1 | llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
ollama-1 | llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
ollama-1 | llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
ollama-1 | llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
ollama-1 | llama_new_context_with_model: graph nodes = 1030
ollama-1 | llama_new_context_with_model: graph splits = 2
ollama-1 | INFO [main] model loaded | tid="125381296467968" timestamp=1730441760
ollama-1 | time=2024-11-01T06:16:01.054Z level=INFO source=server.go:632 msg="llama runner started in 1.51 seconds"
nvidia-smi -l 1
Fri Nov 1 14:16:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 37C P0 70W / 300W | 15305MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 37C P0 74W / 300W | 15305MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 37C P0 79W / 300W | 15305MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 44C P0 139W / 300W | 15305MiB / 49140MiB | 91% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 46C P0 285W / 300W | 15305MiB / 49140MiB | 92% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 40C P0 171W / 300W | 15305MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Fri Nov 1 14:16:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 40C P0 117W / 300W | 15305MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 221678 C ...cafeca/.conda/envs/flux/bin/python3 8926MiB |
| 0 N/A N/A 3526563 C ...unners/cuda_v11/ollama_llama_server 6364MiB |
+-----------------------------------------------------------------------------------------+
Took 3.5 hrs; done.
Purpose
TODO
docker stats
nvidia-smi -l 1
feature/document-migration-env branch: run it with the latest docker compose, and along the way test whether the new docker compose file has any mistakes.
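A quick way to sanity-check that the compose file on that branch really hands the GPU to the container (a sketch; it assumes the service is named ollama, as in the logs above, and that the NVIDIA Container Toolkit is installed on the host):
# recreate the stack from the branch's compose file
docker compose up -d --force-recreate ollama
# if GPU passthrough is configured correctly, the NVIDIA runtime injects
# nvidia-smi into the container, so the GPU should be listed from inside it
docker compose exec ollama nvidia-smi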
Check points
nvidia-smi -l 1
You can see the GPU utilization change while a question is being asked on faith.isunfa.com, which means the setup is working.
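Besides asking through faith.isunfa.com, the same utilization spike can be reproduced by hitting the Ollama API directly (a minimal sketch; it assumes the container's port 11434 is published on localhost):
# terminal 1: sample utilization once per second
nvidia-smi -l 1
# terminal 2: send a prompt to the llama3.1 model the container already pulled
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "Why is the sky blue?"}'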
Solution overview
docker compose logs ollama
Check whether any errors show up in the ollama docker container log, then troubleshoot GPU utilization or anything else from there.
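To avoid scanning the full log by hand, the level field in the lines above can be filtered (a sketch; adjust the pattern as needed):
# show only warnings and errors from the ollama service
docker compose logs ollama | grep -E "level=(WARN|ERROR)"
# or follow the log live while a question is being asked
docker compose logs -f ollama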
nvidia-smi -l 1
Check whether the ollama docker container shows up under Processes after it starts, and whether GPU utilization fluctuates while ollama is running inference.
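For a narrower view than the full table, nvidia-smi's query mode can list just the compute processes and their memory usage (a sketch of an alternative to watching the whole output):
# list compute processes only; the cuda_v11/ollama_llama_server runner should appear here
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 1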