ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama.cpp Python bindings not working for multiple GPUs #6360

Closed: y6t4 closed this issue 3 months ago

y6t4 commented 5 months ago

I've been having a hellish experience trying to get the llama.cpp Python bindings to work with multiple GPUs. I have two RTX 2070s running under Ubuntu, and I want llama.cpp to perform inference using both GPUs. Both GPUs are visible when I run nvidia-smi. My Python script is as follows:

from llama_cpp import Llama

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",  # Download the model file first
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_gpu_layers=28         # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
  "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
  echo=True        # Whether to echo the prompt
)

# Chat Completion API
llm = Llama(model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
print(llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
))

When I run this, the first GPU seems to spike in usage only momentarily. The output is as follows:

llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2684.34 ms
llama_print_timings:      sample time =      38.44 ms /   126 runs   (    0.31 ms per token,  3277.58 tokens per second)
llama_print_timings: prompt eval time =    2684.26 ms /    23 tokens (  116.71 ms per token,     8.57 tokens per second)
llama_print_timings:        eval time =   34871.19 ms /   125 runs   (  278.97 ms per token,     3.58 tokens per second)
llama_print_timings:       total time =   37863.61 ms /   148 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    4301.90 ms
llama_print_timings:      sample time =     146.72 ms /   480 runs   (    0.31 ms per token,  3271.60 tokens per second)
llama_print_timings: prompt eval time =    4301.84 ms /    32 tokens (  134.43 ms per token,     7.44 tokens per second)
llama_print_timings:        eval time =  142802.23 ms /   479 runs   (  298.13 ms per token,     3.35 tokens per second)
llama_print_timings:       total time =  148467.59 ms /   511 tokens
{'id': 'chatcmpl-d83ae79e-8f2b-43c3-ac89-7dd73238b874', 'object': 'chat.completion', 'created': 1711604743, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. They were a happy and contented group, spending their days grazing on the lush grasses that grew on the mountain slopes, and their nights huddled together for warmth under the stars.\nOne day, as they were grazing, they noticed something strange in the distance. It was a small figure, slowly making its way towards them. As it got closer, they could see that it was a young girl, carrying a small bundle in her arms. She approached the herd and called out to them in a soft voice.\n"Hello, my friends," she said. "I am lost and I have nowhere to go. Can you help me?"\nThe llamas looked at each other, unsure of what to do. But then one of them, a wise old llama named Llama Llama, stepped forward.\n"Of course we can help you," he said. "Come and join us."\nThe girl was overjoyed at their kindness and gratefully accepted their offer. She introduced herself as Maria and told them that she had been traveling with her family when they had gotten separated. She had been wandering for days, trying to find her way back to them.\nThe llamas welcomed Maria into their herd and showed her all the beautiful places they knew on the mountain. She learned how to find the tastiest grasses to eat and how to keep warm on cold nights. She even learned how to spin their soft wool into yarn and weave it into beautiful blankets.\nDays turned into weeks and weeks into months, and Maria became one of the llamas. She loved her new life with them and never wanted to leave. But one day, as they were grazing on a sunny slope, they heard the sound of hooves approaching. It was Maria\'s family, finally reunited with their lost daughter.\nMaria was overjoyed to see them but sad to leave her llama friends behind. But Llama Llama reassured her that they would always be her friends and that she could visit them anytime she wanted.\nAnd so Maria returned to her family, but she never forgot her llama friends and their kindness to her.'}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 480, 'total_tokens': 512}}

Note that this output mentions the CPU but nothing about the GPU, so I'm unsure whether it's using any GPU at all, let alone both. For reference, which nvcc outputs /usr/local/cuda-12.2/bin/nvcc, and echo $CUDA_HOME outputs /usr/local/cuda-12.2.
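One quick sanity check is to ask the bindings directly whether the compiled backend can offload to a GPU at all. Recent llama-cpp-python releases expose llama.cpp's llama_supports_gpu_offload, so a minimal sketch like the following should print False for a CPU-only wheel:

import llama_cpp

# False means the compiled backend cannot offload layers to any GPU,
# i.e. the installed wheel was built without CUDA support.
print(llama_cpp.llama_supports_gpu_offload())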

I've spent a lot of time trying to solve this by following other issues where people have had similar GPU usage problems. For instance, I tried the advice given in one such issue, but with LLAMA_CUDA=on instead of LLAMA_CUBLAS=on. However, I haven't been able to fix the problem, and I'd appreciate any help.
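For reference, once a CUDA-enabled build is in place, the Llama constructor exposes the multi-GPU knobs directly. main_gpu and tensor_split are real constructor parameters in current llama-cpp-python; the even split below is only an illustrative value for two identical cards:

from llama_cpp import Llama

# Sketch only: these parameters have no effect unless the wheel
# was actually compiled with CUDA support.
llm = Llama(
    model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,          # -1 offloads every layer, VRAM permitting
    main_gpu=0,               # device used for small tensors and scratch buffers
    tensor_split=[0.5, 0.5],  # illustrative even split across the two RTX 2070s
)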

phymbert commented 5 months ago

Hi, I don't know about the Python bindings, but the llama.cpp that is running was not built with CUDA.

y6t4 commented 5 months ago

> Hi, I don't know about the Python bindings, but the llama.cpp that is running was not built with CUDA.

Thanks for the feedback. I just deleted llama.cpp and rebuilt it from the repository:

  1. Clone the repository:

     git clone https://github.com/ggerganov/llama.cpp
     cd llama.cpp

     Output:

     Cloning into 'llama.cpp'...
     remote: Enumerating objects: 21342, done.
     remote: Counting objects: 100% (21/21), done.
     remote: Compressing objects: 100% (20/20), done.
     remote: Total 21342 (delta 8), reused 2 (delta 1), pack-reused 21321
     Receiving objects: 100% (21342/21342), 24.83 MiB | 6.12 MiB/s, done.
     Resolving deltas: 100% (15020/15020), done.

  2. Configure and build with CUDA:

     mkdir build
     cd build
     cmake .. -DLLAMA_CUDA=ON
     cmake --build . --config Release

     Output:
    -- The C compiler identification is GNU 11.4.0
    -- The CXX compiler identification is GNU 11.4.0
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working C compiler: /usr/bin/cc - skipped
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Found Git: /usr/bin/git (found version "2.34.1") 
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
    -- Found Threads: TRUE  
    -- Found CUDAToolkit: /usr/local/cuda-12.2/include (found version "12.2.140") 
    -- CUDA found
    -- The CUDA compiler identification is NVIDIA 11.5.119
    -- Detecting CUDA compiler ABI info
    -- Detecting CUDA compiler ABI info - done
    -- Check for working CUDA compiler: /usr/bin/nvcc - skipped
    -- Detecting CUDA compile features
    -- Detecting CUDA compile features - done
    -- Using CUDA architectures: 52;61;70
    -- CUDA host compiler is GNU 11.4.0

    -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
    -- CMAKE_SYSTEM_PROCESSOR: x86_64
    -- x86 detected
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /home/me/llama.cpp/build
    [ 1%] Building C object CMakeFiles/ggml.dir/ggml.c.o
    [ 1%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
    [ 2%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
    [ 2%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
    [ 3%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
    [ 3%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/alibi.cu.o
    [ 4%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
    [ 4%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
    [ 5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
    [ 5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
    [ 6%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
    [ 7%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
    [ 7%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
    [ 8%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
    [ 8%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
    [ 9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
    [ 9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
    [ 10%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
    [ 10%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
    [ 11%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/norm.cu.o
    [ 11%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/pad.cu.o
    [ 12%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/pool2d.cu.o
    [ 12%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/quantize.cu.o
    [ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/rope.cu.o
    [ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/scale.cu.o
    [ 14%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/softmax.cu.o
    [ 14%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/sumrows.cu.o
    [ 15%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/tsembd.cu.o
    [ 15%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/unary.cu.o
    [ 16%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/upscale.cu.o
    [ 16%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
    [ 16%] Built target ggml
    [ 17%] Linking CUDA static library libggml_static.a
    [ 17%] Built target ggml_static
    [ 18%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
    [ 18%] Building CXX object CMakeFiles/llama.dir/unicode.cpp.o
    [ 19%] Building CXX object CMakeFiles/llama.dir/unicode-data.cpp.o
    [ 19%] Linking CXX static library libllama.a
    [ 19%] Built target llama
    [ 19%] Generating build details from Git
    -- Found Git: /usr/bin/git (found version "2.34.1")
    [ 20%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
    [ 20%] Built target build_info
    [ 20%] Building CXX object common/CMakeFiles/json-schema-to-grammar.dir/json-schema-to-grammar.cpp.o
    [ 20%] Built target json-schema-to-grammar
    [ 20%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
    [ 21%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
    [ 21%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
    [ 22%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
    [ 22%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
    [ 23%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
    [ 23%] Linking CXX static library libcommon.a
    [ 23%] Built target common
    [ 23%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
    [ 24%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/get-model.cpp.o
    [ 24%] Linking CXX executable ../bin/test-quantize-fns
    [ 24%] Built target test-quantize-fns
    [ 25%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
    [ 25%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/get-model.cpp.o
    [ 26%] Linking CXX executable ../bin/test-quantize-perf
    [ 26%] Built target test-quantize-perf
    [ 27%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
    [ 27%] Building CXX object tests/CMakeFiles/test-sampling.dir/get-model.cpp.o
    [ 28%] Linking CXX executable ../bin/test-sampling
    [ 28%] Built target test-sampling
    [ 28%] Building CXX object tests/CMakeFiles/test-chat-template.dir/test-chat-template.cpp.o
    [ 29%] Building CXX object tests/CMakeFiles/test-chat-template.dir/get-model.cpp.o
    [ 29%] Linking CXX executable ../bin/test-chat-template
    [ 29%] Built target test-chat-template
    [ 29%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/test-tokenizer-0-llama.cpp.o
    [ 30%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/get-model.cpp.o
    [ 30%] Linking CXX executable ../bin/test-tokenizer-0-llama
    [ 30%] Built target test-tokenizer-0-llama
    [ 30%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/test-tokenizer-0-falcon.cpp.o
    [ 31%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/get-model.cpp.o
    [ 32%] Linking CXX executable ../bin/test-tokenizer-0-falcon
    [ 32%] Built target test-tokenizer-0-falcon
    [ 32%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/test-tokenizer-1-llama.cpp.o
    [ 33%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/get-model.cpp.o
    [ 33%] Linking CXX executable ../bin/test-tokenizer-1-llama
    [ 33%] Built target test-tokenizer-1-llama
    [ 33%] Building CXX object tests/CMakeFiles/test-tokenizer-1-baichuan.dir/test-tokenizer-1-llama.cpp.o
    [ 34%] Building CXX object tests/CMakeFiles/test-tokenizer-1-baichuan.dir/get-model.cpp.o
    [ 34%] Linking CXX executable ../bin/test-tokenizer-1-baichuan
    [ 34%] Built target test-tokenizer-1-baichuan
    [ 35%] Building CXX object tests/CMakeFiles/test-tokenizer-1-falcon.dir/test-tokenizer-1-bpe.cpp.o
    [ 35%] Building CXX object tests/CMakeFiles/test-tokenizer-1-falcon.dir/get-model.cpp.o
    [ 36%] Linking CXX executable ../bin/test-tokenizer-1-falcon
    [ 36%] Built target test-tokenizer-1-falcon
    [ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-1-aquila.dir/test-tokenizer-1-bpe.cpp.o
    [ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-1-aquila.dir/get-model.cpp.o
    [ 38%] Linking CXX executable ../bin/test-tokenizer-1-aquila
    [ 38%] Built target test-tokenizer-1-aquila
    [ 39%] Building CXX object tests/CMakeFiles/test-tokenizer-1-mpt.dir/test-tokenizer-1-bpe.cpp.o
    [ 39%] Building CXX object tests/CMakeFiles/test-tokenizer-1-mpt.dir/get-model.cpp.o
    [ 40%] Linking CXX executable ../bin/test-tokenizer-1-mpt
    [ 40%] Built target test-tokenizer-1-mpt
    [ 41%] Building CXX object tests/CMakeFiles/test-tokenizer-1-stablelm-3b-4e1t.dir/test-tokenizer-1-bpe.cpp.o
    [ 41%] Building CXX object tests/CMakeFiles/test-tokenizer-1-stablelm-3b-4e1t.dir/get-model.cpp.o
    [ 42%] Linking CXX executable ../bin/test-tokenizer-1-stablelm-3b-4e1t
    [ 42%] Built target test-tokenizer-1-stablelm-3b-4e1t
    [ 42%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt-neox.dir/test-tokenizer-1-bpe.cpp.o
    [ 43%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt-neox.dir/get-model.cpp.o
    [ 43%] Linking CXX executable ../bin/test-tokenizer-1-gpt-neox
    [ 43%] Built target test-tokenizer-1-gpt-neox
    [ 43%] Building CXX object tests/CMakeFiles/test-tokenizer-1-refact.dir/test-tokenizer-1-bpe.cpp.o
    [ 44%] Building CXX object tests/CMakeFiles/test-tokenizer-1-refact.dir/get-model.cpp.o
    [ 44%] Linking CXX executable ../bin/test-tokenizer-1-refact
    [ 44%] Built target test-tokenizer-1-refact
    [ 44%] Building CXX object tests/CMakeFiles/test-tokenizer-1-starcoder.dir/test-tokenizer-1-bpe.cpp.o
    [ 45%] Building CXX object tests/CMakeFiles/test-tokenizer-1-starcoder.dir/get-model.cpp.o
    [ 45%] Linking CXX executable ../bin/test-tokenizer-1-starcoder
    [ 45%] Built target test-tokenizer-1-starcoder
    [ 46%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt2.dir/test-tokenizer-1-bpe.cpp.o
    [ 46%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt2.dir/get-model.cpp.o
    [ 47%] Linking CXX executable ../bin/test-tokenizer-1-gpt2
    [ 47%] Built target test-tokenizer-1-gpt2
    [ 47%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
    [ 48%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/get-model.cpp.o
    [ 48%] Linking CXX executable ../bin/test-grammar-parser
    [ 48%] Built target test-grammar-parser
    [ 48%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
    [ 49%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/get-model.cpp.o
    [ 49%] Linking CXX executable ../bin/test-llama-grammar
    [ 49%] Built target test-llama-grammar
    [ 50%] Building CXX object tests/CMakeFiles/test-grad0.dir/test-grad0.cpp.o
    [ 50%] Building CXX object tests/CMakeFiles/test-grad0.dir/get-model.cpp.o
    [ 51%] Linking CXX executable ../bin/test-grad0
    [ 51%] Built target test-grad0
    [ 52%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/test-backend-ops.cpp.o
    [ 52%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/get-model.cpp.o
    [ 53%] Linking CXX executable ../bin/test-backend-ops
    [ 53%] Built target test-backend-ops
    [ 53%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
    [ 54%] Building CXX object tests/CMakeFiles/test-rope.dir/get-model.cpp.o
    [ 54%] Linking CXX executable ../bin/test-rope
    [ 54%] Built target test-rope
    [ 55%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/test-model-load-cancel.cpp.o
    [ 55%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/get-model.cpp.o
    [ 56%] Linking CXX executable ../bin/test-model-load-cancel
    [ 56%] Built target test-model-load-cancel
    [ 57%] Building CXX object tests/CMakeFiles/test-autorelease.dir/test-autorelease.cpp.o
    [ 58%] Building CXX object tests/CMakeFiles/test-autorelease.dir/get-model.cpp.o
    [ 58%] Linking CXX executable ../bin/test-autorelease
    [ 58%] Built target test-autorelease
    [ 59%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/test-json-schema-to-grammar.cpp.o
    [ 59%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/get-model.cpp.o
    [ 60%] Linking CXX executable ../bin/test-json-schema-to-grammar
    [ 60%] Built target test-json-schema-to-grammar
    [ 60%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
    [ 61%] Linking CXX executable ../bin/test-c
    [ 61%] Built target test-c
    [ 61%] Building CXX object examples/baby-llama/CMakeFiles/baby-llama.dir/baby-llama.cpp.o
    [ 62%] Linking CXX executable ../../bin/baby-llama
    [ 62%] Built target baby-llama
    [ 62%] Building CXX object examples/batched/CMakeFiles/batched.dir/batched.cpp.o
    [ 63%] Linking CXX executable ../../bin/batched
    [ 63%] Built target batched
    [ 63%] Building CXX object examples/batched-bench/CMakeFiles/batched-bench.dir/batched-bench.cpp.o
    [ 64%] Linking CXX executable ../../bin/batched-bench
    [ 64%] Built target batched-bench
    [ 64%] Building CXX object examples/beam-search/CMakeFiles/beam-search.dir/beam-search.cpp.o
    [ 65%] Linking CXX executable ../../bin/beam-search
    [ 65%] Built target beam-search
    [ 65%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
    [ 66%] Linking CXX executable ../../bin/benchmark
    [ 66%] Built target benchmark
    [ 67%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
    [ 67%] Linking CXX executable ../../bin/convert-llama2c-to-ggml
    [ 67%] Built target convert-llama2c-to-ggml
    [ 68%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
    [ 68%] Linking CXX executable ../../bin/embedding
    [ 68%] Built target embedding
    [ 69%] Building CXX object examples/finetune/CMakeFiles/finetune.dir/finetune.cpp.o
    [ 69%] Linking CXX executable ../../bin/finetune
    [ 69%] Built target finetune
    [ 69%] Building CXX object examples/gritlm/CMakeFiles/gritlm.dir/gritlm.cpp.o
    [ 70%] Linking CXX executable ../../bin/gritlm
    [ 70%] Built target gritlm
    [ 70%] Building CXX object examples/gguf-split/CMakeFiles/gguf-split.dir/gguf-split.cpp.o
    [ 71%] Linking CXX executable ../../bin/gguf-split
    [ 71%] Built target gguf-split
    [ 71%] Building CXX object examples/infill/CMakeFiles/infill.dir/infill.cpp.o
    [ 72%] Linking CXX executable ../../bin/infill
    [ 72%] Built target infill
    [ 73%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
    [ 73%] Linking CXX executable ../../bin/llama-bench
    [ 73%] Built target llama-bench
    [ 74%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
    [ 75%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
    [ 75%] Built target llava
    [ 75%] Linking CXX static library libllava_static.a
    [ 75%] Built target llava_static
    [ 75%] Building CXX object examples/llava/CMakeFiles/llava-cli.dir/llava-cli.cpp.o
    [ 76%] Linking CXX executable ../../bin/llava-cli
    [ 76%] Built target llava-cli
    [ 77%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
    [ 77%] Linking CXX executable ../../bin/main
    [ 77%] Built target main
    [ 78%] Building CXX object examples/tokenize/CMakeFiles/tokenize.dir/tokenize.cpp.o
    [ 78%] Linking CXX executable ../../bin/tokenize
    [ 78%] Built target tokenize
    [ 79%] Building CXX object examples/parallel/CMakeFiles/parallel.dir/parallel.cpp.o
    [ 79%] Linking CXX executable ../../bin/parallel
    [ 79%] Built target parallel
    [ 80%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
    [ 80%] Linking CXX executable ../../bin/perplexity
    [ 80%] Built target perplexity
    [ 81%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
    [ 81%] Linking CXX executable ../../bin/quantize
    [ 81%] Built target quantize
    [ 82%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
    [ 82%] Linking CXX executable ../../bin/quantize-stats
    [ 82%] Built target quantize-stats
    [ 83%] Building CXX object examples/retrieval/CMakeFiles/retrieval.dir/retrieval.cpp.o
    [ 83%] Linking CXX executable ../../bin/retrieval
    [ 83%] Built target retrieval
    [ 84%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
    [ 84%] Linking CXX executable ../../bin/save-load-state
    [ 84%] Built target save-load-state
    [ 85%] Building CXX object examples/simple/CMakeFiles/simple.dir/simple.cpp.o
    [ 85%] Linking CXX executable ../../bin/simple
    [ 85%] Built target simple
    [ 86%] Building CXX object examples/passkey/CMakeFiles/passkey.dir/passkey.cpp.o
    [ 86%] Linking CXX executable ../../bin/passkey
    [ 86%] Built target passkey
    [ 87%] Building CXX object examples/speculative/CMakeFiles/speculative.dir/speculative.cpp.o
    [ 87%] Linking CXX executable ../../bin/speculative
    [ 87%] Built target speculative
    [ 88%] Building CXX object examples/lookahead/CMakeFiles/lookahead.dir/lookahead.cpp.o
    [ 88%] Linking CXX executable ../../bin/lookahead
    [ 88%] Built target lookahead
    [ 89%] Building CXX object examples/lookup/CMakeFiles/lookup.dir/lookup.cpp.o
    [ 89%] Linking CXX executable ../../bin/lookup
    [ 89%] Built target lookup
    [ 90%] Building CXX object examples/lookup/CMakeFiles/lookup-create.dir/lookup-create.cpp.o
    [ 90%] Linking CXX executable ../../bin/lookup-create
    [ 90%] Built target lookup-create
    [ 91%] Building CXX object examples/lookup/CMakeFiles/lookup-merge.dir/lookup-merge.cpp.o
    [ 91%] Linking CXX executable ../../bin/lookup-merge
    [ 91%] Built target lookup-merge
    [ 92%] Building CXX object examples/lookup/CMakeFiles/lookup-stats.dir/lookup-stats.cpp.o
    [ 92%] Linking CXX executable ../../bin/lookup-stats
    [ 92%] Built target lookup-stats
    [ 92%] Building CXX object examples/gguf/CMakeFiles/gguf.dir/gguf.cpp.o
    [ 93%] Linking CXX executable ../../bin/gguf
    [ 93%] Built target gguf
    [ 94%] Building CXX object examples/train-text-from-scratch/CMakeFiles/train-text-from-scratch.dir/train-text-from-scratch.cpp.o
    [ 94%] Linking CXX executable ../../bin/train-text-from-scratch
    [ 94%] Built target train-text-from-scratch
    [ 94%] Building CXX object examples/imatrix/CMakeFiles/imatrix.dir/imatrix.cpp.o
    [ 95%] Linking CXX executable ../../bin/imatrix
    [ 95%] Built target imatrix
    [ 96%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
    [ 96%] Linking CXX executable ../../bin/server
    [ 96%] Built target server
    [ 97%] Building CXX object examples/export-lora/CMakeFiles/export-lora.dir/export-lora.cpp.o
    [ 97%] Linking CXX executable ../../bin/export-lora
    [ 97%] Built target export-lora
    [ 98%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
    [ 99%] Linking CXX executable ../../bin/vdot
    [ 99%] Built target vdot
    [100%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
    [100%] Linking CXX executable ../../bin/q8dot
    [100%] Built target q8dot
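One detail stands out in the configure log above: CMake found the CUDA 12.2 toolkit headers, but it selected /usr/bin/nvcc, which identifies as NVIDIA 11.5.119, as the CUDA compiler. If the mismatched toolkits cause problems, a possible fix, assuming the toolkit path mentioned earlier in the thread, is to point CMake at the matching nvcc explicitly:

cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.2/bin/nvcc
cmake --build . --config Release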

But after all of this, my output still seems to be the same:

llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2694.35 ms
llama_print_timings:      sample time =     120.80 ms /   391 runs   (    0.31 ms per token,  3236.70 tokens per second)
llama_print_timings: prompt eval time =    2694.28 ms /    23 tokens (  117.14 ms per token,     8.54 tokens per second)
llama_print_timings:        eval time =  107300.07 ms /   390 runs   (  275.13 ms per token,     3.63 tokens per second)
llama_print_timings:       total time =  111005.83 ms /   413 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    3865.03 ms
llama_print_timings:      sample time =     143.29 ms /   479 runs   (    0.30 ms per token,  3342.78 tokens per second)
llama_print_timings: prompt eval time =    3864.97 ms /    32 tokens (  120.78 ms per token,     8.28 tokens per second)
llama_print_timings:        eval time =  132456.14 ms /   478 runs   (  277.10 ms per token,     3.61 tokens per second)
llama_print_timings:       total time =  137612.48 ms /   510 tokens
{'id': 'chatcmpl-e10049eb-4824-4e0a-be84-fe9b21aa6228', 'object': 'chat.completion', 'created': 1711618979, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. These llamas were not like any other llamas you might have seen before. They had a secret - they could write stories!\nOne day, the llamas decided to write a story about themselves. They gathered around a rock and began to write with their hooves. They wrote about their daily lives, their adventures, and their friendships with each other. They wrote about how they loved to graze on the lush grasses of the mountain meadows and how they enjoyed napping in the warm sunshine. They wrote about how they protected each other from predators and how they always looked out for one another.\nAs they wrote, they realized that their story was becoming more and more exciting. They wrote about how they had to escape from a group of thieves who wanted to steal their precious wool. They wrote about how they had to cross a treacherous river to reach safety. They wrote about how they had to outsmart a pack of hungry wolves who were stalking them through the mountains.\nAs they wrote, they realized that their story was becoming more and more magical. They wrote about how they could talk to each other telepathically and how they could use their wool to create beautiful tapestries and blankets. They wrote about how they could fly through the air and how they could turn invisible when they needed to hide from danger.\nAs they wrote, they realized that their story was becoming more and more wonderful. They wrote about how they had made friends with all sorts of animals in the mountains - from condors to pumas to alpacas. They wrote about how they had helped each other through difficult times and how they had celebrated their victories together. They wrote about how they had learned to appreciate the beauty and wonder of the world around them and how they had found happiness and contentment in their simple lives.\nWhen they finished their story, the llamas stood back and admired their handiwork. They had created something truly special - a story that was full of adventure, magic, and wonder. They had created a story that would be remembered for generations to come. And as they looked out over the mountains, they knew that their lives would never be the same again.'}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 478, 'total_tokens': 510}}

phymbert commented 5 months ago

The Python package does not pick up this build. Please read the docs for llama-cpp-python: llama.cpp is compiled during installation of the Python package. So remove the package and install it again with the right environment variables set.
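Per the llama-cpp-python README, the build is driven by environment variables passed at pip install time; a sketch of the reinstall, where FORCE_CMAKE=1 forces a source rebuild even if a cached wheel exists, might look like:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python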

y6t4 commented 5 months ago

> The Python package does not pick up this build. Please read the docs for llama-cpp-python: llama.cpp is compiled during installation of the Python package. So remove the package and install it again with the right environment variables set.

I executed pip uninstall llama-cpp-python, which was successful (Successfully uninstalled llama_cpp_python-0.2.57). Do I now execute

# Linux and Mac
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python

? Or do I execute CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python? Sorry for the pedantic questions, but I just want to make sure that I get the steps right.

y6t4 commented 5 months ago

OK, I executed CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python, and echo $CMAKE_ARGS outputs -DLLAMA_CUBLAS=on. Running pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir then outputs the following:

Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.57.tar.gz (36.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.9/36.9 MB 6.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 KB 16.1 MB/s eta 0:00:00
Collecting typing-extensions>=4.5.0
  Downloading typing_extensions-4.10.0-py3-none-any.whl (33 kB)
Collecting jinja2>=2.11.3
  Downloading Jinja2-3.1.3-py3-none-any.whl (133 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 KB 9.4 MB/s eta 0:00:00
Collecting numpy>=1.20.0
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 6.7 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.57-cp310-cp310-manylinux_2_35_x86_64.whl size=2776365 sha256=647af6e181c66ae6766df8064017e8fe4fc59e04f89d84c3257e043b86c1dfd2
  Stored in directory: /tmp/pip-ephem-wheel-cache-kynuetyg/wheels/7e/c0/00/e98d6e198f941c623da37b3f674354cbdccfcfb2cb9cf1133d
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, numpy, MarkupSafe, diskcache, jinja2, llama-cpp-python
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.10.0
    Uninstalling typing_extensions-4.10.0:
      Successfully uninstalled typing_extensions-4.10.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  WARNING: The script f2py is installed in '/home/me/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  Attempting uninstall: diskcache
    Found existing installation: diskcache 5.6.3
    Uninstalling diskcache-5.6.3:
      Successfully uninstalled diskcache-5.6.3
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 3.1.3
    Uninstalling Jinja2-3.1.3:
      Successfully uninstalled Jinja2-3.1.3
Successfully installed MarkupSafe-2.1.5 diskcache-5.6.3 jinja2-3.1.3 llama-cpp-python-0.2.57 numpy-1.26.4 typing-extensions-4.10.0
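
Note that a successful "Building wheel ... done" does not by itself show that the CUDA path was compiled in. A more telling check (a sketch, assuming CMAKE_ARGS is still set in the environment) is to rebuild with pip's verbose output and watch for nvcc being invoked:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python \
  --upgrade --force-reinstall --no-cache-dir --verbose

If nothing CUDA-related appears in the compile output, the wheel was built CPU-only.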

The output from running the script doesn't seem to have changed:

/bin/python3 /home/me/llama.cpp/nous-hermes-script-test1.py
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2777.20 ms
llama_print_timings:      sample time =      26.06 ms /    85 runs   (    0.31 ms per token,  3261.33 tokens per second)
llama_print_timings: prompt eval time =    2777.11 ms /    23 tokens (  120.74 ms per token,     8.28 tokens per second)
llama_print_timings:        eval time =   22884.22 ms /    84 runs   (  272.43 ms per token,     3.67 tokens per second)
llama_print_timings:       total time =   25861.04 ms /   107 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    3706.95 ms
llama_print_timings:      sample time =     143.26 ms /   480 runs   (    0.30 ms per token,  3350.50 tokens per second)
llama_print_timings: prompt eval time =    3706.89 ms /    32 tokens (  115.84 ms per token,     8.63 tokens per second)
llama_print_timings:        eval time =  132462.18 ms /   479 runs   (  276.54 ms per token,     3.62 tokens per second)
llama_print_timings:       total time =  137441.92 ms /   511 tokens
{'id': 'chatcmpl-f463e00c-b3b7-4c20-b5f8-ccef20864a33', 'object': 'chat.completion', 'created': 1711632668, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. These llamas were not like any other llamas you might have heard of. They had a secret: they could write stories!\nThe llamas would gather around every night under the stars and take turns telling each other stories. They would write them down on parchment made from llama skin and then read them aloud to each other. They had stories of adventure, romance, and mystery. They had stories of their own lives and the lives of their ancestors. They had stories that made them laugh and stories that made them cry.\nOne day, a young llama named Llama Llama joined the herd. Llama Llama was different from the others. She was shy and didn't like to talk much. She kept to herself and didn't join in on the storytelling sessions. But one night, something changed. Llama Llama overheard the other llamas talking about how they wished they could write stories that would be remembered for generations to come. Llama Llama knew she had to do something.\nThe next day, Llama Llama went off on her own and started to write. She wrote and wrote until she had written an entire book. It was a story about a llama who went on an adventure to find a magical land where all llamas could live together in peace and harmony. The other llamas were amazed when they read Llama Llama's book. They had never read anything like it before. They were so impressed that they asked Llama Llama to read it aloud to them.\nLlama Llama was nervous at first, but as she began to read, she found that she loved sharing her story with the others. They all sat around her, listening intently as she read every word. When she finished, they all cheered and applauded. They had never heard a story quite like it before. It was magical and inspiring and it made them all feel proud to be llamas.\nFrom that day on, Llama Llama became one of the most beloved members of the herd. She continued to write stories and share them with the"}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 480, 'total_tokens': 512}}
ExtReMLapin commented 5 months ago

set n_gpu_layers to -1 btw
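
With two GPUs, the placement can also be pinned down explicitly once the wheel is actually CUDA-enabled. A sketch, assuming the main_gpu and tensor_split parameters that llama-cpp-python's Llama constructor exposes:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",
    n_gpu_layers=-1,          # -1 offloads every layer
    main_gpu=0,               # device used for scratch and small tensors
    tensor_split=[0.5, 0.5],  # split the layers evenly across the two 2070s
)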

y6t4 commented 5 months ago

set n_gpu_layers to -1 btw

Yes, I had that set before, but it didn't seem to make a difference:

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",  # Download the model file first
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_gpu_layers=-1         # -1 offloads all model layers to GPU, if GPU acceleration is available
)

# Simple inference example
output = llm(
  "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
  echo=True        # Whether to echo the prompt
)

# Chat Completion API
llm = Llama(model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
print(llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
))
/bin/python3 /home/me/llama.cpp/nous-hermes-script-test1.py
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2688.22 ms
llama_print_timings:      sample time =     113.73 ms /   373 runs   (    0.30 ms per token,  3279.73 tokens per second)
llama_print_timings: prompt eval time =    2688.14 ms /    23 tokens (  116.88 ms per token,     8.56 tokens per second)
llama_print_timings:        eval time =  100433.63 ms /   372 runs   (  269.98 ms per token,     3.70 tokens per second)
llama_print_timings:       total time =  104075.06 ms /   395 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    3680.41 ms
llama_print_timings:      sample time =     141.97 ms /   480 runs   (    0.30 ms per token,  3380.97 tokens per second)
llama_print_timings: prompt eval time =    3680.35 ms /    32 tokens (  115.01 ms per token,     8.69 tokens per second)
llama_print_timings:        eval time =  129193.49 ms /   479 runs   (  269.72 ms per token,     3.71 tokens per second)
llama_print_timings:       total time =  134144.71 ms /   511 tokens
{'id': 'chatcmpl-41984894-1cd8-4213-877d-39f3fcbfe9f1', 'object': 'chat.completion', 'created': 1711637857, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. These llamas were not your ordinary llamas; they had been blessed with the ability to speak and think like humans. They had their own language and culture and had even built their own village. The village was made up of small huts made from llama wool and mud, and it was surrounded by lush green fields where the llamas grazed and played.\nThe leader of the llama village was a wise old llama named Llama Llama. He was respected by all the other llamas and was known for his kind heart and fair judgments. Llama Llama had two children, a son named Llama Llama Red Pajama and a daughter named Llama Llama Brown Fur Coat. They were both very smart and loved to explore the world around them.\nOne day, while exploring the nearby forest, Llama Llama Red Pajama and Llama Llama Brown Fur Coat stumbled upon a group of humans who had come to study the llama village. The humans were fascinated by the llama's ability to speak and think like humans and wanted to learn more about their culture and language. The llama children were excited to share their knowledge with the humans and invited them to visit their village.\nThe humans were amazed by what they saw. They had never seen anything like it before. The llama village was a peaceful and harmonious place where everyone worked together to make life better for all. The humans spent many days and nights with the llamas, learning their language and customs and sharing their own knowledge and experiences.\nAs time passed, the humans and llamas became good friends and even started to learn from each other. The humans taught the llamas how to read and write and introduced them to new ideas and technologies. The llamas, in turn, taught the humans how to live in harmony with nature and showed them how to build sustainable communities.\nYears went by and the llama village continued to thrive. The humans and llamas worked together to create a better world for all living things. The llama children grew up and had children of their own, who continued to learn and grow and share their knowledge"}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 480, 'total_tokens': 512}}
ExtReMLapin commented 5 months ago

First, I would open an issue on the llama-cpp-python bindings repo, not here.

Second, it sounds like it's not built correctly for CUDA.

You got this:

llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB

and I get this:


llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: GRID A100-40C, compute capability 8.0, VMM: no
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      CUDA0 buffer size =  4679.55 MiB

Your last pip install command seems right; mine is CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade.

If CUDA weren't installed correctly, the package would fail to build during pip install.
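
A quick way to check from Python whether the installed wheel supports GPU offload at all (a sketch; it assumes the llama_cpp module exposes llama_supports_gpu_offload, which versions around 0.2.5x do):

python3 -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"

If that prints False, no value of n_gpu_layers will ever reach the GPUs.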

y6t4 commented 5 months ago

An update on my latest effort to get this working. I uninstalled CUDA and then reinstalled everything.

nvidia-smi output: norun2

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

which nvcc output: /usr/bin/nvcc

I'm not sure if this part is where the problem lies: which nvcc lists /usr/bin/nvcc, but I also have /usr/lib/cuda and /usr/lib/nvidia-cuda-toolkit.

I began by setting export CUDA_DOCKER_ARCH=sm_75; otherwise, executing make LLAMA_CUDA=1 fails with the error "For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH.", as discussed here. Since I have two RTX 2070s (compute capability 7.5), I used sm_75, based on the information here. echo $CUDA_DOCKER_ARCH outputs sm_75.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 21375, done.
remote: Counting objects: 100% (5712/5712), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 21375 (delta 5652), reused 5598 (delta 5593), pack-reused 15663
Receiving objects: 100% (21375/21375), 26.27 MiB | 6.21 MiB/s, done.
Resolving deltas: 100% (15069/15069), done.

make LLAMA_CUDA=1

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include 
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I NVCC:      Build cuda_11.5.r11.5/compiler.30672275_0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c llama.cpp -o llama.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/common.cpp -o common.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/sampling.cpp -o sampling.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/grammar-parser.cpp -o grammar-parser.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/build-info.cpp -o build-info.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/console.cpp -o console.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda.cu -o ggml-cuda.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/acc.cu -o ggml-cuda/acc.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/alibi.cu -o ggml-cuda/alibi.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/arange.cu -o ggml-cuda/arange.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/argsort.cu -o ggml-cuda/argsort.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/binbcast.cu -o ggml-cuda/binbcast.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/clamp.cu -o ggml-cuda/clamp.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/concat.cu -o ggml-cuda/concat.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/convert.cu -o ggml-cuda/convert.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/cpy.cu -o ggml-cuda/cpy.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/diagmask.cu -o ggml-cuda/diagmask.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/dmmv.cu -o ggml-cuda/dmmv.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/getrows.cu -o ggml-cuda/getrows.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/im2col.cu -o ggml-cuda/im2col.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/mmq.cu -o ggml-cuda/mmq.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/mmvq.cu -o ggml-cuda/mmvq.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/norm.cu -o ggml-cuda/norm.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/pad.cu -o ggml-cuda/pad.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/pool2d.cu -o ggml-cuda/pool2d.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/quantize.cu -o ggml-cuda/quantize.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/rope.cu -o ggml-cuda/rope.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/scale.cu -o ggml-cuda/scale.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/softmax.cu -o ggml-cuda/softmax.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/sumrows.cu -o ggml-cuda/sumrows.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/tsembd.cu -o ggml-cuda/tsembd.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/unary.cu -o ggml-cuda/unary.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/upscale.cu -o ggml-cuda/upscale.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c unicode.cpp -o unicode.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c unicode-data.cpp -o unicode-data.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/main/main.cpp -o examples/main/main.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 

====  Run ./main -h for help.  ====

g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/train.cpp -o train.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/simple/simple.cpp -o examples/simple/simple.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/batched/batched.cpp -o examples/batched/batched.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/server/server.cpp -o examples/server/server.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  json-schema-to-grammar.o ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib  
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/llava.cpp -o examples/llava/llava.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/infill/infill.cpp -o examples/infill/infill.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/ngram-cache.cpp -o ngram-cache.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o

I now execute CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python:

Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
  Using cached llama_cpp_python-0.2.57-cp310-cp310-manylinux_2_35_x86_64.whl
Requirement already satisfied: diskcache>=5.6.1 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (5.6.3)
Requirement already satisfied: numpy>=1.20.0 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (1.26.4)
Requirement already satisfied: jinja2>=2.11.3 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (3.1.3)
Requirement already satisfied: typing-extensions>=4.5.0 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (4.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/me/.local/lib/python3.10/site-packages (from jinja2>=2.11.3->llama-cpp-python) (2.1.5)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.57
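
Note the line Using cached llama_cpp_python-0.2.57-...whl above: pip is reusing a previously built wheel, so the CMAKE_ARGS flags never reach a fresh compile and the installed package presumably has no CUDA support at all. A rebuild from source should be forced, something like the following (flags as suggested in the llama-cpp-python README):

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python \
    --upgrade --force-reinstall --no-cache-dir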

Running the script again, it still doesn't work: the log below reports BLAS = 0 and allocates only CPU buffers (llm_load_tensors: CPU buffer size), so no layers are offloaded to either GPU:

/bin/python3 /home/me/llama.cpp/nous-hermes-script-test1.py
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2554.20 ms
llama_print_timings:      sample time =      53.16 ms /   175 runs   (    0.30 ms per token,  3292.20 tokens per second)
llama_print_timings: prompt eval time =    2554.13 ms /    23 tokens (  111.05 ms per token,     9.01 tokens per second)
llama_print_timings:        eval time =   46859.94 ms /   174 runs   (  269.31 ms per token,     3.71 tokens per second)
llama_print_timings:       total time =   49834.15 ms /   197 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    3587.53 ms
llama_print_timings:      sample time =     121.52 ms /   405 runs   (    0.30 ms per token,  3332.76 tokens per second)
llama_print_timings: prompt eval time =    3587.47 ms /    32 tokens (  112.11 ms per token,     8.92 tokens per second)
llama_print_timings:        eval time =  108759.34 ms /   404 runs   (  269.21 ms per token,     3.71 tokens per second)
llama_print_timings:       total time =  113393.73 ms /   436 tokens
{'id': 'chatcmpl-bf96fa98-04d5-4817-b8d4-ad2998ff9900', 'object': 'chat.completion', 'created': 1711702434, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. They were a happy and contented bunch, spending their days grazing on the lush grasses that grew between the rocks and boulders that dotted their mountain home.\nOne day, as they were grazing, they noticed something strange in the distance. It was a group of people, dressed in strange clothing and carrying large bags on their backs. The llamas had never seen anything like them before, and they were curious.\nAs the people got closer, the llamas could see that they were carrying food and water, and they realized that these must be travelers. The llamas had heard stories of travelers passing through their land, but they had never seen any before.\nThe travelers approached the llamas and introduced themselves as explorers from a far-off land. They explained that they were on a mission to discover new places and learn about different cultures. The llamas were fascinated by their stories and asked them many questions about their journey.\nThe explorers were impressed by the llamas' intelligence and curiosity, and they decided to stay with them for a while to learn more about their way of life. The llamas showed them how to find food and water in the mountains, and they taught them how to weave baskets from the reeds that grew near the streams.\nAs the days passed, the explorers grew to love the llamas and their mountain home. They realized that they had found something special here, something that they had been searching for all their lives.\nEventually, it was time for the explorers to continue their journey, but they promised to come back and visit the llamas again someday. The llamas waved goodbye as they watched the explorers disappear over the horizon, feeling grateful for the new friends they had made and the adventures they had shared together."}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 404, 'total_tokens': 436}}
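
For reference, once a CUDA-enabled build is actually installed, the Llama constructor already exposes multi-GPU controls (n_gpu_layers, tensor_split, main_gpu). A minimal sketch for two equal cards; the even split ratio is an assumption:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,          # -1 offloads all layers to GPU
    tensor_split=[0.5, 0.5],  # assumed even split across the two RTX 2070s
)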
y6t4 commented 5 months ago

An update on my latest effort to get this working. I uninstalled CUDA and then reinstalled everything.

nvidia-smi output: norun2

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

which nvcc output: /usr/bin/nvcc

I'm not sure if this part is where the problem lies: which nvcc lists /usr/bin/nvcc (the Ubuntu-packaged toolkit), but I also have /usr/lib/cuda and /usr/lib/nvidia-cuda-toolkit, while the build flags below reference /usr/local/cuda.
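
For what it's worth, a few sanity checks show what is installed where (paths are from my machine):

which nvcc                                   # /usr/bin/nvcc (Ubuntu package)
nvcc --version                               # release 11.5 here
ls -d /usr/local/cuda* 2>/dev/null || echo "no /usr/local/cuda"
ls -d /usr/lib/cuda /usr/lib/nvidia-cuda-toolkit 2>/dev/null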

I begin by setting export CUDA_DOCKER_ARCH=sm_75; otherwise, executing make LLAMA_CUDA=1 fails with the error "For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH.", as discussed here. Since I have two RTX 2070s (Turing, compute capability 7.5), I used sm_75, based on the information here; see the sketch below. echo $CUDA_DOCKER_ARCH outputs sm_75.
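
A minimal sketch of that setup step, assuming a driver recent enough for nvidia-smi's compute_cap query:

nvidia-smi --query-gpu=name,compute_cap --format=csv   # should report 7.5 for each RTX 2070
export CUDA_DOCKER_ARCH=sm_75                          # Turing, compute capability 7.5
echo $CUDA_DOCKER_ARCH                                 # sm_75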

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 21375, done.
remote: Counting objects: 100% (5712/5712), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 21375 (delta 5652), reused 5598 (delta 5593), pack-reused 15663
Receiving objects: 100% (21375/21375), 26.27 MiB | 6.21 MiB/s, done.
Resolving deltas: 100% (15069/15069), done.

make LLAMA_CUDA=1

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include 
I NVCCFLAGS: -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I NVCC:      Build cuda_11.5.r11.5/compiler.30672275_0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c llama.cpp -o llama.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/common.cpp -o common.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/sampling.cpp -o sampling.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/grammar-parser.cpp -o grammar-parser.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/build-info.cpp -o build-info.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/console.cpp -o console.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda.cu -o ggml-cuda.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/acc.cu -o ggml-cuda/acc.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/alibi.cu -o ggml-cuda/alibi.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/arange.cu -o ggml-cuda/arange.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/argsort.cu -o ggml-cuda/argsort.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/binbcast.cu -o ggml-cuda/binbcast.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/clamp.cu -o ggml-cuda/clamp.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/concat.cu -o ggml-cuda/concat.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/convert.cu -o ggml-cuda/convert.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/cpy.cu -o ggml-cuda/cpy.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/diagmask.cu -o ggml-cuda/diagmask.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/dmmv.cu -o ggml-cuda/dmmv.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/getrows.cu -o ggml-cuda/getrows.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/im2col.cu -o ggml-cuda/im2col.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/mmq.cu -o ggml-cuda/mmq.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/mmvq.cu -o ggml-cuda/mmvq.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/norm.cu -o ggml-cuda/norm.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/pad.cu -o ggml-cuda/pad.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/pool2d.cu -o ggml-cuda/pool2d.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/quantize.cu -o ggml-cuda/quantize.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/rope.cu -o ggml-cuda/rope.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/scale.cu -o ggml-cuda/scale.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/softmax.cu -o ggml-cuda/softmax.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/sumrows.cu -o ggml-cuda/sumrows.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/tsembd.cu -o ggml-cuda/tsembd.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/unary.cu -o ggml-cuda/unary.o
nvcc -std=c++11 -O3 -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_75 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler "-std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Wno-pedantic" -c ggml-cuda/upscale.cu -o ggml-cuda/upscale.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c unicode.cpp -o unicode.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c unicode-data.cpp -o unicode-data.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/main/main.cpp -o examples/main/main.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 

====  Run ./main -h for help.  ====

g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/train.cpp -o train.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/simple/simple.cpp -o examples/simple/simple.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/batched/batched.cpp -o examples/batched/batched.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/server/server.cpp -o examples/server/server.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  json-schema-to-grammar.o ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib  
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/llava/llava.cpp -o examples/llava/llava.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/infill/infill.cpp -o examples/infill/infill.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  build-info.o ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c common/ngram-cache.cpp -o ngram-cache.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-cuda/acc.o ggml-cuda/alibi.o ggml-cuda/arange.o ggml-cuda/argsort.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/concat.o ggml-cuda/convert.o ggml-cuda/cpy.o ggml-cuda/diagmask.o ggml-cuda/dmmv.o ggml-cuda/getrows.o ggml-cuda/im2col.o ggml-cuda/mmq.o ggml-cuda/mmvq.o ggml-cuda/norm.o ggml-cuda/pad.o ggml-cuda/pool2d.o ggml-cuda/quantize.o ggml-cuda/rope.o ggml-cuda/scale.o ggml-cuda/softmax.o ggml-cuda/sumrows.o ggml-cuda/tsembd.o ggml-cuda/unary.o ggml-cuda/upscale.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/lib/wsl/lib 
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
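
The CUDA build itself completes without errors. As a sanity check that is independent of the bindings, the CUDA runtime can be queried directly from Python — a minimal sketch, assuming libcudart.so resolves on the default loader path:

import ctypes

# Ask the CUDA runtime how many devices it can see.
# (Library name/path may differ; /usr/local/cuda/lib64/libcudart.so also works.)
cudart = ctypes.CDLL("libcudart.so")

count = ctypes.c_int(0)
err = cudart.cudaGetDeviceCount(ctypes.byref(count))  # 0 == cudaSuccess
print(f"cudaGetDeviceCount: err={err}, devices={count.value}")

If this does not report 2 devices, the problem sits below llama.cpp entirely.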

I now execute CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python:

Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
  Using cached llama_cpp_python-0.2.57-cp310-cp310-manylinux_2_35_x86_64.whl
Requirement already satisfied: diskcache>=5.6.1 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (5.6.3)
Requirement already satisfied: numpy>=1.20.0 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (1.26.4)
Requirement already satisfied: jinja2>=2.11.3 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (3.1.3)
Requirement already satisfied: typing-extensions>=4.5.0 in /home/me/.local/lib/python3.10/site-packages (from llama-cpp-python) (4.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/me/.local/lib/python3.10/site-packages (from jinja2>=2.11.3->llama-cpp-python) (2.1.5)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.57
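
Note that pip reports "Using cached" for the wheel above: it reused a previously built wheel from its cache, so the CMAKE_ARGS setting was never applied to a fresh build (adding --no-cache-dir --force-reinstall to the pip command would force one). A quick way to check what the installed wheel actually supports — a sketch, assuming the low-level llama_supports_gpu_offload binding is exposed in this version:

import llama_cpp

# If pip silently reused a CPU-only wheel, this should print False.
print("llama-cpp-python", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())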

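For reference, my understanding is that the Llama constructor also exposes explicit multi-GPU parameters (tensor_split and main_gpu); a minimal sketch of how spreading the model across both 2070s should look, with illustrative proportions:

from llama_cpp import Llama

# Sketch: split the weights roughly evenly across the two RTX 2070s.
llm = Llama(
    model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,          # -1 offloads all layers
    tensor_split=[0.5, 0.5],  # fraction of the model placed on each GPU
    main_gpu=0,               # device used for small tensors / scratch buffers
)

If a correctly built CUDA wheel still loads everything onto one card with these set, that would point at the bindings rather than llama.cpp itself.
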
Re-running the script, it still doesn't work:

/bin/python3 /home/me/llama.cpp/nous-hermes-script-test1.py
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =    2554.20 ms
llama_print_timings:      sample time =      53.16 ms /   175 runs   (    0.30 ms per token,  3292.20 tokens per second)
llama_print_timings: prompt eval time =    2554.13 ms /    23 tokens (  111.05 ms per token,     9.01 tokens per second)
llama_print_timings:        eval time =   46859.94 ms /   174 runs   (  269.31 ms per token,     3.71 tokens per second)
llama_print_timings:       total time =   49834.15 ms /   197 tokens
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =    62.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    73.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    3587.53 ms
llama_print_timings:      sample time =     121.52 ms /   405 runs   (    0.30 ms per token,  3332.76 tokens per second)
llama_print_timings: prompt eval time =    3587.47 ms /    32 tokens (  112.11 ms per token,     8.92 tokens per second)
llama_print_timings:        eval time =  108759.34 ms /   404 runs   (  269.21 ms per token,     3.71 tokens per second)
llama_print_timings:       total time =  113393.73 ms /   436 tokens
{'id': 'chatcmpl-bf96fa98-04d5-4817-b8d4-ad2998ff9900', 'object': 'chat.completion', 'created': 1711702434, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. They were a happy and contented bunch, spending their days grazing on the lush grasses that grew between the rocks and boulders that dotted their mountain home.\nOne day, as they were grazing, they noticed something strange in the distance. It was a group of people, dressed in strange clothing and carrying large bags on their backs. The llamas had never seen anything like them before, and they were curious.\nAs the people got closer, the llamas could see that they were carrying food and water, and they realized that these must be travelers. The llamas had heard stories of travelers passing through their land, but they had never seen any before.\nThe travelers approached the llamas and introduced themselves as explorers from a far-off land. They explained that they were on a mission to discover new places and learn about different cultures. The llamas were fascinated by their stories and asked them many questions about their journey.\nThe explorers were impressed by the llamas' intelligence and curiosity, and they decided to stay with them for a while to learn more about their way of life. The llamas showed them how to find food and water in the mountains, and they taught them how to weave baskets from the reeds that grew near the streams.\nAs the days passed, the explorers grew to love the llamas and their mountain home. They realized that they had found something special here, something that they had been searching for all their lives.\nEventually, it was time for the explorers to continue their journey, but they promised to come back and visit the llamas again someday. The llamas waved goodbye as they watched the explorers disappear over the horizon, feeling grateful for the new friends they had made and the adventures they had shared together."}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 404, 'total_tokens': 436}}
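(The giveaway in the run above is "llm_load_tensors: CPU buffer size = 7245.25 MiB" with no ggml_init_cublas lines and no CUDA0/CUDA1 buffers: the whole model stayed in system RAM and inference ran on the CPU.)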

Ok, I think it just started working. As described here, I executed CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir; the --force-reinstall and --no-cache-dir flags make pip rebuild the wheel from source instead of reusing the cached CPU-only one:

Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.57.tar.gz (36.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.9/36.9 MB 6.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0
  Downloading typing_extensions-4.10.0-py3-none-any.whl (33 kB)
Collecting numpy>=1.20.0
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 6.7 MB/s eta 0:00:00
Collecting jinja2>=2.11.3
  Downloading Jinja2-3.1.3-py3-none-any.whl (133 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 KB 8.5 MB/s eta 0:00:00
Collecting diskcache>=5.6.1
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 KB 25.6 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.57-cp310-cp310-manylinux_2_35_x86_64.whl size=26229263 sha256=af3ef61406bd130f99db790713bc78015a3042ae07bd406a3235448db94330d6
  Stored in directory: /tmp/pip-ephem-wheel-cache-0ff_ypas/wheels/7e/c0/00/e98d6e198f941c623da37b3f674354cbdccfcfb2cb9cf1133d
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, numpy, MarkupSafe, diskcache, jinja2, llama-cpp-python
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.10.0
    Uninstalling typing_extensions-4.10.0:
      Successfully uninstalled typing_extensions-4.10.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: MarkupSafe
    Found existing installation: MarkupSafe 2.1.5
    Uninstalling MarkupSafe-2.1.5:
      Successfully uninstalled MarkupSafe-2.1.5
  Attempting uninstall: diskcache
    Found existing installation: diskcache 5.6.3
    Uninstalling diskcache-5.6.3:
      Successfully uninstalled diskcache-5.6.3
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 3.1.3
    Uninstalling Jinja2-3.1.3:
      Successfully uninstalled Jinja2-3.1.3
  Attempting uninstall: llama-cpp-python
    Found existing installation: llama_cpp_python 0.2.57
    Uninstalling llama_cpp_python-0.2.57:
      Successfully uninstalled llama_cpp_python-0.2.57
Successfully installed MarkupSafe-2.1.5 diskcache-5.6.3 jinja2-3.1.3 llama-cpp-python-0.2.57 numpy-1.26.4 typing-extensions-4.10.0
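This time the log shows "Building wheel for llama-cpp-python (pyproject.toml) ... done", i.e. the package was actually compiled from source with the CUDA flag active. As a quick sanity check that an installed build supports GPU offload, something like the following should work (a minimal sketch; llama_supports_gpu_offload is the low-level binding exposed by the llama_cpp module in recent 0.2.x versions, but verify the name against your installed version):

# Hedged sketch: check whether the installed llama-cpp-python build
# was compiled with a GPU backend (CUDA here). If this prints False,
# pip most likely reused a cached CPU-only wheel.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())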

And now the output of my script seems to use CUDA:

/bin/python3 /home/me/llama.cpp/nous-hermes-script-test1.py
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
  Device 1: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      CUDA0 buffer size =  3671.41 MiB
llm_load_tensors:      CUDA1 buffer size =  3487.90 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   400.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   368.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    62.50 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   352.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   352.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.02 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 3
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Using fallback chat format: None

llama_print_timings:        load time =     254.46 ms
llama_print_timings:      sample time =      48.26 ms /   170 runs   (    0.28 ms per token,  3522.51 tokens per second)
llama_print_timings: prompt eval time =     254.39 ms /    23 tokens (   11.06 ms per token,    90.41 tokens per second)
llama_print_timings:        eval time =    4039.99 ms /   169 runs   (   23.91 ms per token,    41.83 tokens per second)
llama_print_timings:       total time =    4674.89 ms /   192 tokens
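(Compare these timings with the CPU-only run earlier: prompt eval went from about 9 to about 90 tokens per second, and eval from about 3.7 to about 41.8 tokens per second, which is consistent with the layers actually running on the GPUs.)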
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/49 layers to GPU
llm_load_tensors:        CPU buffer size =  7245.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   173.05 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph nodes  = 1588
llama_new_context_with_model: graph splits = 532
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '48', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}

llama_print_timings:        load time =    1533.44 ms
llama_print_timings:      sample time =     117.53 ms /   384 runs   (    0.31 ms per token,  3267.31 tokens per second)
llama_print_timings: prompt eval time =    1533.38 ms /    32 tokens (   47.92 ms per token,    20.87 tokens per second)
llama_print_timings:        eval time =  105243.88 ms /   383 runs   (  274.79 ms per token,     3.64 tokens per second)
llama_print_timings:       total time =  107787.89 ms /   415 tokens
{'id': 'chatcmpl-daebdedb-d4bf-4e64-8cee-ae291841bd1a', 'object': 'chat.completion', 'created': 1711707551, 'model': './models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '\nOnce upon a time, in the high Andes mountains of Peru, there lived a herd of llamas. They grazed on the lush grasses and drank from the clear mountain streams. The llamas were happy and content, living their simple lives in the beautiful mountain landscape.\nOne day, a group of travelers came through the valley where the llamas lived. The travelers were amazed by the beauty of the llamas and their gentle nature. They decided to take some of the llamas with them on their journey.\nThe llamas were sad to leave their home and their friends, but they knew that they had to go with the travelers. The travelers treated them well and took good care of them. The llamas soon became accustomed to their new life and enjoyed traveling with the travelers.\nAs they journeyed through the land, the llamas saw many new and wonderful things. They saw cities and towns, forests and deserts, and even oceans and beaches. The llamas were amazed by all the different places they visited and the people they met along the way.\nEventually, the travelers and the llamas reached their destination, a beautiful valley far away from their home in the Andes. The travelers decided to settle down and start a new life in this new place. The llamas were happy to have a new home and new friends to play with.\nYears went by and the llamas grew old and wise. They passed down their stories and their wisdom to the younger llamas who came after them. The llamas became known as wise and gentle creatures who could travel far and wide and still find their way back home.\nAnd so, the llamas continued to live their lives in the beautiful valley, surrounded by their friends and loved ones, content and happy in their new home.'}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 383, 'total_tokens': 415}}

Can anyone confirm whether this shows that it's working?

I executed watch -n 1 nvidia-smi and watched as the script ran. At the beginning, both GPUs spike in usage, which I take to be a positive development. But after a bit, while the script is still running, the usage of both GPUs drops to 0 for a moment. After that, it goes back to the previous situation: the first GPU is busy and the second GPU sits at 0% usage. Is this normal? Is this actually working? Is the problem fixed?
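One detail worth noting in the log above: the second model load prints "llm_load_tensors: offloaded 0/49 layers to GPU", because the second Llama(...) call (the chat-completion one) does not pass n_gpu_layers, which defaults to 0. That alone would explain the GPUs going idle partway through the run while the script keeps going on the CPU. A minimal sketch of a constructor that requests offload explicitly (split_mode and main_gpu are the llama-cpp-python counterparts of llama.cpp's -sm and -mg flags; the constant name below matches the 0.2.x bindings, but verify against your installed version):

# Hedged sketch: every Llama instance needs its own GPU options;
# n_gpu_layers=-1 asks for all layers, and LLAMA_SPLIT_MODE_LAYER
# (constant name assumed from the 0.2.x bindings) splits them across GPUs.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf",
    chat_format="llama-2",
    n_gpu_layers=-1,                              # offload all 49 layers
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,  # distribute layers over both GPUs
    main_gpu=0,                                   # device used for small tensors
)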

dspasyuk commented 5 months ago

I have Llama.cui running on Ubuntu with the following settings to use one GPU:

../llama.cpp/main --model ../../models/Einstein-v4-7B-Q6_K.gguf --n-gpu-layers 35 -ins --simple-io -b 2048 --ctx_size 2048 --temp 0.1 --top_k 10 -mg 0 -sm none --multiline-input --repeat_penalty 1.15 -t 4 -r "/n>" -f ./Alice.txt --log-disable --no-penalize-nl

or the following settings to use both GPUs:

../llama.cpp/main --model ../../models/Einstein-v4-7B-Q6_K.gguf --n-gpu-layers 35 -ins --simple-io -b 2048 --ctx_size 2048 --temp 0.1 --top_k 10 -mg 0 -sm layer --multiline-input --repeat_penalty 1.15 -t 4 -r "/n>" -f ./Alice.txt --log-disable --no-penalize-nl

Try running llama.cpp directly with the settings above, without the Python bindings.


The full code can be found here: https://github.com/dspasyuk/llama.cui
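For anyone coming at this from the Python side, the rough llama-cpp-python equivalents of the -sm and -mg flags above would be something like this (a sketch under the same naming assumptions as earlier; tensor_split additionally controls uneven memory splits if the GPUs differ):

# Hedged sketch: Python-side analogues of the CLI flags used above.
import llama_cpp
from llama_cpp import Llama

# -sm none -mg 0 : keep the whole model on GPU 0
single_gpu = Llama(
    model_path="../../models/Einstein-v4-7B-Q6_K.gguf",
    n_gpu_layers=35,
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_NONE,
    main_gpu=0,
)

# -sm layer : spread the layers across all visible GPUs
both_gpus = Llama(
    model_path="../../models/Einstein-v4-7B-Q6_K.gguf",
    n_gpu_layers=35,
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,
)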

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.