Closed: kechan closed this issue 1 year ago
That's true. Q8_0 is not supported under Metal as of now. Same for Q5_0 and Q5_1.
Just out of curiosity, is there a technical limitation to why these aren't supported, or have these just not been implemented?
No limitations - should be easy to support. PRs welcome
@ggerganov if this isn't too hard to do, I can try to take a look if you give me some pointers, but I haven't worked with C/C++ for many years and am extremely rusty. I would be interested to compare 70B q4 and q8; that's what prompted my post. I just want to check how much quantization can degrade the biggest models.
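A minimal sketch of such a side-by-side check with llama-cpp-python (the file names, prompt, and token count below are assumptions, not tested values, and both runs stay on the CPU since Q8_0 is not offloaded to Metal yet):

from llama_cpp import Llama

prompt = "Explain what superconductors are."  # placeholder prompt

# Placeholder paths: a Q4_0 and a Q8_0 quantization of the same 70B model.
paths = {
    "q4_0": "./llama-2-70b-chat.ggmlv3.q4_0.bin",
    "q8_0": "./llama-2-70b-chat.ggmlv3.q8_0.bin",
}

for name, path in paths.items():
    # CPU only (n_gpu_layers=0) because Q8_0 has no Metal kernels yet;
    # n_gqa=8 is required for the 70B ggmlv3 models.
    lm = Llama(path, n_ctx=2048, n_gpu_layers=0, n_gqa=8)
    out = lm(prompt, max_tokens=256)
    print(f"--- {name} ---")
    print(out["choices"][0]["text"])

Comparing the two outputs (or running a proper perplexity evaluation) would give a rough sense of how much the Q4_0 quantization degrades the model relative to Q8_0.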
For me, llama-2-13b-chat.ggmlv3.q8_0.bin works great on a Mac M1 Max with 64GB RAM, using the pyllamacpp package and Python 3.9. After installing the package with pip install pyllamacpp, just run this sample code:
from pyllamacpp.model import Model

# Path to the local Q8_0 GGML model file
prompt = "I want you to act as a physician. Explain what superconductors are."
model_path = './llama-2-13b-chat.ggmlv3.q8_0.bin'

model = Model(model_path)
for token in model.generate(prompt):
    print(token, end='', flush=True)
Output of code:
$python testLLM13B.py
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 15237.95 MB (+ 3216.00 MB per state)
.
llama_init_from_file: kv self size = 800.00 MB
Explain their properties and the potential benefits they offer.
Superconductors are materials that exhibit zero electrical resistance when cooled below a certain temperature, known as the critical temperature (Tc). This means that superconductors can conduct electricity with perfect efficiency and without any loss of energy.
The properties of superconductors include:
1. Zero electrical resistance: Superconductors have zero electrical resistance when cooled below Tc, which makes them ideal for high-power applications such as power transmission and storage.
2. Perfect diamagnetism: Superconductors expel magnetic fields when cooled below Tc, which makes them useful in MRI machines and other medical applications.
3. Quantum levitation: Superconductors can levitate above a magnet when cooled below Tc, which has potential applications in transportation and energy storage.
4. High-temperature superconductivity: Some superconductors have critical temperatures above the boiling point of liquid nitrogen (77 K), making them more practical for real-world applications.
The potential benefits of superconductors include:
1. More efficient power transmission and storage: Superconductors can transmit and store electricity with perfect efficiency, which could lead to significant energy savings and reduced carbon emissions.
2. Improved medical imaging: Superconducting magnets are used in MRI machines, which provide higher-resolution images and faster scan times than traditional magnets.
3. High-speed transportation: Superconductors could be used to create magnetic levitation trains that are faster and more efficient than conventional trains.
4. Enhanced security: Superconducting sensors can detect even slight changes in magnetic fields, which could be useful in security applications such as intrusion detection.
5. Energy storage: Superconductors could be used to store energy generated by renewable sources such as wind and solar power, which could help to reduce our reliance on fossil fuels.
Overall, superconductors have the potential to revolutionize a wide range of industries and provide significant benefits to society. However, more research is needed to fully understand their properties and potential applications.
Is this running on CPU or Metal? 8-bit works fine on CPU.
So far, on my Mac M1 Max (64GB RAM, 10-core CPU, 32-core GPU):
llama-2-7b-chat.ggmlv3.q8_0.bin, llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin work with the CPU (do not forget the parameter n_gqa = 8 for the 70B model).
llama-2-7b-chat.ggmlv3.q4_0.bin and llama-2-13b-chat.ggmlv3.q4_0.bin work with the GPU (Metal).
llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin do not work with the GPU (Metal).
Installation:
conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py
Working code for M1 Metal GPU:

from llama_cpp import Llama

model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'

# Offload all layers to the Metal GPU (n_gpu_layers larger than the layer count)
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=130)

output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
            max_tokens=1000,
            stream=True)

for token in output:
    print(token['choices'][0]['text'], end='', flush=True)
Code output:
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x106d3d160
ggml_metal_init: loaded kernel_add_row 0x106d3f350
ggml_metal_init: loaded kernel_mul 0x106e05250
ggml_metal_init: loaded kernel_mul_row 0x106e05a40
ggml_metal_init: loaded kernel_scale 0x106e066a0
ggml_metal_init: loaded kernel_silu 0x106e072e0
ggml_metal_init: loaded kernel_relu 0x106e05ca0
ggml_metal_init: loaded kernel_gelu 0x106e079c0
ggml_metal_init: loaded kernel_soft_max 0x107204810
ggml_metal_init: loaded kernel_diag_mask_inf 0x106e08830
ggml_metal_init: loaded kernel_get_rows_f16 0x106e08a90
ggml_metal_init: loaded kernel_get_rows_q4_0 0x106e09400
ggml_metal_init: loaded kernel_get_rows_q4_1 0x106e09cd0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x106e0a3c0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x106e0aa90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x106e0b190
ggml_metal_init: loaded kernel_get_rows_q5_K 0x106e0b890
ggml_metal_init: loaded kernel_get_rows_q6_K 0x106e0bf90
ggml_metal_init: loaded kernel_rms_norm 0x106e0c6b0
ggml_metal_init: loaded kernel_norm 0x106e0ce20
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x106e0de10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x106e0e620
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x12a7a4690
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x12a7a4cf0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x12a7a5cc0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x12a7a6480
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x12a7a6c10
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x12a7a7390
ggml_metal_init: loaded kernel_rope 0x12a7a53f0
ggml_metal_init: loaded kernel_alibi_f32 0x106d3e600
ggml_metal_init: loaded kernel_cpy_f32_f16 0x106d3f860
ggml_metal_init: loaded kernel_cpy_f32_f32 0x106d3fe30
ggml_metal_init: loaded kernel_cpy_f16_f16 0x106d40fc0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
I'm looking for the most famous mathematicians of all time, and I want to know who the most influential mathematicians are in different areas of mathematics. Please provide a list of famous mathematicians that meet my criteria:
Born between 1800 and 2000
Made significant contributions to their respective fields (such as calculus, geometry, number theory, etc.)
Are widely recognized for their work and have had a lasting impact on the field of mathematics.
Here is a list of famous mathematicians that meet your criteria:
1. Carl Friedrich Gauss (1777-1855) - Gauss made significant contributions to number theory, geometry, and calculus. He is considered one of the greatest mathematicians of all time and is known as the "prince of mathematics."
2. Georg Cantor (1845-1918) - Cantor developed the theory of set theory and transfinite numbers, which revolutionized mathematics and had a lasting impact on modern mathematics.
3. David Hilbert (1862-1943) - Hilbert is known for his work on infinite-dimensional vector spaces, calculus, and number theory. He is considered one of the most important mathematicians of the 20th century.
4. Emmy Noether (1882-1935) - Noether made significant contributions to abstract algebra and is known for her work on symmetries in physics. She is considered one of the most important female mathematicians of all time.
5. Albert Einstein (1879-1955) - Einstein is known for his work on relativity, which had a lasting impact on modern physics and mathematics. He is also known for his work on Brownian motion and the photoelectric effect.
6. Andrew Wiles (1953-present) - Wiles made headlines in 1994 when he proved Fermat's Last Theorem, which had been unsolved for over 350 years. He is considered one of the most important mathematicians of the 20th century.
7. Grigori Perelman (1966-present) - Perelman made significant contributions to the field of geometry and is known for his work on the Poincaré conjecture, which was solved in 2003. He is considered one of the most important mathematicians of the 21st century.
8. Terence Tao (1975-present) - Tao is a polymath who has made significant contributions to many areas of mathematics, including harmonic analysis, partial differential equations, and number theory. He is considered one of the most important mathematicians of the 21st century.
9. Maryam Mirzakhani (1978-2017) - Mirzakhani was a brilliant mathematician who made significant contributions to the field of geometry and is known for her work on the dynamics and symmetry of curved spaces. She was the first woman to win the Fields Medal, which is considered the most prestigious award in mathematics.
10. Ngô Bảo Châu (1972-present) - Châu is a Vietnamese-French mathematician who has made significant contributions to number theory and algebraic geometry. He was awarded the Fields Medal in 2010 for his work on the Langlands program, which is a vast web of connections between different areas of mathematics.
Please note that this is not an exhaustive list, and there are many other famous mathematicians who have made significant contributions to their respective fields. However, these individuals are widely recognized as some of the most influential mathematicians of all time
llama_print_timings: load time = 2044.25 ms
llama_print_timings: sample time = 617.99 ms / 808 runs ( 0.76 ms per token, 1307.46 tokens per second)
llama_print_timings: prompt eval time = 2044.22 ms / 24 tokens ( 85.18 ms per token, 11.74 tokens per second)
llama_print_timings: eval time = 31352.04 ms / 807 runs ( 38.85 ms per token, 25.74 tokens per second)
llama_print_timings: total time = 35253.11 ms
.
ggml_metal_free: deallocating
Non-working code for M1 Metal GPU:

from llama_cpp import Llama

model_path = './llama-2-70b-chat.ggmlv3.q4_0.bin'

# n_gqa = 8 is required for the 70B ggmlv3 model
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=130,
           n_gqa=8)

output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
            max_tokens=1000,
            stream=True)

for token in output:
    print(token['choices'][0]['text'], end='', flush=True)
Code output:
llama.cpp: loading model from ./llama-2-70b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 37854.96 MB (+ 640.00 MB per state)
llama_new_context_with_model: kv self size = 640.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x11ee961d0
ggml_metal_init: loaded kernel_add_row 0x11ee98480
ggml_metal_init: loaded kernel_mul 0x10eebf420
ggml_metal_init: loaded kernel_mul_row 0x10eec0120
ggml_metal_init: loaded kernel_scale 0x10eebf680
ggml_metal_init: loaded kernel_silu 0x10eec1430
ggml_metal_init: loaded kernel_relu 0x10eec0380
ggml_metal_init: loaded kernel_gelu 0x10eec1b70
ggml_metal_init: loaded kernel_soft_max 0x10eec27e0
ggml_metal_init: loaded kernel_diag_mask_inf 0x10eec2c70
ggml_metal_init: loaded kernel_get_rows_f16 0x10eec3730
ggml_metal_init: loaded kernel_get_rows_q4_0 0x10eec3df0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x10eec4680
ggml_metal_init: loaded kernel_get_rows_q2_K 0x10eec4d70
ggml_metal_init: loaded kernel_get_rows_q3_K 0x10ef93be0
ggml_metal_init: loaded kernel_get_rows_q4_K 0x10ef949d0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x104218670
ggml_metal_init: loaded kernel_get_rows_q6_K 0x10ef94c30
ggml_metal_init: loaded kernel_rms_norm 0x10ef959f0
ggml_metal_init: loaded kernel_norm 0x10ef96610
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x10ef95f90
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x10ef96ed0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x10ef97780
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x10ef985a0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x10ef98ed0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x10ef99e60
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x10ef9a5f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x10ef9b260
ggml_metal_init: loaded kernel_rope 0x10ef9b860
ggml_metal_init: loaded kernel_alibi_f32 0x10431e330
ggml_metal_init: loaded kernel_cpy_f32_f16 0x10431ef10
ggml_metal_init: loaded kernel_cpy_f32_f32 0x10431f9e0
ggml_metal_init: loaded kernel_cpy_f16_f16 0x10431fc40
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 36864.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 412.30 MB, offs = 38439649280, (37276.75 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.00 MB, (37300.75 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 642.00 MB, (37942.75 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 456.00 MB, (38398.75 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (38702.75 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
GGML_ASSERT: /private/var/folders/fw/wjnxhm6n7bv6bwlk4pkxtdq00000gp/T/pip-install-lt5z7o3y/llama-cpp-python_6789b9807ac84e2ab2c3dcb9e071c493/vendor/llama.cpp/ggml-metal.m:612: ne02 == ne12
GGML_ASSERT: /private/var/folders/fw/wjnxhm6n7bv6bwlk4pkxtdq00000gp/T/pip-install-lt5z7o3y/llama-cpp-python_6789b9807ac84e2ab2c3dcb9e071c493/vendor/llama.cpp/ggml-metal.m:612: ne02 == ne12
Abort trap: 6
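A workaround sketch for the 70B model, keeping it on the CPU as in the list above; this is the same setup as the failing example with Metal offload disabled (n_gpu_layers=0), so it is slow but does not hit the Metal assert:

from llama_cpp import Llama

model_path = './llama-2-70b-chat.ggmlv3.q4_0.bin'

# Metal offload disabled (n_gpu_layers=0); n_gqa=8 is still required
# for the 70B ggmlv3 files.
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=0,
           n_gqa=8)

output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
            max_tokens=1000,
            stream=True)

for token in output:
    print(token['choices'][0]['text'], end='', flush=True)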
Got the same issue while using the GPU with Metal (-ngl 1).
The ggml-model-q4_0.gguf and ggml-model-q4_1.gguf files work fine with GPU Metal, but ggml-model-q5_0.gguf and ggml-model-q8_0.gguf throw an error:
main: build = 1048 (8942dbc)
main: seed = 1692862016
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from ./models/13B/chinese-alpaca-2-13b-hf/ggml-model-q5_0.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor 0: token_embd.weight q5_0 [ 5120, 55296, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 199: blk.22.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 200: blk.22.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 208: blk.23.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 209: blk.23.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 217: blk.24.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 226: blk.25.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 235: blk.26.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 244: blk.27.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 253: blk.28.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 254: blk.28.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 262: blk.29.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 263: blk.29.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 277: blk.30.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 280: blk.31.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 289: blk.32.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 290: blk.32.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 291: blk.32.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 292: blk.32.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 293: blk.32.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 294: blk.32.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 295: blk.32.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 296: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 297: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 298: blk.33.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 299: blk.33.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 300: blk.33.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 301: blk.33.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 302: blk.33.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 303: blk.33.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 304: blk.33.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 305: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 306: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 307: blk.34.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 308: blk.34.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 309: blk.34.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 310: blk.34.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 311: blk.34.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 312: blk.34.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 313: blk.34.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 314: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 315: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 316: blk.35.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 317: blk.35.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 318: blk.35.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 319: blk.35.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 320: blk.35.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 321: blk.35.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 322: blk.35.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 323: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 324: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 325: blk.36.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 326: blk.36.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 327: blk.36.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 328: blk.36.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 329: blk.36.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 330: blk.36.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 331: blk.36.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 332: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 333: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 334: blk.37.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.37.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 336: blk.37.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.37.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 338: blk.37.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 339: blk.37.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 340: blk.37.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 342: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 343: blk.38.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.38.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 345: blk.38.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.38.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.38.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 348: blk.38.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 349: blk.38.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 352: blk.39.attn_q.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.39.attn_k.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 354: blk.39.attn_v.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.39.attn_output.weight q5_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 357: blk.39.ffn_up.weight q5_0 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 358: blk.39.ffn_down.weight q5_0 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: output.weight q6_K [ 5120, 55296, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q5_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 55296
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q5_0
llm_load_print_meta: model size = 13.25 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: mem required = 8727.55 MB (+ 400.00 MB per state)
.................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/username/Documents/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x13d6087f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x13d609040 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x13d609580 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x13d609bd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x13d60a110 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x13d60a650 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x13d60ab90 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x13d60b0d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x13d60b7a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x13d60be20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x13d60c4f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13d60cd30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13d60d400 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13d60dad0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13d60e1a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13d60e870 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13d60ef40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13d60f610 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x13d60fcf0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x13d610530 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13d610e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13d611580 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13d611d00 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13d612600 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13d612d80 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13d613500 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13d613c80 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13d614600 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13d615020 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13d6157e0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13d615fa0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13d616760 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13d616ca0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13d617460 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x13d617c20 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x13d6183e0 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x13d618920 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x13d619200 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x13d619ab0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x13d61a360 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x13d61ac10 | th_max = 1024 | th_width = 32
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 119.41 MB
llama_new_context_with_model: max tensor size = 221.48 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 757.98 MB, offs = 8357675008, ( 8950.42 / 10922.67)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.42 MB, ( 8951.84 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 9353.84 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 118.02 MB, ( 9471.86 / 10922.67)
system_info: n_threads = 6 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
GGML_ASSERT: ggml-metal.m:907: false && "not implemented"
[1] 21623 abort ./main -m ./models/13B/chinese-alpaca-2-13b-hf/ggml-model-q5_0.gguf -n 512 -i
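For checking which quantizations survive Metal offload without retyping the command each time, here is a rough probe, assuming a locally compiled ./main and the GGUF paths above; -m, -p, -n and -ngl are the standard llama.cpp flags, and -ngl 1 is enough to trigger the missing-kernel assert:

import subprocess

def probe_metal(model_path: str) -> bool:
    # Launch ./main with one GPU layer and a tiny generation; a GGML_ASSERT
    # aborts the process, so a non-zero (negative) return code means failure.
    result = subprocess.run(
        ["./main", "-m", model_path, "-p", "Hello", "-n", "8", "-ngl", "1"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

for path in [
    "./models/13B/chinese-alpaca-2-13b-hf/ggml-model-q4_0.gguf",
    "./models/13B/chinese-alpaca-2-13b-hf/ggml-model-q5_0.gguf",
    "./models/13B/chinese-alpaca-2-13b-hf/ggml-model-q8_0.gguf",
]:
    print(path, "OK under Metal" if probe_metal(path) else "aborted under Metal")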
I compiled with "LLAMA_METAL=1 make" on an M2 Max. Running ./main -m ./models/13B/llama-2-13b-chat.ggmlv3.q8_0.bin -ngl 8
should at least not throw any error (I know I have to specify more specific params), but it throws:
GGML_ASSERT: ggml-metal.m:905: false && "not implemented"
zsh: abort ./main -m ./models/13B/llama-2-13b-chat.ggmlv3.q8_0.bin --temp 0.0 -n -1 1.1
I am on an Apple M2 Max and got the weights from https://huggingface.co/TheBloke...
I tried the q4 version and it worked. So is this not supported for q8?