obriensystems opened 9 months ago
https://huggingface.co/TheBloke/EstopianMaid-13B-GGUF
capybarahermes-2.5-mistral-7b.Q8_0.gguf
michaelobrien@mbp7 llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed = 1706993530
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 7205.84 MiB, ( 7205.91 / 21845.34)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 7205.84 MiB
llm_load_tensors: CPU buffer size = 132.82 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 64.00 MiB, ( 7270.59 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 7270.61 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 80.31 MiB, ( 7350.91 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 80.30 MiB
llama_new_context_with_model: CPU compute buffer size = 8.80 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Choose your hosting service
Step 2: Purchase your domain name
Step 3: Install WordPress (or another content management system)
Step 4: Customize the design of your website
Step 5: Choose your plugins and apps
Step 6: Write and publish blog posts
Step 7: Set up social media accounts to promote your website
Step 8: Create an email list for newsletter subscribers
Step 9: Optimize your site for search engines (SEO)
Step 10: Monitor and analyze website traffic
If you’re just starting out in the world of web design, building a website can seem daunting. But don’t worry—with a bit of guidance and some elbow grease, it’s totally doable. Here are the 10 simple steps to help you get started.
Step 1: Choose your hosting service
The first step in creating any website is to choose a web hosting service. This is where your site will “live” on the internet. There are many different hosting providers out there, but some of the most popular include Bluehost, SiteGround and HostGator. When choosing a host, make sure to consider factors like price, uptime, support, and features.
Step 2: Purchase your domain name
Once you’ve decided on a web hosting service, it’s time to purchase a domain name—the unique URL that people will use to visit your site (e.g., www.yourwebsite.com). Most hosting providers offer a free domain name for the first year when you sign up for their service.
Step 3: Install WordPress (or another content management system)
WordPress is one of the most popular content management systems (CMS) used to build websites. It’s easy to use, highly customizable, and has a vast library of plugins and themes that
llama_print_timings: load time = 751.44 ms
llama_print_timings: sample time = 34.78 ms / 400 runs ( 0.09 ms per token, 11500.53 tokens per second)
llama_print_timings: prompt eval time = 94.82 ms / 19 tokens ( 4.99 ms per token, 200.38 tokens per second)
llama_print_timings: eval time = 10506.25 ms / 399 runs ( 26.33 ms per token, 37.98 tokens per second)
llama_print_timings: total time = 10709.84 ms / 418 tokens
ggml_metal_free: deallocating
Log end
On some runs the sampling throughput drops from ~12,000 tokens/s to ~6,600 tokens/s:
llama_print_timings: load time = 4597.89 ms
llama_print_timings: sample time = 60.10 ms / 400 runs ( 0.15 ms per token, 6655.80 tokens per second)
llama_print_timings: prompt eval time = 95.01 ms / 19 tokens ( 5.00 ms per token, 199.98 tokens per second)
llama_print_timings: eval time = 10496.35 ms / 399 runs ( 26.31 ms per token, 38.01 tokens per second)
llama_print_timings: total time = 10767.95 ms / 418 tokens
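For reference, the tokens-per-second figures above are just the number of sampler runs divided by the elapsed sample time; a minimal sketch of that arithmetic, with the values copied from the two timing blocks above:

```python
# Minimal sketch: reproduce the sampling tokens/s figures from llama_print_timings.
# tokens/s = sampler runs / (sample time in seconds); values copied from the logs above.
def tokens_per_second(time_ms: float, runs: int) -> float:
    return runs / (time_ms / 1000.0)

print(tokens_per_second(34.78, 400))  # ~11,500 tokens/s (the fast run above)
print(tokens_per_second(60.10, 400))  # ~6,656 tokens/s (the slow run above)
```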
M2 Ultra
(venv-metal310) michaelobrien@MichaelacStudio llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1706993532
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 7205.84 MiB, ( 7205.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 7205.84 MiB
llm_load_tensors: CPU buffer size = 132.82 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 64.00 MiB, ( 7271.47 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 7271.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 80.31 MiB, ( 7351.78 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 80.30 MiB
llama_new_context_with_model: CPU compute buffer size = 8.80 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Plan your website
Step 2: Choose a web hosting service
Step 3: Register a domain name for your site
Step 4: Create the website design and layout
Step 5: Write the content of your website
Step 6: Build and develop the website
Step 7: Add images and multimedia to your website
Step 8: Test, test, test!
Step 9: Publish your website
Step 10: Promote your website
Let’s take a look at each of these steps in more detail.
Step 1: Plan Your Website
Before you start building your site, it’s essential to have a clear plan in place. Begin by identifying the purpose of your website and who its target audience is. This will help you determine what content you need to include on your site, as well as how best to present that information. It’s also important to decide which features you want your website to have (e.g., a blog, an online store, a contact form).
Step 2: Choose a Web Hosting Service
A web hosting service is responsible for storing all the files that make up your site on a server so that it can be accessed by others over the internet. There are many different web hosting providers available, so take some time to compare their plans and pricing before choosing one that meets your needs.
Step 3: Register a Domain Name for Your Site
Your domain name is essentially the address of your website on the internet (e.g., www.example.com). You’ll need to choose a unique domain name that reflects the purpose of your site and registers it with an accredited registrar. Most web hosting providers offer domain registration services as well, so you may be able to register your domain through them.
Step 4: Create the Website Design and Layout
Once you have a clear plan in place for
llama_print_timings: load time = 343.36 ms
llama_print_timings: sample time = 31.80 ms / 400 runs ( 0.08 ms per token, 12576.64 tokens per second)
llama_print_timings: prompt eval time = 75.09 ms / 19 tokens ( 3.95 ms per token, 253.03 tokens per second)
llama_print_timings: eval time = 6685.71 ms / 399 runs ( 16.76 ms per token, 59.68 tokens per second)
llama_print_timings: total time = 6835.63 ms / 418 tokens
ggml_metal_free: deallocating
Log end
A larger model (Llama 70B class, ~40 GB) fits in the M2 Ultra's ~50 GB VRAM allocation:
https://huggingface.co/meta-llama
https://huggingface.co/models?sort=trending&search=gguf
https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF
https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF/tree/main
40 GB: https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF/blob/main/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf
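For reproducibility, one way to pull that ~40 GB file into llama.cpp's models/ directory is the huggingface_hub client; a minimal sketch, assuming the repo id and filename from the links above (requires `pip install huggingface_hub`):

```python
# Minimal sketch: download the ~40 GB Q4_K_S GGUF referenced above into models/.
# Repo id and filename are taken from the Hugging Face links above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF",
    filename="miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf",
    local_dir="models",
)
print(path)  # then: ./main -m models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf ...
```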
(venv-metal310) michaelobrien@MichaelacStudio llama.cpp % ./main -m models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1706995139
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = D:\HF
llama_model_loader: - kv 2: llama.context_length u32 = 32764
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_K: 471 tensors
llama_model_loader: - type q5_K: 90 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32764
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32764
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 36.55 GiB (4.55 BPW)
llm_load_print_meta: general.name = D:\HF
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 36864.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 771.84 MiB, offs = 38439649280, (37635.91 / 49152.00)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: Metal buffer size = 37430.75 MiB
llm_load_tensors: CPU buffer size = 140.62 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 160.00 MiB, (37797.47 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (37797.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 159.52 MiB, (37956.98 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 159.50 MiB
llama_new_context_with_model: CPU compute buffer size = 17.60 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Decide what you want your site to do, and what kind of content it will feature.
Step 2: Choose a domain name that represents your brand or the purpose of your site.
Step 3: Select a web hosting provider that fits your needs and budget.
Step 4: Set up your website by installing a content management system (CMS) such as WordPress, Joomla, or Drupal.
Step 5: Customize the design of your site using templates, themes, or custom coding.
Step 6: Add content to your site, including text, images, videos, and other media.
Step 7: Optimize your site for search engines by using keywords, meta tags, and other SEO techniques.
Step 8: Test your site on different devices and browsers to ensure it looks and functions properly.
Step 9: Launch your site and promote it through social media, email marketing, and other channels.
Step 10: Monitor and update your site regularly to keep it fresh and engaging for visitors. [end of text]
llama_print_timings: load time = 55100.48 ms
llama_print_timings: sample time = 19.08 ms / 228 runs ( 0.08 ms per token, 11950.31 tokens per second)
llama_print_timings: prompt eval time = 553.74 ms / 19 tokens ( 29.14 ms per token, 34.31 tokens per second)
llama_print_timings: eval time = 18283.04 ms / 227 runs ( 80.54 ms per token, 12.42 tokens per second)
llama_print_timings: total time = 18889.22 ms / 246 tokens
ggml_metal_free: deallocating
Log end
As expected, the ~40 GB model cannot load on a 32 GB MacBook Pro M1 Max: memory usage sawtooths eight times and then the run crashes. Only the M2 Ultra, with at least 64 GB of unified memory and its ~50 GB VRAM allocation, runs the 40 GB model without issue. The 192 GB M2 Ultra tops out at roughly 79%, or about 153 GB of VRAM. (A rough fit check against the Metal working-set limit is sketched after the crash log below.)
michaelobrien@mbp7 llama.cpp % ./main -m models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed = 1706995902
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = D:\HF
llama_model_loader: - kv 2: llama.context_length u32 = 32764
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_K: 471 tensors
llama_model_loader: - type q5_K: 90 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32764
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32764
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 36.55 GiB (4.55 BPW)
llm_load_print_meta: general.name = D:\HF
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 16384.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 16384.00 MiB, offs = 16964812800
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 5072.94 MiB, offs = 33929625600, (37841.00 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: Metal buffer size = 37430.75 MiB
llm_load_tensors: CPU buffer size = 140.62 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 160.00 MiB, (38001.69 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_kv_cache_init: Metal KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (38001.70 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 159.52 MiB, (38161.20 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: Metal compute buffer size = 159.50 MiB
llama_new_context_with_model: CPU compute buffer size = 17.60 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_metal_graph_compute: command buffer 6 failed with status 5
system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1:ggml_metal_graph_compute: command buffer 6 failed with status 5
atherineggml_metal_graph_compute: command buffer 6 failed with status 5
usto
[Restored Feb 3, 2024 at 16:46:15]
Last login: Sat Feb 3 16:46:15 on ttys001
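A rough way to predict that failure before launching: sum the Metal buffers llama.cpp reports (model weights, KV cache, compute buffer) and compare against the working-set limit shown in the allocation lines (21845.34 MiB on the 32 GB M1 Max, 49152.00 MiB on the 64 GB M2 Ultra). A minimal sketch with a hypothetical helper, using the MiB figures from the logs above:

```python
# Rough fit check (not llama.cpp's own accounting): sum the Metal model buffer,
# KV cache, and compute buffer (MiB, taken from the logs above) and compare against
# the working-set limit shown in the Metal allocation lines.
def fits_in_working_set(model_buf_mib, kv_mib, compute_mib, limit_mib):
    total = model_buf_mib + kv_mib + compute_mib
    return total, total <= limit_mib

# miqu-1-70b Q4_K_S on the 32 GB M1 Max (limit ~21845 MiB): over budget -> crash above
print(fits_in_working_set(37430.75, 160.00, 159.50, 21845.34))  # (37750.25, False)
# The same model on the 64 GB M2 Ultra (limit ~49152 MiB): fits
print(fits_in_working_set(37430.75, 160.00, 159.50, 49152.00))  # (37750.25, True)
```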
Requested formal Meta access via https://huggingface.co/meta-llama/Llama-2-70b/tree/main
https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF
49 GB: https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q5_K_M.gguf
25 GB: https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q2_K.gguf
michaelobrien@MichaelacStudio llama.cpp % ./main -m models/codellama-70b-hf.Q5_K_M.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1707071785
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-70b-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = codellama_codellama-70b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 36864.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 9663.92 MiB, offs = 38439534592, (46527.98 / 49152.00)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: Metal buffer size = 46322.72 MiB
llm_load_tensors: CPU buffer size = 171.96 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 160.00 MiB, (46689.55 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (46689.56 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 159.52 MiB, (46849.06 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 159.50 MiB
llama_new_context_with_model: CPU compute buffer size = 17.60 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
factorial function in java 2148
factorial function in java 6647
Factorial of a non-negative integer, is multiplication of all integers smaller than or equal to n. For example factorial of 3 is 3*2*1 which is 6.
We have discussed how to compute factorial of large numbers using BigInteger in previous post. In this post we will discuss another way to calculate factorial of large number using java program. This method can be used in programming competitions where memory and time are constraints. We need only print the result so it is not necessary to store complete result.
1) Start from leftmost digit, i = 0
2) Find value of (arr[i]*x + carry)%10 and update carry
3) If current digit becomes 0, then delete that digit from array
4) Decrement size
Following is implementation of above idea.
*/
// A function to multiply x with the number represented by arr[].
// This function uses simple school mathematics for multiplication.
// This function may value of result if the input integer has
// more digits
int multiply(int x, vector<int> &arr) {
int carry = 0; // Initialize carry
// One by one multiply n with individual digits of res[]
for (int i=0; i < arr.size(); i++) {
int prod = arr[i] * x + carry;
// Store last digit of 'prod' in res[]
arr[i] = prod % 10;
// Put rest in carry
carry = prod/10;
}
// Put carry in res and increase result size
while (carry) {
llama_print_timings: load time = 69713.02 ms
llama_print_timings: sample time = 32.47 ms / 400 runs ( 0.08 ms per token, 12320.20 tokens per second)
llama_print_timings: prompt eval time = 597.63 ms / 9 tokens ( 66.40 ms per token, 15.06 tokens per second)
llama_print_timings: eval time = 45779.20 ms / 399 runs ( 114.73 ms per token, 8.72 tokens per second)
llama_print_timings: total time = 46466.62 ms / 408 tokens
ggml_metal_free: deallocating
Log end
Reruns use RAM already cached by the OS, so load time drops sharply:
llama_print_timings: load time = 2994.07 ms
llama_print_timings: sample time = 19.76 ms / 249 runs ( 0.08 ms per token, 12603.77 tokens per second)
llama_print_timings: prompt eval time = 595.16 ms / 9 tokens ( 66.13 ms per token, 15.12 tokens per second)
llama_print_timings: eval time = 28286.44 ms / 248 runs ( 114.06 ms per token, 8.77 tokens per second)
llama_print_timings: total time = 28935.86 ms / 257 tokens
ggml_metal_free: deallocating
Access granted by Meta: https://huggingface.co/meta-llama/Llama-2-70b/tree/main
Google's C4 dataset (~800 GB of processed Common Crawl): https://github.com/allenai/allennlp/discussions/5056
LLM perf leaderboard: https://huggingface.co/spaces/optimum/llm-perf-leaderboard
https://huggingface.co/YokaiKoibito/falcon-40b-GGUF/tree/main
M2 Ultra
michaelobrien@MichaelacStudio llama.cpp % ./main -m models/llama-2-13b-chat.Q8_0.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1707105596
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 49152.00)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 13023.86 MiB
llm_load_tensors: CPU buffer size = 166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 400.00 MiB, (13425.48 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU input buffer size = 11.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (13425.50 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 82.52 MiB, (13508.00 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 82.50 MiB
llama_new_context_with_model: CPU compute buffer size = 11.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
factorial function in java 21 Apr 2016
The factorial of a number is the product of that number with all the preceding integers. For example, the factorial of 4 is 4 x 3 x 2 x 1 = 24. In Java, you can write a function to calculate the factorial of an integer using recursion or loops. Here's an example of how to do it using recursion:
public static int factorial(int n) { if (n == 0) { return 1; // base case } else { return n * factorial(n-1); // recursive case } }
Explanation:
* The function takes an integer `n` as input.
* If `n` is 0, the function returns 1 (the base case).
* Otherwise, it calculates the factorial of `n-1` using the recursive call `factorial(n-1)`, and then multiplies the result by `n`.
Here's an example of how to use this function:
System.out.println(factorial(5)); // prints 120
This will calculate the factorial of 5, which is 5 x 4 x 3 x 2 x 1 = 120.
You can also write a loop-based implementation of the factorial function, like this:
public static int factorial(int n) { int result = 1; // initialization for (int i = 1; i <= n; i++) { result *= i; } return result; }
Explanation:
* The function takes an integer `n` as input.
* It initializes a variable `result` to
llama_print_timings: load time = 19859.64 ms
llama_print_timings: sample time = 33.11 ms / 400 runs ( 0.08 ms per token, 12081.67 tokens per second)
llama_print_timings: prompt eval time = 110.85 ms / 9 tokens ( 12.32 ms per token, 81.19 tokens per second)
llama_print_timings: eval time = 10988.76 ms / 399 runs ( 27.54 ms per token, 36.31 tokens per second)
llama_print_timings: total time = 11174.67 ms / 408 tokens
ggml_metal_free: deallocating
Log end
M1 Max
michaelobrien@mbp7 llama.cpp % ./main -m models/llama-2-13b-chat.Q8_0.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed = 1707105556
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 13023.86 MiB
llm_load_tensors: CPU buffer size = 166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 400.00 MiB, (13424.61 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU input buffer size = 11.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (13424.62 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 82.52 MiB, (13507.12 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 82.50 MiB
llama_new_context_with_model: CPU compute buffer size = 11.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
factorial function in java 21 Jan 06, 2012
Recursive Factorial Function in Java 5 examples
A recursive factorial function in Java is a function that calls itself repeatedly until it reaches the base case. Here are five examples of how to implement a recursive factorial function in Java:
Example 1: Basic Recursive Factorial Function
public static int factorial(int n) {
    if (n == 0) {
        return 1; // base case
    } else {
        return n * factorial(n - 1); // recursive call
    }
}
This function takes an integer `n` as input and returns its factorial. The function checks if `n` is equal to 0, in which case the result is simply 1. Otherwise, it calculates the product of `n` and the factorial of `n-1`, which is calculated recursively.
Example 2: Factorial Function with Memoization
// assumes: import java.util.HashMap; import java.util.Map;
private static final Map<Integer, Integer> memoizedFactorials = new HashMap<>();

public static int factorial(int n) {
    if (n == 0) {
        return 1; // base case
    } else if (memoizedFactorials.containsKey(n)) {
        return memoizedFactorials.get(n); // use memoized value
    } else {
        int result = n * factorial(n - 1); // calculate the result
        memoizedFactorials.put(n, result); // cache the result
        return result;
    }
}
This function is similar to the previous example, but it uses memoization to store previously calculated values of the factorial. This can improve the performance of the function if `n` is large and the function is called many times with smaller values of `n`.
Example 3: Factorial Function with Dynamic Programming
public static int factorial(int n) {
if (n ==
llama_print_timings: load time = 15574.77 ms
llama_print_timings: sample time = 36.15 ms / 400 runs ( 0.09 ms per token, 11066.23 tokens per second)
llama_print_timings: prompt eval time = 161.53 ms / 9 tokens ( 17.95 ms per token, 55.72 tokens per second)
llama_print_timings: eval time = 18173.40 ms / 399 runs ( 45.55 ms per token, 21.96 tokens per second)
llama_print_timings: total time = 18437.33 ms / 408 tokens
ggml_metal_free: deallocating
Log end
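The generation above stops partway through Example 3 because it hit the n_predict = 400 token limit. As a reference point for the truncated "Dynamic Programming" heading, a minimal table-based factorial in Java could look like the sketch below; this is an editorial illustration, not part of the model's output:

    // Example 3 (sketch): factorial via a bottom-up table, i.e. dynamic programming.
    // Note: long overflows for n > 20.
    public static long factorial(int n) {
        long[] table = new long[n + 1]; // table[i] will hold i!
        table[0] = 1;
        for (int i = 1; i <= n; i++) {
            table[i] = i * table[i - 1]; // reuse the previously computed value
        }
        return table[n];
    }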
M1 Max, 32-core GPU:
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 400.00 MiB, (13424.61 / 21845.34)
llama_print_timings: sample time = 36.15 ms / 400 runs ( 0.09 ms per token, 11066.23 tokens per second)
llama_print_timings: prompt eval time = 161.53 ms / 9 tokens ( 17.95 ms per token, 55.72 tokens per second)
llama_print_timings: eval time = 18173.40 ms / 399 runs ( 45.55 ms per token, 21.96 tokens per second)
M2 Ultra, 60-core GPU:
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 49152.00)
llama_print_timings: sample time = 33.11 ms / 400 runs ( 0.08 ms per token, 12081.67 tokens per second)
llama_print_timings: prompt eval time = 110.85 ms / 9 tokens ( 12.32 ms per token, 81.19 tokens per second)
llama_print_timings: eval time = 10988.76 ms / 399 runs ( 27.54 ms per token, 36.31 tokens per second)
The M2 Ultra is about 65% faster on eval than the M1 Max (36.31 vs 21.96 tokens per second), with 88% more GPU cores (60 vs 32).
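As a sanity check on those percentages, a tiny Java snippet using the eval throughput and GPU core counts from the logs above (the class name is just illustrative):

    public class EvalSpeedup {
        public static void main(String[] args) {
            double m1MaxTps = 21.96;   // M1 Max eval tokens/second (from the log)
            double m2UltraTps = 36.31; // M2 Ultra eval tokens/second (from the log)
            int m1MaxCores = 32;
            int m2UltraCores = 60;

            // 36.31 / 21.96 - 1 = ~0.65 -> ~65% faster eval
            // 60 / 32 - 1 = 0.875 -> ~88% more GPU cores
            System.out.printf("eval speedup: %.0f%%, extra GPU cores: %.0f%%%n",
                    (m2UltraTps / m1MaxTps - 1) * 100.0,
                    ((double) m2UltraCores / m1MaxCores - 1) * 100.0);
        }
    }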
Review/comment/thank-you: https://github.com/ggerganov/llama.cpp/discussions/4167
article on
Add the server command:
michaelobrien@mbp7 llama.cpp % ./server -m models/llama-2-13b-chat.Q8_0.gguf -c 4096
available slots:
-> Slot 0 - max context: 4096
{"timestamp":1707182485,"level":"INFO","function":"main","line":2555,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52827,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 15 tokens
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 958.18 ms / 15 tokens ( 63.88 ms per token, 15.65 tokens per second)
print_timings: eval time = 3295.68 ms / 73 runs ( 45.15 ms per token, 22.15 tokens per second)
print_timings: total time = 4253.86 ms
slot 0 released (88 tokens in cache)
{"timestamp":1707182559,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52830,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 76]
slot 0 : in cache: 86 tokens | to process: 15 tokens
slot 0 : kv cache rm - [86, end)
print_timings: prompt eval time = 862.64 ms / 15 tokens ( 57.51 ms per token, 17.39 tokens per second)
print_timings: eval time = 3382.12 ms / 74 runs ( 45.70 ms per token, 21.88 tokens per second)
print_timings: total time = 4244.76 ms
slot 0 released (175 tokens in cache)
{"timestamp":1707182582,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52835,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 153]
slot 0 : in cache: 173 tokens | to process: 17 tokens
slot 0 : kv cache rm - [173, end)
print_timings: prompt eval time = 667.33 ms / 17 tokens ( 39.25 ms per token, 25.47 tokens per second)
print_timings: eval time = 6123.09 ms / 133 runs ( 46.04 ms per token, 21.72 tokens per second)
print_timings: total time = 6790.41 ms
slot 0 released (323 tokens in cache)
{"timestamp":1707182702,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52843,"status":200,"method":"POST","path":"/completion","params":{}}
michaelobrien@mbp7 tensorflow % curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "describe quantum computing at Google","n_predict": 2}'
{"content":"\n\n","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"models/llama-2-13b-chat.Q8_0.gguf","n_ctx":4096,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":true,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"models/llama-2-13b-chat.Q8_0.gguf","prompt":"describe quantum computing at Google","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":45.216,"predicted_n":2,"predicted_per_second":44.232130219391365,"predicted_per_token_ms":22.608,"prompt_ms":611.727,"prompt_n":6,"prompt_per_second":9.808296838295515,"prompt_per_token_ms":101.9545},"tokens_cached":7,"tokens_evaluated":6,"tokens_predicted":2,"truncated":false}%
Blog: https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe and LinkedIn post: https://www.linkedin.com/posts/michaelobrien-developer_running-70b-llama-2-llm-locally-metal-3-via-activity-7160125112103370753-dya9?utm_source=share&utm_medium=member_desktop
Test setup: git clone https://github.com/ggerganov/llama.cpp; model: https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF