ObrienlabsDev / machine-learning

Machine Learning - AI - Tensorflow - Keras - NVidia - Google

llama.cpp on Mac Silicon M1Max and M2Ultra #7

Open obriensystems opened 9 months ago

obriensystems commented 9 months ago

Blog: https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe
LinkedIn: https://www.linkedin.com/posts/michaelobrien-developer_running-70b-llama-2-llm-locally-metal-3-via-activity-7160125112103370753-dya9?utm_source=share&utm_medium=member_desktop

Test setup: clone https://github.com/ggerganov/llama.cpp and use the model from https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF (a rough sketch of the steps follows).
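For context, a minimal sketch of how this environment is typically prepared, assuming the standard Makefile build and a huggingface-cli download (these commands are not taken from the logs; Metal is built by default on Apple Silicon llama.cpp builds of this period):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make    # builds ./main with the Metal backend on Apple Silicon

# fetch the GGUF weights into models/ (assumes: pip install -U "huggingface_hub[cli]")
huggingface-cli download TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF \
    capybarahermes-2.5-mistral-7b.Q4_K_M.gguf --local-dir models

# run a short generation, matching the invocation in the logs below
./main -m models/capybarahermes-2.5-mistral-7b.Q4_K_M.gguf -p "The capital of paris is" -n 400 -e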

michaelobrien@mbp7 llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q4_K_M.gguf -p "The capital of paris is" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed  = 1706913125
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4095.08 MiB, ( 4095.14 / 21845.34)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  4095.06 MiB
llm_load_tensors:        CPU buffer size =    70.32 MiB
................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 4159.83 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 4159.84 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    80.31 MiB, ( 4240.14 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =    80.30 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 The capital of paris is known for its romantic atmosphere and its beautiful architecture. The city boasts numerous museums and galleries, including the Louvre, which houses some of the world’s most famous works of art.

In addition to the iconic Eiffel Tower, there are plenty of other attractions that make Paris an unforgettable destination for tourists. From the Notre-Dame Cathedral and Sacré-Cœur Basilica to the Champs-Élysées shopping district and Montmartre’s artistic quarter, there is something for everyone in Paris.

If you’re planning a trip to this beautiful city, here are some tips to help make your visit as enjoyable as possible:

1. Plan ahead – Do your research beforehand and plan out the attractions you want to see, the neighborhoods you want to explore and the restaurants you want to dine at. This will save you time and ensure that you don’t miss anything important during your stay in Paris.

2. Get a Paris Pass – If you’re planning on seeing several tourist attractions, consider getting a Paris Pass. This pass gives you access to over 60 top attractions in the city, including the Louvre, the Eiffel Tower and Versailles Palace, as well as unlimited use of public transportation for the duration of your stay.

3. Visit off-season – If possible, plan your trip during the low season (November to March), when the crowds are smaller, lines are shorter and accommodation prices are lower.

4. Take a guided tour – A guided tour can be a great way to learn about the history and culture of Paris while also getting insider tips on the best places to visit. There are many types of tours available, including walking tours, bus tours and food tours.

5. Explore the neighborhoods – Paris is divided into 20 arrond
llama_print_timings:        load time =    1217.90 ms
llama_print_timings:      sample time =      34.66 ms /   400 runs   (    0.09 ms per token, 11542.35 tokens per second)
llama_print_timings: prompt eval time =     105.91 ms /     7 tokens (   15.13 ms per token,    66.09 tokens per second)
llama_print_timings:        eval time =    8428.90 ms /   399 runs   (   21.13 ms per token,    47.34 tokens per second)
llama_print_timings:       total time =    8642.72 ms /   406 tokens
ggml_metal_free: deallocating
Log end

michaelobrien@mbp7 llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed  = 1706913176
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4095.08 MiB, ( 4095.14 / 21845.34)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  4095.06 MiB
llm_load_tensors:        CPU buffer size =    70.32 MiB
................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 4159.83 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 4159.84 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    80.31 MiB, ( 4240.14 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =    80.30 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Determine the purpose of your website.
Step 2: Choose a platform to build your website.
Step 3: Register your domain name.
Step 4: Select a hosting provider.
Step 5: Choose a theme for your website.
Step 6: Install necessary plugins.
Step 7: Create content for your website.
Step 8: Design the layout of your website.
Step 9: Test and optimize your website.
Step 10: Launch your website and promote it.

Building a website is an essential step in creating an online presence for your business, brand or personal blog. Follow this simple guide to learn how to build a website in just 10 steps:

Step 1: Determine the purpose of your website.
Before you start building a website, it’s crucial to determine its purpose. Is it for selling products, sharing information or promoting a business? Knowing the goal of your website will help you make informed decisions when designing and building it.

Step 2: Choose a platform to build your website.
There are many website-building platforms available, such as WordPress, Wix, Shopify or Squarespace. Each platform has its own pros and cons, so choose one that suits your needs and budget. For instance, WordPress is popular for its flexibility, while Wix is known for its user-friendly interface.

Step 3: Register your domain name.
Your domain name is the address people will use to access your website (e.g., google.com). Choose a name that’s easy to remember and represents your brand or purpose. You can register your domain name through a domain registrar such as GoDaddy, Namecheap or Google Domains.

Step 4: Select a hosting provider.
A website needs to be hosted on a server to make it accessible online. Select a
llama_print_timings:        load time =     309.51 ms
llama_print_timings:      sample time =      34.11 ms /   400 runs   (    0.09 ms per token, 11726.42 tokens per second)
llama_print_timings: prompt eval time =     109.07 ms /    19 tokens (    5.74 ms per token,   174.20 tokens per second)
llama_print_timings:        eval time =    8389.33 ms /   399 runs   (   21.03 ms per token,    47.56 tokens per second)
llama_print_timings:       total time =    8610.81 ms /   418 tokens
ggml_metal_free: deallocating
Log end
obriensystems commented 9 months ago

https://huggingface.co/TheBloke/EstopianMaid-13B-GGUF

Testing the Q8_0 quantization: capybarahermes-2.5-mistral-7b.Q8_0.gguf
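As a rough size check against the loader output below and the earlier Q4_K_M run: at 8.50 bits per weight, 7.24 B parameters come to roughly 7.24e9 × 8.50 / 8 ≈ 7.7 GB ≈ 7.17 GiB, versus about 4.07 GiB at 4.83 BPW for Q4_K_M. The Q8_0 file is therefore close to twice the size, but still fits comfortably in the M1 Max's ~21.8 GiB Metal working set.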

michaelobrien@mbp7 llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed  = 1706993530
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  7205.84 MiB, ( 7205.91 / 21845.34)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  7205.84 MiB
llm_load_tensors:        CPU buffer size =   132.82 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 7270.59 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 7270.61 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    80.31 MiB, ( 7350.91 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =    80.30 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Choose your hosting service
Step 2: Purchase your domain name
Step 3: Install WordPress (or another content management system)
Step 4: Customize the design of your website
Step 5: Choose your plugins and apps
Step 6: Write and publish blog posts
Step 7: Set up social media accounts to promote your website
Step 8: Create an email list for newsletter subscribers
Step 9: Optimize your site for search engines (SEO)
Step 10: Monitor and analyze website traffic

If you’re just starting out in the world of web design, building a website can seem daunting. But don’t worry—with a bit of guidance and some elbow grease, it’s totally doable. Here are the 10 simple steps to help you get started.

Step 1: Choose your hosting service
The first step in creating any website is to choose a web hosting service. This is where your site will “live” on the internet. There are many different hosting providers out there, but some of the most popular include Bluehost, SiteGround and HostGator. When choosing a host, make sure to consider factors like price, uptime, support, and features.

Step 2: Purchase your domain name
Once you’ve decided on a web hosting service, it’s time to purchase a domain name—the unique URL that people will use to visit your site (e.g., www.yourwebsite.com). Most hosting providers offer a free domain name for the first year when you sign up for their service.

Step 3: Install WordPress (or another content management system)
WordPress is one of the most popular content management systems (CMS) used to build websites. It’s easy to use, highly customizable, and has a vast library of plugins and themes that
llama_print_timings:        load time =     751.44 ms
llama_print_timings:      sample time =      34.78 ms /   400 runs   (    0.09 ms per token, 11500.53 tokens per second)
llama_print_timings: prompt eval time =      94.82 ms /    19 tokens (    4.99 ms per token,   200.38 tokens per second)
llama_print_timings:        eval time =   10506.25 ms /   399 runs   (   26.33 ms per token,    37.98 tokens per second)
llama_print_timings:       total time =   10709.84 ms /   418 tokens
ggml_metal_free: deallocating
Log end

Sometimes the sampling throughput drops from ~12,000 tokens per second to ~6,600 tokens per second:

llama_print_timings:        load time =    4597.89 ms
llama_print_timings:      sample time =      60.10 ms /   400 runs   (    0.15 ms per token,  6655.80 tokens per second)
llama_print_timings: prompt eval time =      95.01 ms /    19 tokens (    5.00 ms per token,   199.98 tokens per second)
llama_print_timings:        eval time =   10496.35 ms /   399 runs   (   26.31 ms per token,    38.01 tokens per second)
llama_print_timings:       total time =   10767.95 ms /   418 tokens
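The printed rates follow directly from the counts and wall-clock times. For the slower run above: 400 sampled tokens / 0.0601 s ≈ 6,656 tokens per second for sampling, and 399 generated tokens / 10.496 s ≈ 38.0 tokens per second for evaluation; the faster Q8_0 run sampled at 400 / 0.0348 s ≈ 11,500 tokens per second. Sampling is a tiny fraction of total time either way, so end-to-end generation speed is governed by the eval rate.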

M2 Ultra, same Q8_0 model and prompt:

(venv-metal310) michaelobrien@MichaelacStudio llama.cpp % ./main -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1706993532
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  7205.84 MiB, ( 7205.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  7205.84 MiB
llm_load_tensors:        CPU buffer size =   132.82 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 7271.47 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 7271.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    80.31 MiB, ( 7351.78 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =    80.30 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Plan your website
Step 2: Choose a web hosting service
Step 3: Register a domain name for your site
Step 4: Create the website design and layout
Step 5: Write the content of your website
Step 6: Build and develop the website
Step 7: Add images and multimedia to your website
Step 8: Test, test, test!
Step 9: Publish your website
Step 10: Promote your website

Let’s take a look at each of these steps in more detail.

Step 1: Plan Your Website
Before you start building your site, it’s essential to have a clear plan in place. Begin by identifying the purpose of your website and who its target audience is. This will help you determine what content you need to include on your site, as well as how best to present that information. It’s also important to decide which features you want your website to have (e.g., a blog, an online store, a contact form).

Step 2: Choose a Web Hosting Service
A web hosting service is responsible for storing all the files that make up your site on a server so that it can be accessed by others over the internet. There are many different web hosting providers available, so take some time to compare their plans and pricing before choosing one that meets your needs.

Step 3: Register a Domain Name for Your Site
Your domain name is essentially the address of your website on the internet (e.g., www.example.com). You’ll need to choose a unique domain name that reflects the purpose of your site and registers it with an accredited registrar. Most web hosting providers offer domain registration services as well, so you may be able to register your domain through them.

Step 4: Create the Website Design and Layout
Once you have a clear plan in place for
llama_print_timings:        load time =     343.36 ms
llama_print_timings:      sample time =      31.80 ms /   400 runs   (    0.08 ms per token, 12576.64 tokens per second)
llama_print_timings: prompt eval time =      75.09 ms /    19 tokens (    3.95 ms per token,   253.03 tokens per second)
llama_print_timings:        eval time =    6685.71 ms /   399 runs   (   16.76 ms per token,    59.68 tokens per second)
llama_print_timings:       total time =    6835.63 ms /   418 tokens
ggml_metal_free: deallocating
Log end
obriensystems commented 9 months ago

A larger ~40 GB model (70B-class Llama) fits within the M2 Ultra's ~50 GB VRAM allocation.
References:
https://huggingface.co/meta-llama
https://huggingface.co/models?sort=trending&search=gguf
https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF
https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF/tree/main (40 GB)
https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF/blob/main/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf
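The loader output below is consistent with the 40 GB figure: 68.98 B parameters at 4.55 bits per weight is roughly 68.98e9 × 4.55 / 8 ≈ 39.2 GB ≈ 36.6 GiB of weights, plus KV cache and compute buffers, which sits under the M2 Ultra's 49,152 MiB (48 GiB) Metal working-set limit.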

(venv-metal310) michaelobrien@MichaelacStudio llama.cpp % ./main -m models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1706995139
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = D:\HF
llama_model_loader: - kv   2:                       llama.context_length u32              = 32764
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 14
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_K:  471 tensors
llama_model_loader: - type q5_K:   90 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.55 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = D:\HF
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 36864.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   771.84 MiB, offs =  38439649280, (37635.91 / 49152.00)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      Metal buffer size = 37430.75 MiB
llm_load_tensors:        CPU buffer size =   140.62 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   160.00 MiB, (37797.47 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (37797.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   159.52 MiB, (37956.98 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =   159.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.60 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Decide what you want your site to do, and what kind of content it will feature.
Step 2: Choose a domain name that represents your brand or the purpose of your site.
Step 3: Select a web hosting provider that fits your needs and budget.
Step 4: Set up your website by installing a content management system (CMS) such as WordPress, Joomla, or Drupal.
Step 5: Customize the design of your site using templates, themes, or custom coding.
Step 6: Add content to your site, including text, images, videos, and other media.
Step 7: Optimize your site for search engines by using keywords, meta tags, and other SEO techniques.
Step 8: Test your site on different devices and browsers to ensure it looks and functions properly.
Step 9: Launch your site and promote it through social media, email marketing, and other channels.
Step 10: Monitor and update your site regularly to keep it fresh and engaging for visitors. [end of text]

llama_print_timings:        load time =   55100.48 ms
llama_print_timings:      sample time =      19.08 ms /   228 runs   (    0.08 ms per token, 11950.31 tokens per second)
llama_print_timings: prompt eval time =     553.74 ms /    19 tokens (   29.14 ms per token,    34.31 tokens per second)
llama_print_timings:        eval time =   18283.04 ms /   227 runs   (   80.54 ms per token,    12.42 tokens per second)
llama_print_timings:       total time =   18889.22 ms /   246 tokens
ggml_metal_free: deallocating
Log end
obriensystems commented 9 months ago

As expected, the ~40 GB model cannot run on a 32 GB MacBook Pro M1 Max: memory usage sawtooths eight times and the run then crashes. Only an M2 Ultra with at least 64 GB of unified memory, and therefore a ~50 GB VRAM (Metal working-set) allocation, runs the 40 GB model without issue. A 192 GB M2 Ultra tops out at about 79%, or roughly 153 GB of VRAM.
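The log below makes the failure mode concrete: llama.cpp allocates 37,430.75 MiB of Metal buffers for the weights, but on the 32 GB M1 Max the recommended Metal working set is only 21,845.34 MiB (about two thirds of 32,768 MiB), so every allocation past that point prints a "greater than the recommended max working set size" warning and the compute graph eventually fails (command buffer status 5). On the 64 GB M2 Ultra the limit is 49,152 MiB (75% of 65,536 MiB), leaving headroom above the ~37.4 GiB of weights.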

michaelobrien@mbp7 llama.cpp % ./main -m models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed  = 1706995902
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/miqu-1-70b-Requant-b2035-iMat-c32_ch400-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = D:\HF
llama_model_loader: - kv   2:                       llama.context_length u32              = 32764
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 14
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_K:  471 tensors
llama_model_loader: - type q5_K:   90 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.55 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = D:\HF
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 16384.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 16384.00 MiB, offs =  16964812800
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  5072.94 MiB, offs =  33929625600, (37841.00 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      Metal buffer size = 37430.75 MiB
llm_load_tensors:        CPU buffer size =   140.62 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   160.00 MiB, (38001.69 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_kv_cache_init:      Metal KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (38001.70 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   159.52 MiB, (38161.20 / 21845.34)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model:      Metal compute buffer size =   159.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.60 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_metal_graph_compute: command buffer 6 failed with status 5

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1:ggml_metal_graph_compute: command buffer 6 failed with status 5
atherineggml_metal_graph_compute: command buffer 6 failed with status 5
usto
  [Restored Feb 3, 2024 at 16:46:15]
Last login: Sat Feb  3 16:46:15 on ttys001
obriensystems commented 9 months ago
Screenshot 2024-02-03 at 17 05 57 Screenshot 2024-02-03 at 17 06 14
obriensystems commented 9 months ago

Requested formal Meta access via https://huggingface.co/meta-llama/Llama-2-70b/tree/main

obriensystems commented 9 months ago

https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF
49G https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q5_K_M.gguf
25G https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q2_K.gguf
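Optional: the GGUF files above can also be pulled from the command line with the huggingface_hub Python package instead of the browser. A minimal sketch, assuming huggingface_hub is installed (pip install huggingface_hub) and that the file should land in llama.cpp's models/ folder to match the -m path used below:

# Sketch: download the Q5_K_M CodeLlama-70B GGUF into llama.cpp's models/ folder.
# Assumes: pip install huggingface_hub; repo and file names are taken from the links above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-70B-hf-GGUF",
    filename="codellama-70b-hf.Q5_K_M.gguf",
    local_dir="models",  # matches ./main -m models/codellama-70b-hf.Q5_K_M.gguf below
)
print(path)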

michaelobrien@MichaelacStudio llama.cpp % ./main -m models/codellama-70b-hf.Q5_K_M.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1707071785
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-70b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = codellama_codellama-70b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 36864.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  9663.92 MiB, offs =  38439534592, (46527.98 / 49152.00)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      Metal buffer size = 46322.72 MiB
llm_load_tensors:        CPU buffer size =   171.96 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   160.00 MiB, (46689.55 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (46689.56 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   159.52 MiB, (46849.06 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =   159.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.60 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 factorial function in java 2148
factorial function in java 6647
Factorial of a non-negative integer, is multiplication of all integers smaller than or equal to n. For example factorial of 3 is 3*2*1 which is 6.

We have discussed how to compute factorial of large numbers using BigInteger in previous post. In this post we will discuss another way to calculate factorial of large number using java program. This method can be used in programming competitions where memory and time are constraints. We need only print the result so it is not necessary to store complete result.

1) Start from leftmost digit, i = 0
2) Find value of (arr[i]*x + carry)%10 and update carry  
3) If current digit becomes 0, then delete that digit from array
4) Decrement size

Following is implementation of above idea.
*/
// A function to multiply x with the number represented by arr[]. 
// This function uses simple school mathematics for multiplication. 
// This function may value of result if the input integer has 
// more digits
int multiply(int x, vector<int> &arr) { 
    int carry = 0; // Initialize carry 

    // One by one multiply n with individual digits of res[] 
    for (int i=0; i < arr.size(); i++) { 
        int prod = arr[i] * x + carry; 
        // Store last digit of 'prod' in res[] 
        arr[i] = prod % 10;  

        // Put rest in carry 
        carry  = prod/10;    
    } 

    // Put carry in res and increase result size 
    while (carry) { 

llama_print_timings:        load time =   69713.02 ms
llama_print_timings:      sample time =      32.47 ms /   400 runs   (    0.08 ms per token, 12320.20 tokens per second)
llama_print_timings: prompt eval time =     597.63 ms /     9 tokens (   66.40 ms per token,    15.06 tokens per second)
llama_print_timings:        eval time =   45779.20 ms /   399 runs   (  114.73 ms per token,     8.72 tokens per second)
llama_print_timings:       total time =   46466.62 ms /   408 tokens
ggml_metal_free: deallocating
Log end
Screenshot 2024-02-04 at 13 37 46

Reruns use cached RAM, so load time drops sharply on the second run:

Screenshot 2024-02-04 at 13 43 57
llama_print_timings:        load time =    2994.07 ms
llama_print_timings:      sample time =      19.76 ms /   249 runs   (    0.08 ms per token, 12603.77 tokens per second)
llama_print_timings: prompt eval time =     595.16 ms /     9 tokens (   66.13 ms per token,    15.12 tokens per second)
llama_print_timings:        eval time =   28286.44 ms /   248 runs   (  114.06 ms per token,     8.77 tokens per second)
llama_print_timings:       total time =   28935.86 ms /   257 tokens
ggml_metal_free: deallocating
obriensystems commented 9 months ago

Access granted from Meta https://huggingface.co/meta-llama/Llama-2-70b/tree/main

Google C4 dataset (800G) of processed "common crawl" - https://github.com/allenai/allennlp/discussions/5056 https://huggingface.co/spaces/optimum/llm-perf-leaderboard

https://huggingface.co/YokaiKoibito/falcon-40b-GGUF/tree/main

obriensystems commented 9 months ago

Compare M1Max and M2Ultra
14G https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/blob/main/llama-2-13b-chat.Q8_0.gguf

obriensystems commented 9 months ago

M2Ultra

michaelobrien@MichaelacStudio llama.cpp % ./main -m models/llama-2-13b-chat.Q8_0.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1707105596
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 49152.00)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      Metal buffer size = 13023.86 MiB
llm_load_tensors:        CPU buffer size =   166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   400.00 MiB, (13425.48 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    11.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (13425.50 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    82.52 MiB, (13508.00 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =    82.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    11.00 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 factorial function in java 21 Apr 2016
The factorial of a number is the product of that number with all the preceding integers. For example, the factorial of 4 is 4 x 3 x 2 x 1 = 24. In Java, you can write a function to calculate the factorial of an integer using recursion or loops. Here's an example of how to do it using recursion:

public static int factorial(int n) { if (n == 0) { return 1; // base case } else { return n * factorial(n-1); // recursive case } }

Explanation:

* The function takes an integer `n` as input.
* If `n` is 0, the function returns 1 (the base case).
* Otherwise, it calculates the factorial of `n-1` using the recursive call `factorial(n-1)`, and then multiplies the result by `n`.

Here's an example of how to use this function:

System.out.println(factorial(5)); // prints 120

This will calculate the factorial of 5, which is 5 x 4 x 3 x 2 x 1 = 120.

You can also write a loop-based implementation of the factorial function, like this:

public static int factorial(int n) { int result = 1; // initialization for (int i = 1; i <= n; i++) { result *= i; } return result; }

Explanation:

* The function takes an integer `n` as input.
* It initializes a variable `result` to
llama_print_timings:        load time =   19859.64 ms
llama_print_timings:      sample time =      33.11 ms /   400 runs   (    0.08 ms per token, 12081.67 tokens per second)
llama_print_timings: prompt eval time =     110.85 ms /     9 tokens (   12.32 ms per token,    81.19 tokens per second)
llama_print_timings:        eval time =   10988.76 ms /   399 runs   (   27.54 ms per token,    36.31 tokens per second)
llama_print_timings:       total time =   11174.67 ms /   408 tokens
ggml_metal_free: deallocating
Log end
obriensystems commented 9 months ago

M1Max

michaelobrien@mbp7 llama.cpp % ./main -m models/llama-2-13b-chat.Q8_0.gguf -p "factorial function in java 21" -n 400 -e
Log start
main: build = 2050 (19122117)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
main: seed  = 1707105556
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      Metal buffer size = 13023.86 MiB
llm_load_tensors:        CPU buffer size =   166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/michaelobrien/wse_github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   400.00 MiB, (13424.61 / 21845.34)
llama_kv_cache_init:      Metal KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    11.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (13424.62 / 21845.34)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    82.52 MiB, (13507.12 / 21845.34)
llama_new_context_with_model:      Metal compute buffer size =    82.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    11.00 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 factorial function in java 21 Jan 06, 2012
Recursive Factorial Function in Java 5 examples
A recursive factorial function in Java is a function that calls itself repeatedly until it reaches the base case. Here are five examples of how to implement a recursive factorial function in Java:
Example 1: Basic Recursive Factorial Function
public static int factorial(int n) {
if (n == 0) {
return 1; // base case
else {
return n * factorial(n-1); // recursive call
}
This function takes an integer `n` as input and returns its factorial. The function checks if `n` is equal to 0, in which case the result is simply 1. Otherwise, it calculates the product of `n` and the factorial of `n-1`, which is calculated recursively.
Example 2: Factorial Function with Memoization
public static int factorial(int n) {
if (n == 0) {
return 1; // base case
else if (memoizedFactorials != null && memoizedFactorials.containsKey(n)) {
return memoizedFactorials.get(n); // use memorized value
else {
int result = n * factorial(n-1); // calculate and remember the result
memoizedFactorials.put(n, result); // cache the result
}
This function is similar to the previous example, but it uses memoization to store previously calculated values of the factorial. This can improve the performance of the function if `n` is large and the function is called many times with smaller values of `n`.
Example 3: Factorial Function with Dynamic Programming
public static int factorial(int n) {
if (n ==
llama_print_timings:        load time =   15574.77 ms
llama_print_timings:      sample time =      36.15 ms /   400 runs   (    0.09 ms per token, 11066.23 tokens per second)
llama_print_timings: prompt eval time =     161.53 ms /     9 tokens (   17.95 ms per token,    55.72 tokens per second)
llama_print_timings:        eval time =   18173.40 ms /   399 runs   (   45.55 ms per token,    21.96 tokens per second)
llama_print_timings:       total time =   18437.33 ms /   408 tokens
ggml_metal_free: deallocating
Log end

M1Max (32-core GPU)

ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   400.00 MiB, (13424.61 / 21845.34)
llama_print_timings:      sample time =      36.15 ms /   400 runs   (    0.09 ms per token, 11066.23 tokens per second)
llama_print_timings: prompt eval time =     161.53 ms /     9 tokens (   17.95 ms per token,    55.72 tokens per second)
llama_print_timings:        eval time =   18173.40 ms /   399 runs   (   45.55 ms per token,    21.96 tokens per second)

M2Ultra (60-core GPU)

ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 13023.86 MiB, (13023.92 / 49152.00)
llama_print_timings:      sample time =      33.11 ms /   400 runs   (    0.08 ms per token, 12081.67 tokens per second)
llama_print_timings: prompt eval time =     110.85 ms /     9 tokens (   12.32 ms per token,    81.19 tokens per second)
llama_print_timings:        eval time =   10988.76 ms /   399 runs   (   27.54 ms per token,    36.31 tokens per second)

The M2Ultra is about 65% faster on eval throughput than the M1Max, with 88% more GPU cores (60 vs 32).
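Back-of-the-envelope check of those two figures (a sketch only; the tokens-per-second numbers are copied from the llama_print_timings eval lines above):

# Sketch: recompute the speedup and core-count delta from the numbers above.
m1max_eval_tps = 21.96    # M1 Max (32-core GPU) eval tokens per second
m2ultra_eval_tps = 36.31  # M2 Ultra (60-core GPU) eval tokens per second

speedup = m2ultra_eval_tps / m1max_eval_tps - 1  # ~0.65 -> ~65% faster eval
extra_cores = 60 / 32 - 1                        # 0.875 -> ~88% more GPU cores
print(f"{speedup:.0%} faster eval, {extra_cores:.0%} more GPU cores")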

obriensystems commented 9 months ago

review/comment/thank-you to https://github.com/ggerganov/llama.cpp/discussions/4167

obriensystems commented 9 months ago

article on

Add the server command:

michaelobrien@mbp7 llama.cpp % ./server -m models/llama-2-13b-chat.Q8_0.gguf  -c 4096
available slots:
 -> Slot 0 - max context: 4096
{"timestamp":1707182485,"level":"INFO","function":"main","line":2555,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52827,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1707182515,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52825,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 15 tokens
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =     958.18 ms /    15 tokens (   63.88 ms per token,    15.65 tokens per second)
print_timings:        eval time =    3295.68 ms /    73 runs   (   45.15 ms per token,    22.15 tokens per second)
print_timings:       total time =    4253.86 ms
slot 0 released (88 tokens in cache)
{"timestamp":1707182559,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52830,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 76]
slot 0 : in cache: 86 tokens | to process: 15 tokens
slot 0 : kv cache rm - [86, end)

print_timings: prompt eval time =     862.64 ms /    15 tokens (   57.51 ms per token,    17.39 tokens per second)
print_timings:        eval time =    3382.12 ms /    74 runs   (   45.70 ms per token,    21.88 tokens per second)
print_timings:       total time =    4244.76 ms
slot 0 released (175 tokens in cache)
{"timestamp":1707182582,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52835,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 153]
slot 0 : in cache: 173 tokens | to process: 17 tokens
slot 0 : kv cache rm - [173, end)

print_timings: prompt eval time =     667.33 ms /    17 tokens (   39.25 ms per token,    25.47 tokens per second)
print_timings:        eval time =    6123.09 ms /   133 runs   (   46.04 ms per token,    21.72 tokens per second)
print_timings:       total time =    6790.41 ms
slot 0 released (323 tokens in cache)
{"timestamp":1707182702,"level":"INFO","function":"log_server_request","line":2375,"message":"request","remote_addr":"127.0.0.1","remote_port":52843,"status":200,"method":"POST","path":"/completion","params":{}}

michaelobrien@mbp7 tensorflow % curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "describe quantum computing at Google","n_predict": 2}'
{"content":"\n\n","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"models/llama-2-13b-chat.Q8_0.gguf","n_ctx":4096,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":true,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"models/llama-2-13b-chat.Q8_0.gguf","prompt":"describe quantum computing at Google","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":45.216,"predicted_n":2,"predicted_per_second":44.232130219391365,"predicted_per_token_ms":22.608,"prompt_ms":611.727,"prompt_n":6,"prompt_per_second":9.808296838295515,"prompt_per_token_ms":101.9545},"tokens_cached":7,"tokens_evaluated":6,"tokens_predicted":2,"truncated":false}%     
Screenshot 2024-02-05 at 20 26 18