obriensystems opened 9 months ago
CodeLlama-70B Q5_K_M uses ~49G of RAM on CPU (64G machine) - RTX-3500 Lenovo P1 Gen 6 13800H. Model: https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q5_K_M.gguf
C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e
Log start
main: build = 2060 (5ed26e1f)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed = 1707281416
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-70b-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = codellama_codellama-70b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors: CPU buffer size = 46494.67 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 17.01 MiB
llama_new_context_with_model: CPU compute buffer size = 158.40 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
binary tree in java
==============
I have created this project to demonstrate how one could implement a Binary Tree in Java. This is for educational purposes only and is not intended for production use!
I have included the JUnit tests I used to help verify functionality while developing, but they are not comprehensive by any means.
Usage
-----------
The following code will create a tree with root node of value 42 and add two children (19 and 63):
```java
BinaryTree tree = new BinaryTree();
tree.add(42);
tree.add(19);
tree.add(63);
```
You can also pass in an array of values to populate the tree:
```java
int[] values = { 4, 5, 7, 8 };
BinaryTree tree = new BinaryTree(values);
```
This project is released under the MIT license. See LICENSE for more details. [end of text]
llama_print_timings: load time = 14205.60 ms
llama_print_timings: sample time = 56.09 ms / 228 runs ( 0.25 ms per token, 4065.26 tokens per second)
llama_print_timings: prompt eval time = 5812.95 ms / 5 tokens ( 1162.59 ms per token, 0.86 tokens per second)
llama_print_timings: eval time = 325544.97 ms / 227 runs ( 1434.12 ms per token, 0.70 tokens per second)
llama_print_timings: total time = 331648.96 ms / 232 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e
Log start
main: build = 2060 (5ed26e1f)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed = 1707315232
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-70b-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = codellama_codellama-70b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
binary tree in java
```java
// create a node class to store data and pointers to left and right child nodes
class Node{
    int data; // holds the key
    Node left,right; // pointer to left and right children
    public Node(int item){
        data=item;
        left=right=null;
    }
}
// create a binary tree class that will provide functions to create and manipulate the tree.
class BinaryTree{
    Node root; // pointer to the root node of the tree (global variable)
    public void printPostorder(Node node){
        if(node==null)
            return;
        printPostorder(node.left);
        printPostorder(node.right);
        System.out.print(node.data+" ");
    }
    // method to print the tree in post-order.
    public void printInorder(Node node){
        if(node==null)
            return;
        printPostorder(node.left);
        System.out.print(node.data+" ");
        printPostorder(node.right);
    }
    // method to print the tree in pre-order.
    public void printPreorder(Node node){
        if(node==null)
            return;
        System.out.print(node.data+" ");
        printPostorder(node.left);
        printPostorder(node.right);
    }
}
```
[end of text]
llama_print_timings: load time = 12756.12 ms
llama_print_timings: sample time = 73.23 ms / 331 runs ( 0.22 ms per token, 4520.25 tokens per second)
llama_print_timings: prompt eval time = 5825.50 ms / 5 tokens ( 1165.10 ms per token, 0.86 tokens per second)
llama_print_timings: eval time = 476828.62 ms / 330 runs ( 1444.94 ms per token, 0.69 tokens per second)
llama_print_timings: total time = 483038.49 ms / 335 tokens
i9-13900K desktop, 192G RAM (two RTX-4090 24G GPUs, unused so far):
C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e
Log start
main: build = 2093 (aa7ab99b)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed = 1707326237
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-70b-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name = codellama_codellama-70b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors: CPU buffer size = 46494.67 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 17.01 MiB
llama_new_context_with_model: CPU compute buffer size = 158.40 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
binary tree in java
```java
package com.test;
import java.util.*;
class BinaryTree{
Node root=null;
BinaryTree(int a){
root=new Node();
root.data=a;
}
void add_node(Node node,int a){
if(a<node.data){
if(node.left!=null)
add_node(node.left,a);
else{
Node newnode=new Node();
newnode.data=a;
node.left=newnode;
}
}
if(a>=node.data){
if(node.right!=null)
add_node(node.right,a);
else{
Node newnode=new Node();
newnode.data=a;
node.right=newnode;
}
}
}
void inorder(Node node){
if(node==null) return ;
inorder(node.left);
System.out.print(node.data+" ");
inorder(node.right);
}
class Node{
int data;
Node left,right=null;
}
public static void main(String[] args) {
BinaryTree bt = new BinaryTree(20);
bt.add_node(bt.root,5);
bt.add_node(bt.root,15);
bt.inorder
```
llama_print_timings: load time = 6103.69 ms
llama_print_timings: sample time = 51.71 ms / 400 runs ( 0.13 ms per token, 7735.45 tokens per second)
llama_print_timings: prompt eval time = 2047.35 ms / 5 tokens ( 409.47 ms per token, 2.44 tokens per second)
llama_print_timings: eval time = 400051.72 ms / 399 runs ( 1002.64 ms per token, 1.00 tokens per second)
llama_print_timings: total time = 402304.93 ms / 404 tokens
Log end
Falcon 40B on CPU needs 80-100G of RAM (full-precision Falcon 180B needs ~400G; quantized GGUF sizes below): https://huggingface.co/tiiuae/falcon-40b https://huggingface.co/TheBloke/Falcon-180B-GGUF
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
falcon-180b.Q2_K.gguf | Q2_K | 2 | 73.97 GB | 76.47 GB | smallest, significant quality loss - not recommended for most purposes |
falcon-180b.Q3_K_S.gguf | Q3_K_S | 3 | 77.77 GB | 80.27 GB | very small, high quality loss |
falcon-180b.Q3_K_M.gguf | Q3_K_M | 3 | 85.18 GB | 87.68 GB | very small, high quality loss |
falcon-180b.Q3_K_L.gguf | Q3_K_L | 3 | 91.99 GB | 94.49 GB | small, substantial quality loss |
falcon-180b.Q4_0.gguf | Q4_0 | 4 | 101.48 GB | 103.98 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
falcon-180b.Q4_K_S.gguf | Q4_K_S | 4 | 101.48 GB | 103.98 GB | small, greater quality loss |
falcon-180b.Q4_K_M.gguf | Q4_K_M | 4 | 108.48 GB | 110.98 GB | medium, balanced quality - recommended |
falcon-180b.Q5_0.gguf | Q5_0 | 5 | 123.80 GB | 126.30 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
falcon-180b.Q5_K_S.gguf | Q5_K_S | 5 | 123.80 GB | 126.30 GB | large, low quality loss - recommended |
falcon-180b.Q5_K_M.gguf | Q5_K_M | 5 | 130.99 GB | 133.49 GB | large, very low quality loss - recommended |
falcon-180b.Q6_K.gguf | Q6_K | 6 | 147.52 GB | 150.02 GB | very large, extremely low quality loss |
$ cat falcon-180b.Q6_K.gguf-split-* > falcon-180b.Q6_K.gguf
At ~2.2 GB/s write on a Samsung 990 Pro NVMe it takes about a minute to combine two of the splits into one 96G file.
Take out -ngl 64 (run CPU-only):
C:/wse_github/llama.cpp $ ./main.exe -m models/falcon-180b.Q6_K.gguf -p "show use of lambda in java search" -n 400 -e --color -t 16
Memory usage climbs to about 110G, but the run drops out:
llm_load_print_meta: model size = 137.38 GiB (6.57 BPW)
All 3 parts (a/b/c) total ~150G on SSD and need ~140G of RAM:
-rw-r--r-- 1 michael 197121 147516218272 Feb 10 18:53 falcon-180b.Q6_K.gguf
-rw-r--r-- 1 michael 197121 49172072767 Feb 10 18:37 falcon-180b.Q6_K.gguf-split-a
-rw-r--r-- 1 michael 197121 49172072767 Feb 10 18:39 falcon-180b.Q6_K.gguf-split-b
-rw-r--r-- 1 michael 197121 49172072738 Feb 10 18:40 falcon-180b.Q6_K.gguf-split-c
160 of 192G RAM used at 91% CPU on the 13900K.
I think I need segment c as well: 96G != 137G (the combined file has to match the 137.38 GiB model size).
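As a sketch of the same combine step (the script name and chunk size are my own assumptions, not from the logs above), a minimal Python alternative to cat that joins all three splits and checks that the combined size matches the sum of the parts:

```python
# combine_falcon_splits.py - assumed helper, not part of the original logs
# Concatenates the three GGUF split files and verifies the combined size.
import os
import shutil

parts = [
    "falcon-180b.Q6_K.gguf-split-a",
    "falcon-180b.Q6_K.gguf-split-b",
    "falcon-180b.Q6_K.gguf-split-c",
]
output = "falcon-180b.Q6_K.gguf"

expected = sum(os.path.getsize(p) for p in parts)
with open(output, "wb") as out:
    for p in parts:
        with open(p, "rb") as src:
            shutil.copyfileobj(src, out, length=16 * 1024 * 1024)  # copy in 16 MiB chunks

actual = os.path.getsize(output)
print(f"expected {expected} bytes, wrote {actual} bytes, match={expected == actual}")
```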
On the 13800H P1 Gen 6:
llama_print_timings: load time = 2078.64 ms
llama_print_timings: sample time = 107.02 ms / 400 runs ( 0.27 ms per token, 3737.72 tokens per second)
llama_print_timings: prompt eval time = 450.00 ms / 6 tokens ( 75.00 ms per token, 13.33 tokens per second)
llama_print_timings: eval time = 75482.42 ms / 399 runs ( 189.18 ms per token, 5.29 tokens per second)
llama_print_timings: total time = 76450.07 ms / 405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -t 16 -n 400 -e
llama_print_timings: load time = 1963.86 ms
llama_print_timings: sample time = 88.87 ms / 292 runs ( 0.30 ms per token, 3285.81 tokens per second)
llama_print_timings: prompt eval time = 772.40 ms / 6 tokens ( 128.73 ms per token, 7.77 tokens per second)
llama_print_timings: eval time = 69189.03 ms / 291 runs ( 237.76 ms per token, 4.21 tokens per second)
llama_print_timings: total time = 70424.35 ms / 297 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -t 8 -n 400 -e --color
On 13900k desktop
llama_print_timings: load time = 1411.46 ms
llama_print_timings: sample time = 59.34 ms / 400 runs ( 0.15 ms per token, 6741.27 tokens per second)
llama_print_timings: prompt eval time = 406.02 ms / 6 tokens ( 67.67 ms per token, 14.78 tokens per second)
llama_print_timings: eval time = 102460.70 ms / 399 runs ( 256.79 ms per token, 3.89 tokens per second)
llama_print_timings: total time = 103148.10 ms / 405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at google" -n 400 -e --color -t 16
llama_print_timings: load time = 1719.92 ms
llama_print_timings: sample time = 59.50 ms / 400 runs ( 0.15 ms per token, 6723.03 tokens per second)
llama_print_timings: prompt eval time = 602.42 ms / 6 tokens ( 100.40 ms per token, 9.96 tokens per second)
llama_print_timings: eval time = 119168.22 ms / 399 runs ( 298.67 ms per token, 3.35 tokens per second)
llama_print_timings: total time = 120069.83 ms / 405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at google" -n 400 -e --color -t 8
Why is this run slower?
CUDA on llama.cpp https://github.com/ggerganov/llama.cpp/issues/1470
Adjusting the ENV variable works well - the path below, or the shortened copy -LC:/Progra~1/NVIDIA~1/CUDA/v12.3/targets/x86_64-linux/lib - until we hit:
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
make: *** [Makefile:430: ggml-cuda.o] Error 1
fix - add to PATH
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\bin\Hostx64\x64
Still solving - the Makefile passes a GCC-style warning flag (-Wno-array-bounds) to nvcc's host compiler, which MSVC's cl.exe rejects:
nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -IC:/opt/CUDA/v12.3/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -Wno-pedantic -Xcompiler "-Wno-array-bounds" -c ggml-cuda.cu -o ggml-cuda.o
nvcc warning : The -std=c++11 flag is not supported with the configured host compiler. Flag will be ignored.
ggml-cuda.cu
cl : Command line error D8021 : invalid numeric argument '/Wno-array-bounds'
make: *** [Makefile:430: ggml-cuda.o] Error 2
Using https://github.com/obrienlabs/CUDA-Programs/tree/main/Chapter01/gpusum as a reference - part of the book by Richard Ansorge of the University of Cambridge, https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C - see https://github.com/ObrienlabsDev/blog/issues/1
Revisiting llama.cpp for NVIDIA GPUs:
make clean && LLAMA_CUBLAS=1 make -j
Makefile:604: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH. Stop.
C:/wse_github/llama.cpp $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
C:/wse_github/llama.cpp $ CUDA_DOCKER_ARCH=12.2
Try path escaping:
-IC:/Progra~1/NVIDIA~2/CUDA/v12.2/targets/x86_64-linux/include
look at https://github.com/abetlen/llama-cpp-python/discussions/871
pip install accelerate
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

access_token='hf_cfTP...QqH'
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=access_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", token=access_token)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
model.safetensors.index.json: 100%|████████████████████████████████████████████████████████| 13.5k/13.5k [00:00<00:00, 13.5MB/s]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\huggingface_hub\file_download.py:149: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\michael\.cache\huggingface\hub\models--google--gemma-2b. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
model-00001-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 4.95G/4.95G [00:48<00:00, 103MB/s]
model-00002-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 67.1M/67.1M [00:00<00:00, 107MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.37s/it]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 274kB/s]
Traceback (most recent call last):
File "C:\wse_github\obrienlabsdev\machine-learning\gemma\gemma-gpu.py", line 9, in <module>
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py", line 789, in to
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^
File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\torch\cuda\__init__.py", line 293, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
(base)
https://pytorch.org/get-started/locally/
The PyTorch CUDA wheels are for 12.1, not 12.2:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
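A minimal sketch (not part of the original gemma-gpu.py) that guards the .to("cuda") call so the script falls back to CPU instead of raising the AssertionError above when the installed torch wheel has no CUDA support:

```python
# device_check.py - assumed snippet, not from the original script
import torch

# Falls back to CPU when torch was installed without CUDA support,
# avoiding "AssertionError: Torch not compiled with CUDA enabled".
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"torch {torch.__version__}, using device: {device}")

# then, in the script:
# input_ids = tokenizer(input_text, return_tensors="pt").to(device)
```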
Working, but no real output yet:
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.03s/it]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma\modeling_gemma.py:555: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars?
Answer:
Step 1/5
Checking the context length - generation stops early because the model-agnostic default max_length (=20) controls the generation length. Fix by passing max_new_tokens:
outputs = model.generate(**input_ids, max_new_tokens=1000)
Working:
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.03s/it]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma\modeling_gemma.py:555: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars?
Answer:
Step 1/5
1. The collapse of a neutron star can lead to the formation of a black hole.
Step 2/5
2. The black hole can then evaporate through Hawking radiation, releasing energy in the form of photons and neutrinos.
Step 3/5
3. The energy released by the black hole can be used to power a gold-making machine.
Step 4/5
4. The gold-making machine can be powered by the energy released by the black hole, which can be used to extract gold from the black hole's debris.
Step 5/5
5. The gold-making machine can then be used to produce gold for human consumption.<eos>
(base)
Using:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

access_token='hf_cfTP...KXXCQqH'
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=access_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", token=access_token)

input_text = "how is gold made in collapsing neutron stars"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))
```
Python pip summary - dual RTX-4090 running CUDA 12.2:
332 cd machine-learning/
335 mkdir gemma
337 vi gemma-cpu.py
339 pip install -U transformers
352 pip install -U torch
353 python gemma-cpu.py
355 nvcc --version
364 pip install accelerate
366 pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
368 python gemma-gpu.py
gemma-7b on dual RTX-4090 Suprim Liquid at 2 x 24G = 48G VRAM,
running at ~20% TDP (about 100 of 400W max) due to the lack of NVLink on Ada-class GPUs.
On the RTX-3500:
C:/wse_github/llama.cpp $ make llama-server
C:/wse_github/llama.cpp $ ./llama-server.exe -m /models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -c 2048 -t 10 -n 1000 -e --color
micha@p1gen6 MINGW64 ~
$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "describe quantum computing at Google","n_predict": 128}'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1996 100 1929 100 67 31 1 0:01:07 0:01:00 0:00:07 467{"content":", discuss quantum supremacy and quantum advantage, and outline some potential applications of quantum computing.\n\nQuantum computing is an emerging technology that has the potential to revolutionize the way we process information. Unlike classical computers, which use bits that can be either 0 or 1, quantum computers use quantum bits or qubits that can be 0, 1, or both at the same time. This allows quantum computers to perform certain calculations exponentially faster than classical computers.\n\nGoogle has been at the forefront of quantum computing research, and in 2019 they achieved a major milestone called quantum suprem","id_slot":0,"stop":true,"model":"/models/capybarahermes-2.5-mistral-7b.Q8_0.gguf","tokens_predicted":128,"tokens_evaluated":6,"generation_settings":{"n_ctx":2048,"n_predict":1000,"model":"/models/capybarahermes-2.5-mistral-7b.Q8_0.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typical_p","top_p","min_p","temperature"]},"prompt":"describe quantum computing at Google","truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":133,"timings":{"prompt_n":6,"prompt_ms":529.226,"prompt_per_token_ms":88.20433333333334,"prompt_per_second":11.337311469958014,"predicted_n":128,"predicted_ms":59840.864,"predicted_per_token_ms":467.50675,"predicted_per_second":2.1390065491033017}}
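For reference, a minimal Python sketch of the same /completion request (an assumed helper equivalent to the curl call above; the "content" and "timings" field names are taken from the JSON response):

```python
# completion_request.py - assumed equivalent of the curl call above
import json
import urllib.request

payload = {"prompt": "describe quantum computing at Google", "n_predict": 128}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["content"])
print(body["timings"]["predicted_per_second"], "tokens/s")
```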
see #7
Test setup: git clone https://github.com/ggerganov/llama.cpp with model https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF (file https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF/blob/main/capybarahermes-2.5-mistral-7b.Q8_0.gguf),
built using w64devkit on the Lenovo P1 Gen 6 (RTX-3500 12G): https://github.com/skeeto/w64devkit/releases