ObrienlabsDev / machine-learning

Machine Learning - AI - Tensorflow - Keras - NVidia - Google

llama.cpp on Nvidia RTX-3500, RTX-A4500 dual, RTX-4090 dual #10

Open obriensystems opened 9 months ago

obriensystems commented 9 months ago

see #7

Test: git clone https://github.com/ggerganov/llama.cpp. Model: https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF (file: https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF/blob/main/capybarahermes-2.5-mistral-7b.Q8_0.gguf)

Using w64devkit (https://github.com/skeeto/w64devkit/releases) on a Lenovo P1 Gen6 with an RTX-3500 Ada 12 GB.
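For reproducibility, the GGUF can also be fetched with the huggingface_hub client instead of a browser download; a minimal sketch (assumes `pip install huggingface_hub`; the repo id and filename come from the links above):

```python
# Download the Q8_0 GGUF into llama.cpp's models/ directory via huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF",
    filename="capybarahermes-2.5-mistral-7b.Q8_0.gguf",
    local_dir="models",
)
print(path)
```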

C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -n 400 -e
Log start
main: build = 2060 (5ed26e1f)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed  = 1707279545
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/capybarahermes-2.5-mistral-7b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = argilla_capybarahermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name     = argilla_capybarahermes-2.5-mistral-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  7338.66 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    79.20 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 describe quantum computing at Google

In a new paper published in Nature, researchers from Google have announced that they have achieved “quantum supremacy” for the first time ever. What does this mean? Well, it’s a pretty big deal in the world of science and technology. Let me explain!

Quantum computing is a type of computing that uses quantum mechanics to process information. Quantum computers can perform certain calculations much faster than classical computers because they use qubits instead of bits. A bit can be either 0 or 1, while a qubit can be both 0 and 1 simultaneously thanks to a phenomenon called superposition. This allows quantum computers to solve complex problems that would take classical computers an unrealistic amount of time to solve.

Google’s new paper describes how they were able to use their 53-qubit Sycamore processor to perform a specific type of calculation in just 200 seconds, while the researchers estimate that it would take the most powerful supercomputers thousands of years to perform the same calculation. This is what Google calls “quantum supremacy” – a situation where a quantum computer can solve a problem that a classical computer simply cannot solve within a reasonable amount of time.

The specific calculation that Google used in their experiment is called a random circuit, which involves creating a large number of random operations on the qubits and then measuring the final state of the system to see if it matches a particular pattern. This type of calculation is not particularly useful on its own, but it does provide a way to measure how much computational power a quantum computer has compared to classical computers.

The achievement of quantum supremacy is significant because it shows that quantum computers can indeed outperform classical computers in certain situations. It also represents a major milestone in the development of quantum computing technology, which could have profound implications for fields like cryptography, machine learning, and materials science. However, it should be noted
llama_print_timings:        load time =    2513.29 ms
llama_print_timings:      sample time =     125.45 ms /   400 runs   (    0.31 ms per token,  3188.60 tokens per second)
llama_print_timings: prompt eval time =     636.99 ms /     6 tokens (  106.17 ms per token,     9.42 tokens per second)
llama_print_timings:        eval time =   91047.02 ms /   399 runs   (  228.19 ms per token,     4.38 tokens per second)
llama_print_timings:       total time =   92287.53 ms /   405 tokens
Log end
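As a sanity check on the log above, the reported "KV self size = 64.00 MiB" follows directly from the printed model metadata; a quick back-of-the-envelope verification (the 2-byte element size is taken from the "K (f16), V (f16)" line):

```python
# Verify "KV self size = 64.00 MiB" from the llm_load_print_meta values in the log above.
# K and V caches, f16 elements (2 bytes each), per-layer KV width n_embd_k_gqa / n_embd_v_gqa.
n_layer = 32        # llm_load_print_meta: n_layer
n_ctx = 512         # llama_new_context_with_model: n_ctx
n_embd_kv = 1024    # llm_load_print_meta: n_embd_k_gqa (= n_embd_v_gqa)
bytes_f16 = 2

kv_bytes = 2 * n_layer * n_ctx * n_embd_kv * bytes_f16   # factor 2 = K cache + V cache
print(kv_bytes / 2**20, "MiB")   # 64.0 MiB; the 70B runs below (n_layer = 80) give 160 MiB
```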

Trying for GPU:
C:/wse_github/llama.cpp $ nvidia-smi
Tue Feb  6 23:23:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.84                 Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 3500 Ada Gene...  WDDM  | 00000000:01:00.0 Off |                  Off |
| N/A   46C    P3              20W /  91W |      0MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

C:/wse_github/llama.cpp $ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Windows_NT
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -Wdouble-promotion
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS: -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib
I CC:        cc (GCC) 13.2.0
I CXX:       x86_64-w64-mingw32-g++ (GCC) 13.2.0
I NVCC:      Build cuda_12.3.r12.3/compiler.33281558_0

nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move  -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds" -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
make: *** [Makefile:430: ggml-cuda.o] Error 1
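The nvcc failure here is most likely the unquoted space in the "C:/Program Files/..." include path, which splits the argument list so nvcc sees more than one input file. Later comments work around this with 8.3 short paths (C:/Progra~1/...); a small Windows-only helper to generate those, as a sketch (assumes 8.3 name generation is enabled on the volume):

```python
# Sketch: convert a long Windows path to its space-free 8.3 short form,
# e.g. "C:\Program Files\..." -> "C:\PROGRA~1\...", for use in -I / -L flags.
import ctypes

def short_path(long_path: str) -> str:
    buf = ctypes.create_unicode_buffer(260)
    ctypes.windll.kernel32.GetShortPathNameW(long_path, buf, 260)
    return buf.value

print(short_path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3"))
```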
obriensystems commented 9 months ago

About 49 GB of the 64 GB of CPU RAM used - RTX-3500 Lenovo P1 Gen6 (13800H): https://huggingface.co/TheBloke/CodeLlama-70B-hf-GGUF/blob/main/codellama-70b-hf.Q5_K_M.gguf

C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e
Log start
main: build = 2060 (5ed26e1f)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed  = 1707281416
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-70b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name     = codellama_codellama-70b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors:        CPU buffer size = 46494.67 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   158.40 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 binary tree in java
==============

I have created this project to demonstrate how one could implement a Binary Tree in Java.  This is for educational purposes only and is not intended for production use!

I have included the JUnit tests I used to help verify functionality while developing, but they are not comprehensive by any means.

Usage
-----------

The following code will create a tree with root node of value 42 and add two children (19 and 63):

```java
        BinaryTree tree = new BinaryTree();
        tree.add(42);
        tree.add(19);
        tree.add(63);
```

You can also pass in an array of values to populate the tree:

        int[] values = { 4, 5, 7, 8 };
        BinaryTree tree = new BinaryTree(values);

License

This project is released under the MIT license. See LICENSE for more details. [end of text]

llama_print_timings:        load time =   14205.60 ms
llama_print_timings:      sample time =      56.09 ms /   228 runs   (    0.25 ms per token,  4065.26 tokens per second)
llama_print_timings: prompt eval time =    5812.95 ms /     5 tokens (  1162.59 ms per token,     0.86 tokens per second)
llama_print_timings:        eval time =  325544.97 ms /   227 runs   (  1434.12 ms per token,     0.70 tokens per second)
llama_print_timings:       total time =  331648.96 ms /   232 tokens
Log end

Second run of the same command (main: seed = 1707315232); the model-load, metadata, and sampling output is identical to the first run above and is elided:

C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e

binary tree in java

// create a node class to store data and pointers to left and right child nodes
class Node{
int data; // holds the key
Node left,right; // pointer to left and right children

public Node(int item){
    data=item;
    left=right=null;
}

}

// create a binary tree class that will provide functions to create and manipulate the tree.
class BinaryTree{
Node root; // pointer to the root node of the tree (global variable)

public void printPostorder(Node node){
    if(node==null)
        return;
    printPostorder(node.left);
    printPostorder(node.right);
    System.out.print(node.data+" ");
}
// method to print the tree in post-order.
public void printInorder(Node node){
    if(node==null)
        return;

    printPostorder(node.left);
    System.out.print(node.data+" ");
    printPostorder(node.right);
}
// method to print the tree in pre-order.
public void printPreorder(Node node){
    if(node==null)
        return;

    System.out.print(node.data+" ");
    printPostorder(node.left);
    printPostorder(node.right);
}

} [end of text]

llama_print_timings:        load time =   12756.12 ms
llama_print_timings:      sample time =      73.23 ms /   331 runs   (    0.22 ms per token,  4520.25 tokens per second)
llama_print_timings: prompt eval time =    5825.50 ms /     5 tokens (  1165.10 ms per token,     0.86 tokens per second)
llama_print_timings:        eval time =  476828.62 ms /   330 runs   (  1444.94 ms per token,     0.69 tokens per second)
llama_print_timings:       total time =  483038.49 ms /   335 tokens

obriensystems commented 9 months ago

i9-13900K, 192 GB RAM (the two RTX-4090s, 2 x 24 GB, are unused so far):

[image]
C:/wse_github/llama.cpp $ ./main.exe -m models/codellama-70b-hf.Q5_K_M.gguf -p "binary tree in java" -n 400 -e
Log start
main: build = 2093 (aa7ab99b)
main: built with cc (GCC) 13.2.0 for x86_64-w64-mingw32
main: seed  = 1707326237
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from models/codellama-70b-hf.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-70b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:  481 tensors
llama_model_loader: - type q6_K:   81 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 45.40 GiB (5.65 BPW)
llm_load_print_meta: general.name     = codellama_codellama-70b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors:        CPU buffer size = 46494.67 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   158.40 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 binary tree in java

```java
package com.test;
import java.util.*;
class BinaryTree{
        Node root=null;
        BinaryTree(int a){
                root=new Node();
                root.data=a;
        }
        void add_node(Node node,int a){
                if(a<node.data){
                        if(node.left!=null)
                                add_node(node.left,a);
                        else{
                                Node newnode=new Node();
                                newnode.data=a;
                                node.left=newnode;
                        }
                }
                if(a>=node.data){
                        if(node.right!=null)
                                add_node(node.right,a);
                        else{
                                Node newnode=new Node();
                                newnode.data=a;
                                node.right=newnode;
                        }
                }
        }
        void inorder(Node node){
                if(node==null) return ;
                inorder(node.left);
                System.out.print(node.data+" ");
                inorder(node.right);
        }
        class Node{
                int data;
                Node left,right=null;
        }
        public static void main(String[] args) {
                BinaryTree bt = new BinaryTree(20);
                bt.add_node(bt.root,5);
                bt.add_node(bt.root,15);
                bt.inorder
```
llama_print_timings:        load time =    6103.69 ms
llama_print_timings:      sample time =      51.71 ms /   400 runs   (    0.13 ms per token,  7735.45 tokens per second)
llama_print_timings: prompt eval time =    2047.35 ms /     5 tokens (  409.47 ms per token,     2.44 tokens per second)
llama_print_timings:        eval time =  400051.72 ms /   399 runs   ( 1002.64 ms per token,     1.00 tokens per second)
llama_print_timings:       total time =  402304.93 ms /   404 tokens
Log end
obriensystems commented 9 months ago

Falcon 40B on CPU needs 80-100 GB of RAM (Falcon 180B needs about 400 GB): https://huggingface.co/tiiuae/falcon-40b and https://huggingface.co/TheBloke/Falcon-180B-GGUF

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| falcon-180b.Q2_K.gguf | Q2_K | 2 | 73.97 GB | 76.47 GB | smallest, significant quality loss - not recommended for most purposes |
| falcon-180b.Q3_K_S.gguf | Q3_K_S | 3 | 77.77 GB | 80.27 GB | very small, high quality loss |
| falcon-180b.Q3_K_M.gguf | Q3_K_M | 3 | 85.18 GB | 87.68 GB | very small, high quality loss |
| falcon-180b.Q3_K_L.gguf | Q3_K_L | 3 | 91.99 GB | 94.49 GB | small, substantial quality loss |
| falcon-180b.Q4_0.gguf | Q4_0 | 4 | 101.48 GB | 103.98 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| falcon-180b.Q4_K_S.gguf | Q4_K_S | 4 | 101.48 GB | 103.98 GB | small, greater quality loss |
| falcon-180b.Q4_K_M.gguf | Q4_K_M | 4 | 108.48 GB | 110.98 GB | medium, balanced quality - recommended |
| falcon-180b.Q5_0.gguf | Q5_0 | 5 | 123.80 GB | 126.30 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| falcon-180b.Q5_K_S.gguf | Q5_K_S | 5 | 123.80 GB | 126.30 GB | large, low quality loss - recommended |
| falcon-180b.Q5_K_M.gguf | Q5_K_M | 5 | 130.99 GB | 133.49 GB | large, very low quality loss - recommended |
| falcon-180b.Q6_K.gguf | Q6_K | 6 | 147.52 GB | 150.02 GB | very large, extremely low quality loss |
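The "Max RAM required" column tracks the file size plus roughly 2.5 GB of working overhead; a tiny helper based on that pattern from the table (an approximation, not an official formula):

```python
# Rough CPU RAM estimate for a GGUF, using the ~2.5 GB overhead the table above implies.
def estimate_max_ram_gb(model_file_gb: float, overhead_gb: float = 2.5) -> float:
    return model_file_gb + overhead_gb

print(estimate_max_ram_gb(147.52))   # falcon-180b.Q6_K.gguf -> ~150.02 GB
```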
obriensystems commented 9 months ago
$ cat falcon-180b.Q6_K.gguf-split-* > falcon-180b.Q6_K.gguf

At ~2.2 GB/s write on a Samsung 990 Pro NVMe it takes about a minute to combine the two parts into one 96 GB file.
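The same join can be done from Python without loading anything into memory; a minimal sketch that streams the split parts into one file in order (filenames follow the split-a/b/c listing below):

```python
# Sketch: concatenate the GGUF split parts (split-a, split-b, split-c) in order,
# streaming 64 MiB chunks so the ~140 GB of data never has to fit in RAM.
import glob

parts = sorted(glob.glob("falcon-180b.Q6_K.gguf-split-*"))
with open("falcon-180b.Q6_K.gguf", "wb") as out:
    for part in parts:
        with open(part, "rb") as src:
            while chunk := src.read(64 * 1024 * 1024):
                out.write(chunk)
```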

Take out the -ngl 64 option:


C:/wse_github/llama.cpp $ ./main.exe -m models/falcon-180b.Q6_K.gguf -p "show use of lambda in java search" -n 400 -e --color -t 16

We go up to about 110 GB of RAM but then drop out.
llm_load_print_meta: model size       = 137.38 GiB (6.57 BPW)

All 3 parts (a/b/c) total about 150 GB on SSD, and roughly 140 GB of RAM is needed.

-rw-r--r-- 1 michael 197121 147516218272 Feb 10 18:53 falcon-180b.Q6_K.gguf
-rw-r--r-- 1 michael 197121  49172072767 Feb 10 18:37 falcon-180b.Q6_K.gguf-split-a
-rw-r--r-- 1 michael 197121  49172072767 Feb 10 18:39 falcon-180b.Q6_K.gguf-split-b
-rw-r--r-- 1 michael 197121  49172072738 Feb 10 18:40 falcon-180b.Q6_K.gguf-split-c

160 of 192 GB RAM in use at 91% CPU on the 13900K.

I think I need segment c as well: 96 GB != 137 GB.

obriensystems commented 9 months ago

On the 13800H (P1 Gen6):

llama_print_timings:        load time =    2078.64 ms
llama_print_timings:      sample time =     107.02 ms /   400 runs   (    0.27 ms per token,  3737.72 tokens per second)
llama_print_timings: prompt eval time =     450.00 ms /     6 tokens (   75.00 ms per token,    13.33 tokens per second)
llama_print_timings:        eval time =   75482.42 ms /   399 runs   (  189.18 ms per token,     5.29 tokens per second)
llama_print_timings:       total time =   76450.07 ms /   405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -t 16 -n 400 -e

llama_print_timings:        load time =    1963.86 ms
llama_print_timings:      sample time =      88.87 ms /   292 runs   (    0.30 ms per token,  3285.81 tokens per second)
llama_print_timings: prompt eval time =     772.40 ms /     6 tokens (  128.73 ms per token,     7.77 tokens per second)
llama_print_timings:        eval time =   69189.03 ms /   291 runs   (  237.76 ms per token,     4.21 tokens per second)
llama_print_timings:       total time =   70424.35 ms /   297 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf -p "describe quantum computing at Google" -t 8 -n 400 -e --color

On the 13900K desktop:

llama_print_timings:        load time =    1411.46 ms
llama_print_timings:      sample time =      59.34 ms /   400 runs   (    0.15 ms per token,  6741.27 tokens per second)
llama_print_timings: prompt eval time =     406.02 ms /     6 tokens (   67.67 ms per token,    14.78 tokens per second)
llama_print_timings:        eval time =  102460.70 ms /   399 runs   (  256.79 ms per token,     3.89 tokens per second)
llama_print_timings:       total time =  103148.10 ms /   405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf  -p "describe quantum computing at google" -n 400 -e --color -t 16

llama_print_timings:        load time =    1719.92 ms
llama_print_timings:      sample time =      59.50 ms /   400 runs   (    0.15 ms per token,  6723.03 tokens per second)
llama_print_timings: prompt eval time =     602.42 ms /     6 tokens (  100.40 ms per token,     9.96 tokens per second)
llama_print_timings:        eval time =  119168.22 ms /   399 runs   (  298.67 ms per token,     3.35 tokens per second)
llama_print_timings:       total time =  120069.83 ms /   405 tokens
Log end
C:/wse_github/llama.cpp $ ./main.exe -m models/capybarahermes-2.5-mistral-7b.Q8_0.gguf  -p "describe quantum computing at google" -n 400 -e --color -t 8

Why is this run slower?
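If the slower run is the one with the higher thread count, one likely explanation is that threads beyond the P-core count land on E-cores (or SMT siblings) on these hybrid Intel parts, which can hurt the memory-bound eval phase. A quick way to sweep thread counts and compare, as a sketch (binary, model path, and prompt are the ones used above; the timing lines are assumed to appear on stderr as in the logs):

```python
# Sketch: sweep llama.cpp -t values and pull "tokens per second" from the eval timing line.
import re
import subprocess

MODEL = "models/capybarahermes-2.5-mistral-7b.Q8_0.gguf"
for t in (4, 6, 8, 10, 16):
    proc = subprocess.run(
        ["./main.exe", "-m", MODEL, "-p", "describe quantum computing at Google",
         "-n", "64", "-t", str(t)],
        capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    m = re.search(r"llama_print_timings:\s+eval time =.*?([\d.]+) tokens per second", out)
    print(f"-t {t}: {m.group(1) if m else 'n/a'} tokens/s")
```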

obriensystems commented 9 months ago

CUDA on llama.cpp https://github.com/ggerganov/llama.cpp/issues/1470

Adjusting the ENV variable works well (either the full path or the shortened copy -LC:/Progra~1/NVIDIA~1/CUDA/v12.3/targets/x86_64-linux/lib) until the next failure:

nvcc fatal   : Cannot find compiler 'cl.exe' in PATH
make: *** [Makefile:430: ggml-cuda.o] Error 1

Fix: add the MSVC host compiler (cl.exe) directory to PATH:

C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\bin\Hostx64\x64

Making progress; the next error:

nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -D_WIN32_WINNT=0x602 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -IC:/opt/CUDA/v12.3/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Xassembler -muse-unaligned-vector-move  -O3 -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds" -c ggml-cuda.cu -o ggml-cuda.o
nvcc warning : The -std=c++11 flag is not supported with the configured host compiler. Flag will be ignored.
ggml-cuda.cu
cl : Command line error D8021 : invalid numeric argument '/Wno-array-bounds'
make: *** [Makefile:430: ggml-cuda.o] Error 2

Using https://github.com/obrienlabs/CUDA-Programs/tree/main/Chapter01/gpusum as a reference, part of the book "Programming in Parallel with CUDA" by Richard Ansorge of the University of Cambridge (https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C); see https://github.com/ObrienlabsDev/blog/issues/1.

obriensystems commented 9 months ago

Revisiting llama.cpp for NVIDIA GPUs:

make clean && LLAMA_CUBLAS=1 make -j
Makefile:604: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH.  Stop.

C:/wse_github/llama.cpp $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
C:/wse_github/llama.cpp $ CUDA_DOCKER_ARCH=12.2

Try 8.3 path escaping:
 -IC:/Progra~1/NVIDIA~2/CUDA/v12.2/targets/x86_64-linux/include

look at https://github.com/abetlen/llama-cpp-python/discussions/871
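The linked discussion covers the Python bindings; for comparison, a minimal llama-cpp-python sketch with GPU offload (assumes `pip install llama-cpp-python` built with cuBLAS/CUDA support; the model path matches the file used above):

```python
# Sketch: run the same GGUF through llama-cpp-python with all layers offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/capybarahermes-2.5-mistral-7b.Q8_0.gguf",
    n_gpu_layers=33,   # offload everything; lower this if VRAM is tight
    n_ctx=2048,
)
out = llm("describe quantum computing at Google", max_tokens=200)
print(out["choices"][0]["text"])
```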

obriensystems commented 9 months ago
pip install accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM

access_token='hf_cfTP...QqH'

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=access_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", token=access_token)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
model.safetensors.index.json: 100%|████████████████████████████████████████████████████████| 13.5k/13.5k [00:00<00:00, 13.5MB/s]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\huggingface_hub\file_download.py:149: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\michael\.cache\huggingface\hub\models--google--gemma-2b. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
model-00001-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 4.95G/4.95G [00:48<00:00, 103MB/s]
model-00002-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 67.1M/67.1M [00:00<00:00, 107MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.37s/it]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 274kB/s]
Traceback (most recent call last):
  File "C:\wse_github\obrienlabsdev\machine-learning\gemma\gemma-gpu.py", line 9, in <module>
    input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                    ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\torch\cuda\__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
(base)
obriensystems commented 9 months ago

https://pytorch.org/get-started/locally/

PyTorch wheels are built for CUDA 12.1 (cu121), not 12.2:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
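After installing the cu121 wheels, a quick sanity check that the new torch build actually sees the GPUs (standard torch calls):

```python
# Confirm the reinstalled torch is a CUDA build and can see the RTX GPU(s).
import torch

print(torch.__version__, torch.version.cuda)   # expect a +cu121 build / 12.1
print(torch.cuda.is_available())               # should now be True
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```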

Working, but no real output yet:
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.03s/it]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\generation\utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma\modeling_gemma.py:555: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars?

Answer:

Step 1/5
obriensystems commented 9 months ago

Checking generation length: the warning above ("Using the model-agnostic default max_length (=20) to control the generation length") explains why the output stops early. Set max_new_tokens explicitly:

outputs = model.generate(**input_ids, max_new_tokens=1000)

Working:

michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.03s/it]
C:\Users\michael\AppData\Roaming\Python\Python311\site-packages\transformers\models\gemma\modeling_gemma.py:555: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
<bos>how is gold made in collapsing neutron stars?

Answer:

Step 1/5
1. The collapse of a neutron star can lead to the formation of a black hole.

Step 2/5
2. The black hole can then evaporate through Hawking radiation, releasing energy in the form of photons and neutrinos.

Step 3/5
3. The energy released by the black hole can be used to power a gold-making machine.

Step 4/5
4. The gold-making machine can be powered by the energy released by the black hole, which can be used to extract gold from the black hole's debris.

Step 5/5
5. The gold-making machine can then be used to produce gold for human consumption.<eos>
(base)

Using this script (gemma-gpu.py):

from transformers import AutoTokenizer, AutoModelForCausalLM

access_token='hf_cfTP...KXXCQqH'

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=access_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", token=access_token)

input_text = "how is gold made in collapsing neutron stars"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))
obriensystems commented 9 months ago

Python pip summary for the dual RTX-4090 box running CUDA 12.2:

332  cd machine-learning/
  335  mkdir gemma
  337  vi gemma-cpu.py
  339  pip install -U transformers
  352  pip install -U torch
  353  python gemma-cpu.py
  355  nvcc --version
  364  pip install accelerate
  366  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  368  python gemma-gpu.py
obriensystems commented 9 months ago

[image] [image]

obriensystems commented 9 months ago

[image]

[image]

obriensystems commented 9 months ago

gemma-7b on the dual RTX-4090 (Suprim Liquid) build with 2 x 24 GB = 48 GB VRAM:

[image]

[image]

[image]

Running at about 20% TDP (around 100 W of the 400 W max per card) due to the lack of NVLink on Ada-class GPUs.
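With device_map="auto", accelerate shards the gemma-7b weights across both 4090s; a small sketch to inspect where the layers landed and how much VRAM each card holds (hf_device_map and the torch memory counters are standard attributes; the token is a placeholder):

```python
# Sketch: load gemma-7b sharded across both RTX-4090s and report layer placement / VRAM use.
import torch
from transformers import AutoModelForCausalLM

access_token = "hf_..."  # placeholder for the Hugging Face token used above

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b", device_map="auto", torch_dtype=torch.float16, token=access_token)

print(model.hf_device_map)   # which layers were placed on cuda:0 vs cuda:1
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB allocated")
```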

obriensystems commented 3 months ago

llama-server

on RTX-3500

C:/wse_github/llama.cpp $ make llama-server

C:/wse_github/llama.cpp $ ./llama-server.exe -m /models/capybarahermes-2.5-mistral-7b.Q8_0.gguf    -p "describe quantum computing at Google" -c 2048 -t 10 -n 1000 -e --color

micha@p1gen6 MINGW64 ~
$ curl --request POST     --url http://localhost:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "describe quantum computing at Google","n_predict": 128}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1996  100  1929  100    67     31      1  0:01:07  0:01:00  0:00:07   467{"content":", discuss quantum supremacy and quantum advantage, and outline some potential applications of quantum computing.\n\nQuantum computing is an emerging technology that has the potential to revolutionize the way we process information. Unlike classical computers, which use bits that can be either 0 or 1, quantum computers use quantum bits or qubits that can be 0, 1, or both at the same time. This allows quantum computers to perform certain calculations exponentially faster than classical computers.\n\nGoogle has been at the forefront of quantum computing research, and in 2019 they achieved a major milestone called quantum suprem","id_slot":0,"stop":true,"model":"/models/capybarahermes-2.5-mistral-7b.Q8_0.gguf","tokens_predicted":128,"tokens_evaluated":6,"generation_settings":{"n_ctx":2048,"n_predict":1000,"model":"/models/capybarahermes-2.5-mistral-7b.Q8_0.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typical_p","top_p","min_p","temperature"]},"prompt":"describe quantum computing at Google","truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":133,"timings":{"prompt_n":6,"prompt_ms":529.226,"prompt_per_token_ms":88.20433333333334,"prompt_per_second":11.337311469958014,"predicted_n":128,"predicted_ms":59840.864,"predicted_per_token_ms":467.50675,"predicted_per_second":2.1390065491033017}}
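The same /completion endpoint can be hit from Python; a minimal sketch mirroring the curl call above (assumes `pip install requests` and llama-server listening on the default port 8080; the content and timings keys are taken from the JSON response shown above):

```python
# POST the same prompt to the llama-server /completion endpoint as the curl example above.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "describe quantum computing at Google", "n_predict": 128},
    timeout=300,
)
data = resp.json()
print(data["content"])
print(data["timings"]["predicted_per_second"], "predicted tokens/s")
```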