Additional info: this is a BPE model. There seems to be a general issue with llama.cpp CUDA + BPE models at the moment.
Command to create the FP16 GGUF was:
```sh
python3 ./convert.py --outtype f16 \
  --outfile /workspace/process/baai_aquilachat2-34b/gguf/aquilachat2-34b.fp16.gguf \
  /workspace/process/baai_aquilachat2-34b/source \
  --vocabtype bpe
```
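The Q4_0 file discussed below would then normally be produced from that FP16 GGUF with llama.cpp's quantize tool. The exact command is not shown in this thread, so the following is only a sketch; the output filename is an assumption:

```sh
# Sketch of the usual follow-up step (not taken from this thread).
# Input is the FP16 GGUF produced by convert.py above; the output path is assumed.
./quantize \
  /workspace/process/baai_aquilachat2-34b/gguf/aquilachat2-34b.fp16.gguf \
  /workspace/process/baai_aquilachat2-34b/gguf/aquilachat2-34b.Q4_0.gguf \
  Q4_0
```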
Oh nvm, just realised this is already addressed in this thread: https://github.com/ggerganov/llama.cpp/issues/3740#issuecomment-1784737709
https://github.com/ggerganov/llama.cpp/issues/3740#issuecomment-1783125187 Thank you very much, I also tried this method just now, and it works well!!!
Prerequisites
Please answer the following questions for yourself before submitting an issue.
All yes!
Expected Behavior
I'm using llama.cpp to load AquilaChat2-34B-16K-Q4_0.gguf, and I expect this to let me have a conversation with the model.
Current Behavior
I used the following command with llama.cpp. Loading ggml-model-q4_0.gguf worked without any problem, but loading AquilaChat2-34B-16K-Q4_0.gguf ends with:

```
CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0
```

The command was:

```sh
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
```
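A quick way to check whether the failure is specific to the CUDA path rather than to the GGUF file itself is to run the same command fully on the CPU. This is only a diagnostic sketch, not something reported above; the only change is `-ngl 0`:

```sh
# Diagnostic sketch (not from the original report): offload zero layers so the
# model runs entirely on the CPU. If this works, the problem is in the CUDA path.
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 0
```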
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
```
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-95
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7F72 24-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           2
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3200.0000
    CPU min MHz:         2500.0000
    BogoMIPS:            6400.16
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   1.5 MiB (48 instances)
  L1i:                   1.5 MiB (48 instances)
  L2:                    24 MiB (48 instances)
  L3:                    384 MiB (24 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-23,48-71
  NUMA node1 CPU(s):     24-47,72-95
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```
```
$ uname -a
Linux ps 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version
Python 3.10.13

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
Failure Information (for bugs)
```
CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0
```
Steps to Reproduce
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUBLAS=1
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
```
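If the crash needs to be localized further, a hedged variant of the last step (not part of the original report) is to force synchronous kernel launches and run under NVIDIA's compute-sanitizer, assuming a CUDA toolkit recent enough to ship that tool:

```sh
# Diagnostic sketch (assumes compute-sanitizer is installed with the CUDA toolkit).
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the failing launch
# is reported immediately instead of at a later synchronization point.
CUDA_LAUNCH_BLOCKING=1 compute-sanitizer ./main \
  -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
```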
Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use GitHub's markdown to cleanly format your logs for easy readability.
Full log of the failing run:
```
Log start
main: build = 1428 (6961c4b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698634798
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 543 tensors from /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor   0: token_embd.weight       q4_0 [ 6144, 100008, 1, 1 ]
llama_model_loader: - tensor   1: blk.0.attn_q.weight     q4_0 [ 6144, 6144, 1, 1 ]
llama_model_loader: - tensor   2: blk.0.attn_k.weight     q4_0 [ 6144, 1024, 1, 1 ]
... ...
llama_model_loader: - tensor 540: blk.59.ffn_norm.weight  f32  [ 6144, 1, 1, 1 ]
llama_model_loader: - tensor 541: output_norm.weight      f32  [ 6144, 1, 1, 1 ]
llama_model_loader: - tensor 542: output.weight           q6_K [ 6144, 100008, 1, 1 ]
llama_model_loader: - kv   0: general.architecture            str
llama_model_loader: - kv   1: general.name                    str
... ...                      tokenizer.ggml.eos_token_id      u32
llama_model_loader: - kv  19: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  20: general.quantization_version    u32
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  421 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 9/100008 vs 8/100008 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100008
llm_load_print_meta: n_merges         = 99743
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 60
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: model type       = 30B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 33.69 B
llm_load_print_meta: model size       = 17.80 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 100006 '[CLS]'
llm_load_print_meta: EOS token        = 100007 ''
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 129 'Ä'
llm_load_tensors: ggml ctx size = 0.18 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 329.80 MB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: VRAM used: 17898.53 MB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.25
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 480.00 MB
llama_new_context_with_model: kv self size = 480.00 MB
llama_new_context_with_model: compute buffer total size = 122.13 MB
llama_new_context_with_model: VRAM scratch buffer: 116.00 MB
llama_new_context_with_model: total VRAM used: 18494.53 MB (model: 17898.53 MB, context: 596.00 MB)

system_info: n_threads = 8 / 96 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling:
  repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
  top_k = 10000, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.200
  mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 1


== Running in interactive mode. ==

CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0
```
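CUDA error 9 is cudaErrorInvalidConfiguration, i.e. the driver rejected the launch parameters of a kernel, which fits the general CUDA + BPE-model problem mentioned at the top of the thread. Since that problem is reported as addressed in the linked #3740 discussion, one hedged follow-up (a sketch only, not the specific workaround from that comment) is to rebuild from the current master and retry the original command:

```sh
# Sketch only: pick up whatever fix has landed upstream and rerun the same command.
cd llama.cpp
git pull
make clean
make LLAMA_CUBLAS=1
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
```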