ggerganov / llama.cpp

LLM inference in C/C++
MIT License

special token handling sometimes produces garbage output with AMD ROCM/HIP #3705

Closed: hansejo closed this 7 months ago

hansejo commented 1 year ago


Expected Behavior

Running models that use special tokens (e.g. the ChatML format) with GPU offload via hipBLAS should produce output similar to running on pure CPU.

Current Behavior

Instead, running with -ngl 35 or -ngl 32 causes the model to fill the context with hash characters ("#").
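
For illustration, the working and failing invocations differ only in the offload flag; a minimal sketch (prompt elided, full command in the failure logs below):

# CPU only -- output looks normal
./main -e -m mistral/openhermes-2-mistral-7b.Q5_K_M.gguf -p "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"

# GPU offload via HIP -- the context fills with "#"
./main -e -m mistral/openhermes-2-mistral-7b.Q5_K_M.gguf -ngl 35 -p "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"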

Environment and Context

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 5700X 8-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  55%
    CPU max MHz:         4661.7178
    CPU min MHz:         2200.0000
    BogoMIPS:            6790.71
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall 
                         nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl
                          pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp
                         _legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_co
                         re perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase 
                         bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsa
                         ves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv s
                         vm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload
                          vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    32 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

$ uname -a

Artix Linux (Arch-based):

Linux art 6.5.7-artix1-1 #1 SMP PREEMPT_DYNAMIC Sun, 15 Oct 2023 22:13:26 +0000 x86_64 GNU/Linux

$ pacman -Qi rocm-hip-sdk

Name            : rocm-hip-sdk
Version         : 5.6.1-1
Description     : Develop applications using HIP and libraries for AMD platforms
Architecture    : x86_64
URL             : https://rocm.docs.amd.com/
Licenses        : custom:None
Groups          : None
Provides        : None
Depends On      : rocm-core  rocm-hip-libraries  rocm-llvm  rocm-hip-runtime  hipblas  hipcub  hipfft  hipsparse  hipsolver
                  miopen-hip  rccl  rocalution  rocblas  rocfft  rocprim  rocrand  rocsolver  rocsparse  rocthrust
Optional Deps   : None
Required By     : rocm-ml-sdk
Optional For    : None
Conflicts With  : None
Replaces        : None
Installed Size  : 0.00 B
Packager        : Torsten Keßler <tpkessler@archlinux.org>
Build Date      : Tue 05 Sep 2023 22:59:50
Install Date    : Sun 24 Sep 2023 09:24:16
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature
$ python3 --version
Python 3.11.5

$ make --version
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (GCC) 13.2.1 20230801

Failure Information (for bugs)

Building with AMD hipBLAS, enabling GPU offload (-ngl 32 and -ngl 35 tested), and using models with special tokens causes the failure below.

Models I've tested that this affects: openhermes-2-mistral-7b.Q5_K_M.gguf and dolphin-2.1-mistral-7b.Q5_K_M.gguf (checksums below).

Failure Logs

Example running openhermes-2-mistral-7b.Q5_K_M.gguf; it happens with dolphin-2.1 as well:

./main -e -m mistral/openhermes-2-mistral-7b.Q5_K_M.gguf --temp 0.7 -c 4096 --repeat_penalty 1.1 --color -p "<|im_start|>user\nExplain how Linux can win in the desktop space Apple and Microsoft invest more money into their desktop systems.<|im_end|>\n<|im_start|>assistant\n"

output:

user
Explain how Linux can win in the desktop space Apple and Microsoft invest more money into their desktop systems.
assistant
######################################################################################################################################

Example environment info:

llama.cpp$ git log | head -1
commit 465219b9143ac01db0990bbcb0a081ef72ec2008

$ sha256sum
11b6d5eff77485fe39f54e1612cc42f82b5fd4d9d5473be683e7a5c09ccfdbc1  openhermes-2-mistral-7b.Q5_K_M.gguf
786b79cf8fb54ed125ee17bfcf66cb3b3e81fbbccd770406bdc17b1ab8752a2b  dolphin-2.1-mistral-7b.Q5_K_M.gguf
staviq commented 1 year ago

If you can, please post main.xxxx.log file from the failing run.

hansejo commented 1 year ago

If you can, please post main.xxxx.log file from the failing run.

@staviq Sorry I forgot. Here is a failed run:

main.140634964126720.log

staviq commented 1 year ago

Thank you.

I can't see anything obviously wrong. Can you check whether it's reproducible with the exact seed this happened with (add -s 1697866302 to the main arguments)?
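
For example, taking the command from your first post and pinning the seed (assuming -ngl 35, i.e. the failing configuration):

./main -e -m mistral/openhermes-2-mistral-7b.Q5_K_M.gguf -ngl 35 -s 1697866302 --temp 0.7 -c 4096 --repeat_penalty 1.1 --color -p "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"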

hansejo commented 1 year ago

I can't see anything obviously wrong. Can you check whether it's reproducible with the exact seed this happened with (add -s 1697866302 to the main arguments)?

I can reproduce the same issue on many different seeds.

Using seed 1697866302 with the hermes model:

main.140428480134144.log

user
Explain what Linux is.
assistant
 S. S. S. S. S. S. S. S. S. S. S. S. S. S.S.S.SS.SSS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.S
S.S.S.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.S.S.S.SS.SS.SS.SS.SS.SS.SS.SS.SS.SS.S.S.S.SS.SS.SS.SS.SS.

Same run again

main.140517456100352.log

user
Explain what Linux is.
assistant
EDCV, I, I, I, I, I, I, I, I, I, I, I, I, Io, İ, İ, İ, İ, Í, İ, I, İ, İ, İ, IC, İ, İ [end of text]

Same run with random seed

main.140017889758208.log

user
Explain what Linux is.
assistant
 toga;s, the 1930's of this generation; that year he was born, he has been born again. The first time he was was born, and so on... The 2048's of this generation, we will call him Mark Galski. He was born in 1956.

I am the worst at what I say about me, but i guess that is true about me too

"I was born in 1978." [end of text]

Reproducing same output as in OP

main.140176545074176.log

user
Explain how Linux can win in the desktop space Apple and Microsoft invest more money into their desktop systems.
assistant
############################################################################################################################################################################################################################################

Failed run with different seed with dolphin model

main.140302194334720.log

system
You are a helpful assistant
user
Explain what linux is
assistant
############################################################################################################################################################################################################################

P.S. I am switching between the two models mentioned in the OP, so please let me know if you'd like me to stick with one. I am using an RX 580 8GB; all of the above works fine on CPU and seems to work via OpenCL, so I am narrowing this down to AMD's HIP stack.
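
To be concrete, the three builds I am comparing are roughly as follows (from memory of the Makefile flags at this revision, so treat it as a sketch):

make clean && make                    # CPU only: works
make clean && make LLAMA_CLBLAST=1    # OpenCL backend: seems to work
make clean && make LLAMA_HIPBLAS=1    # HIP/ROCm backend: garbage output with -ngl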

staviq commented 1 year ago

I am using an RX 580 8GB; all of the above works fine on CPU and seems to work via OpenCL, so I am narrowing this down to AMD's HIP stack.

Yeah, I couldn't think of any reason why this would happen, and this was one of my guesses. I'm gonna label this as AMD specific.

wizd commented 1 year ago

I have the same problem with dual 7900 XTX:

....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 160.00 MB
llama_new_context_with_model: kv self size  =  160.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 151.63 MB
llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
llama_new_context_with_model: total VRAM used: 39703.71 MB (model: 39398.70 MB, context: 305.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 100, n_keep = 0

tell me a long story순#########################################################################################
ganakee commented 9 months ago

I am having this issue with an AMD 6650M on gfx 10.3.0, as of 2024-01-19, with ROCm 6.0 on Linux (Pop!_OS 22.04).

Nearly all models produce extensive garbage output consisting of either # or \n characters. I see this when sampling with Phi, Mistral, and Llama2-chat.

A fairly simple prompt may produce hundreds or more newline (\n) responses, and the run can fail due to length.
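
For completeness: the 6650M is not an officially supported ROCm target, so "gfx 10.3.0" above refers to forcing the gfx1030 code path with the usual environment override, along the lines of (model path and layer count are placeholders):

HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m model.gguf -ngl 33 -p "prompt"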

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.