LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Regression: Since 1.45.2, only builds in NoAVX2 Mode (Old CPU), Failsafe Mode (Old CPU) #501

Closed by e576082c 10 months ago

e576082c commented 10 months ago

Expected Behavior

Building new versions from source must succeed with AVX2 ON. I could build past versions up to 1.44.2 with AVX2 ON.

Current Behavior

Does not build with AVX2 since version 1.45.2. Versions 1.46.1 and 1.47.2 also do not build with AVX2 ON. Maybe something is wrong with the makefile, or there is an undocumented step I need to take to make the new versions build with AVX2.

Environment and Context

The problem should not be with my PC. Just to be sure, I re-downloaded the source code of 1.44.2, 1.45.2, 1.46.1 and 1.47.2, and then built all of them again with make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1. All versions failed to build with AVX2, except the good old 1.44.2. Version 1.44.2 works fine, and I am happy with it. All newer versions also build, but only in "NoAVX2 Mode (Old CPU), Failsafe Mode (Old CPU)". This makes me sad.

OS: Devuan GNU/Linux, Daedalus version
Kernel: linux-xanmod-x64v3
CPU: AMD Ryzen 5 3600 (it has avx2)
RAM: 64GB
VRAM: 12GB, NVIDIA GPU (should be unrelated)

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 3600 6-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  86%
    CPU max MHz:         4208.2031
    CPU min MHz:         2200.0000
    BogoMIPS:            7200.42
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    32 MiB (2 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
$ uname -a
Linux devuan 6.1.60-x64v3-xanmod1 #0~20231026.g84ecc45 SMP PREEMPT_DYNAMIC Thu Oct 26 05:50:00 UTC x86_64 GNU/Linux
$ python3 --version
Python 3.11.2

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0

$ bash --version                                                                                                                      
GNU bash, version 5.2.15(1)-release (x86_64-pc-linux-gnu)
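
For anyone who wants to confirm the "it has avx2" claim without reading the whole flags line by eye, here is a minimal sketch. It assumes a Linux system that exposes /proc/cpuinfo; the helper name cpu_flags is made up for this illustration and is not part of koboldcpp. It also accepts the "Flags:" line as printed by lscpu above.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flag names from the first 'flags' line of cpuinfo/lscpu-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # /proc/cpuinfo uses "flags", lscpu uses "Flags"; compare case-insensitively.
        if sep and key.strip().lower() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux-only location
    if cpuinfo.exists():
        flags = cpu_flags(cpuinfo.read_text())
        print("avx2 supported:", "avx2" in flags)
    else:
        print("No /proc/cpuinfo here; this check is Linux-only.")
```

This only tells you what the CPU supports, not which instructions a given build actually uses.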

Failure Information (for bugs)

I just ran make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1. It throws a lot of gibberish, and I understand nothing from it. Please see the attached log files. All builds eventually succeed, but only version 1.44.2 seems to have AVX2 Mode ON.

Steps to Reproduce

  1. Download the "Source code (tar.gz)" from the releases page.
  2. Extract it.
  3. $ python3 -m venv ./venv
  4. $ source ./venv/bin/activate
  5. $ pip install -r requirements.txt
  6. $ make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 (Maybe I am missing something here?)
  7. $ python3 koboldcpp.py
  8. Failed to use new GUI. Reason: No module named 'packaging' (Probably unrelated)
  9. $ pip install customtkinter packaging
  10. $ python3 koboldcpp.py
  11. Check what you see: "Presets 4/6 This is the number of backends you have built and available Missing: hipBLAS (ROCm), NoAVX2 Mode (Old CPU), Failsafe Mode (Old CPU)" (hipBLAS (ROCm) must be unrelated.)
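
To see which backend libraries a build actually produced, independently of what the GUI reports, something like the following sketch could be run from the extracted source directory after step 6. It assumes the build leaves its shared libraries as *.so files in that directory; the function name built_libraries is hypothetical and only used for this illustration.

```python
from pathlib import Path


def built_libraries(directory: str = ".") -> list[str]:
    """Return the sorted names of shared libraries (*.so) found in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.so") if p.is_file())


if __name__ == "__main__":
    # Assumption: run from the directory where `make` was invoked.
    for name in built_libraries():
        print(name)
```

Comparing this listing between an old and a new version shows which libraries stopped being built, which is a separate question from whether the remaining ones use AVX2.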

Extra environment info (I did not know that this needs torch. Does it? It seems to work without it!):

$ pip list | egrep "torch|numpy|sentencepiece"
numpy         1.24.0
sentencepiece 0.1.98

Runtime info must be unrelated. Just check that the GUI says "NoAVX2 Mode (Old CPU), Failsafe Mode (Old CPU)".

Failure Logs

See the attachments: 1.44.2-good.log, 1.45.2-bad.log, 1.46.1-bad.log, 1.47.2-bad.log

LostRuins commented 10 months ago

It seems like the build is working. I don't see any build errors; is it just that your setup doesn't support customtkinter? Are you saying that the customtkinter GUI worked for you previously but not anymore?

What seems to be the problem?

e576082c commented 10 months ago

The "requirements.txt" file is certainly missing a "packaging" dependency. (Yeah, the name of the missing dependency is "packaging" lol). But I think this is unrelated to the problem.

After building successfully with make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 and failing on the first run, I always manually install "packaging" into my venv, and then the GUI starts up normally.

The problem is with missing AVX2 support in new versions. At least the GUI says so. Could it be that this is only a problem with the GUI reporting "NoAVX2 Mode (Old CPU), Failsafe Mode (Old CPU)", while under the hood AVX2 Mode is actually ON and working? Please look at what I see.

Before: old_good

After: new_bad

LostRuins commented 10 months ago

Ah okay, what you're seeing is actually not a bug but a feature. I should probably make the tooltip clearer. When building on Linux, koboldcpp applies the -march=native -mtune=native flags. These cause the compiler to target the architecture you're currently building on, rather than relying on a specific intrinsic flag. Thus, there's no point in building the non-avx2 versions. This has always been the case, and if you grabbed the *.so files from an older version like 1.44 you'd find the avx2 and non-avx2 .so files are similarly functional. So an optimization was made to remove them.
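
One way to check what -march=native resolves to on a given machine is to ask the compiler for its predefined macros and look for __AVX2__. The sketch below assumes gcc is on the PATH and accepts the usual -dM -E macro dump; the helper names defined_macros and native_macro_dump are made up for this illustration and are not part of koboldcpp or its makefile.

```python
import subprocess


def defined_macros(dump: str) -> set[str]:
    """Collect macro names from `#define NAME value` lines of a -dM dump."""
    names = set()
    for line in dump.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            names.add(parts[1])
    return names


def native_macro_dump(compiler: str = "gcc") -> str:
    """Ask the compiler which macros it predefines under -march=native."""
    result = subprocess.run(
        [compiler, "-march=native", "-dM", "-E", "-x", "c", "-"],
        input="",  # empty translation unit on stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        macros = defined_macros(native_macro_dump())
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not query the compiler:", exc)
    else:
        print("__AVX2__ defined:", "__AVX2__" in macros)
```

If __AVX2__ shows up, the compiler is allowed to emit AVX2 instructions for code built with those flags on that machine, regardless of how the resulting library is labelled in the GUI.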

e576082c commented 10 months ago

Thanks, now I get it. This was just a big misunderstanding: I had not done any performance tests and blindly believed what the GUI told me.

I checked it out, and the new versions are not slower than version 1.44.2, so it's probably true that AVX2 support is ON, and only the GUI was deterring me from using the newer versions.

Unless absolutely necessary, the new versions are not built with support for old, vintage CPUs, and this causes the number of backends to drop from 6 to 4. I thought that "more backends = more accelerators", so "fewer backends = less speed", but now I understand that's not how it works. Thanks for clarifying.

I suppose this non-issue can be closed as solved.

LostRuins commented 10 months ago

Yeah, in future I will hide those useless options on Linux to avoid confusing users.