ggerganov / llama.cpp

LLM inference in C/C++
MIT License
61.3k stars 8.76k forks

api_like_OAI.py different GPT-GUIs hanging in response #3654

Closed: ahoepf closed this issue 3 months ago

ahoepf commented 8 months ago


Current Behavior

I am trying to use different ChatGPT GUIs with api_like_OAI.py. To do this, I change the api_base to my api_like_OAI endpoint. The strange thing is that it works for some applications and not for others. The app ChatBoost works very well with api_like_OAI on Android, as does the "continue" plugin in VS Code. However, when I try the same with "librechat", "ChatGPTBox", "chatboxai.app", etc., the applications remain in a waiting state and nothing happens. In the verbose logs, I can see that the llama.cpp server has responded and that there is also a response from api_like_OAI.py.
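To narrow this down, here is a minimal probe (a sketch, not part of api_like_OAI.py) that exercises both the non-streaming and streaming paths of the proxy. It assumes the endpoint and API key from the commands below:

import requests

# Hypothetical probe, assuming api_like_OAI.py is listening on
# 127.0.0.1:8081 with --api-key 123456 (see the commands below).
BASE = "http://127.0.0.1:8081/v1"
HEADERS = {"Authorization": "Bearer 123456"}
BODY = {"model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Say hi"}]}

# Non-streaming: if this hangs, the proxy itself is stuck.
r = requests.post(BASE + "/chat/completions", headers=HEADERS, json=BODY, timeout=60)
print(r.status_code, r.json()["choices"][0]["message"]["content"])

# Streaming: many GUIs set "stream": true and wait for SSE chunks,
# so a hang only here points at the streaming code path.
with requests.post(BASE + "/chat/completions", headers=HEADERS,
                   json=dict(BODY, stream=True), stream=True, timeout=60) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())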

Environment and Context

$ python api_like_OAI.py --api-key 123456 --host 127.0.0.1 --user-name "user" --system-name "assistant"
$ ./server -c 16000 --host 127.0.0.1 -t 16 -ngl 43 -m ../../../text-generation-webui/models/mistral-7b-instruct-v0.1.Q6_K.gguf --embedding --alias gpt-3.5-turbo -v

$ lscpu

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           48 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  16
On-line CPU(s) list:     0-15
Vendor ID:               AuthenticAMD
Model name:              AMD Ryzen 7 5700G with Radeon Graphics
CPU family:              25
Model:                   80
Thread(s) per core:      2
Core(s) per socket:      8
Socket(s):               1
Stepping:                0
Frequency boost:         enabled
CPU max MHz:             4675.7808
CPU min MHz:             1400.0000
BogoMIPS:                7985.45
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization:          AMD-V
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

$ uname -a
Linux ELITE-V2 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 13:12:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version
Python 3.10.13
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version

Failure Information (for bugs)


Steps to Reproduce

OPENAI_REVERSE_PROXY=http://127.0.0.1:8081/v1/chat/completions
OPENAI_API_KEY=123456

$ python api_like_OAI.py --api-key 123456 --host 127.0.0.1 --user-name "user" --system-name "assistant"
$ ./server -c 16000 --host 0.0.0.0 -t 16 -ngl 43 -m ../../../text-generation-webui/models/mistral-7b-instruct-v0.1.Q6_K.gguf --embedding --alias gpt-3.5-turbo -v
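For a GUI-independent reproduction, the legacy openai Python package can be pointed at the proxy. This is a sketch assuming openai-python 0.x, where api_base is still a module attribute:

import openai

# Sketch using openai-python 0.x; the module-level api_base and the
# ChatCompletion interface shown here were removed in 1.x.
openai.api_base = "http://127.0.0.1:8081/v1"  # the api_like_OAI.py proxy
openai.api_key = "123456"

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # matches the --alias passed to ./server
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp["choices"][0]["message"]["content"])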

Failure Logs

There are no error logs; the client just hangs, waiting for a response.

llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q6_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  19:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q6_K
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW) 
llm_load_print_meta: general.name   = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P40) as main device
llm_load_tensors: mem required  =  102.63 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 5563.55 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16000
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2000.00 MB
llama_new_context_with_model: kv self size  = 2000.00 MB
llama_new_context_with_model: compute buffer total size = 1061.13 MB
llama_new_context_with_model: VRAM scratch buffer: 1055.25 MB
llama_new_context_with_model: total VRAM used: 8618.81 MB (model: 5563.55 MB, context: 3055.25 MB)

llama server listening at http://0.0.0.0:8080

{"timestamp":1697557681,"level":"INFO","function":"main","line":1623,"message":"HTTP server listening","hostname":"0.0.0.0","port":8080}
LynxPDA commented 8 months ago

Same problem

superchargez commented 8 months ago

{"timestamp":1697715227,"level":"INFO","function":"main","line":1755,"message":"HTTP server listening","hostname":"0.0.0.0","port":8080} Segmentation fault

I'm getting a different error than you; however, the problem only appears in the latest update.

tjohnman commented 8 months ago

I don't know if this is the problem you're having, but api_like_OAI.py does not implement /v1/models, on which some applications depend. The current version of chatbot-ui as of writing this comment, for example, will not work because it tries to fetch a list of models right off the bat.

Implementing /v1/models should be fairly easy, though.
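A quick way to check whether a given setup hits this failure mode is to probe the endpoint directly. This is a hypothetical check, reusing the host, port, and API key from this thread:

import requests

# Before a /v1/models route is added, Flask answers 404 here, and any
# GUI that lists models before chatting appears to hang or error out.
r = requests.get("http://127.0.0.1:8081/v1/models",
                 headers={"Authorization": "Bearer 123456"}, timeout=10)
print(r.status_code)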

ahoepf commented 8 months ago

For the models, this works for me:

@app.route('/v1/models', methods=['GET'])
def get_models():
    response = {
        "object": "list",
        "data": [
            {
                "id": "gpt-3.5-turbo",
                "object": "model",
                "created": 1677610602,
                "owned_by": "openai",
                "permission": [
                    {
                        "id": "modelperm-SFxxxxxxxxxxxxxxxxxxxxxxxxx",
                        "object": "model_permission",
                        "created": 1697465932,
                        "allow_create_engine": False,
                        "allow_sampling": True,
                        "allow_logprobs": True,
                        "allow_search_indices": False,
                        "allow_view": True,
                        "allow_fine_tuning": False,
                        "organization": "*",
                        "group": None,
                        "is_blocking": False
                    }
                ],
                "root": "gpt-3.5-turbo",
                "parent": None
            }
        ]
    }
    return jsonify(response)

Add this to api_like_OAI.py.
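To verify the patch, the same kind of probe should now return the advertised model (again assuming the host, port, and key used throughout this thread):

import requests

# After restarting api_like_OAI.py with the /v1/models route in place,
# the model list should contain the aliased model.
models = requests.get("http://127.0.0.1:8081/v1/models",
                      headers={"Authorization": "Bearer 123456"},
                      timeout=10).json()
print([m["id"] for m in models["data"]])  # ['gpt-3.5-turbo']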

tjohnman commented 8 months ago

Thanks @ahoepf, this solved it for me. I made the changes to the script here if anyone wants them.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.