LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Crashing when processing image using MiniCPM 2.6 #1087

Closed: jabberjabberjabber closed this issue 1 month ago

jabberjabberjabber commented 2 months ago

Describe the Issue
When sending a specific image to Kobold using MiniCPM 2.6, the Kobold server crashes. It works with Llava 1.5.

Additional Information:
Image causing crash: 2024-02-24_171639
Crash log: crash.txt
Script used:

import argparse
import base64
import requests
from PIL import Image
import io

class ImageProcessor:
    def __init__(self):
        pass

    def process_image(self, file_path):
        try:
            with Image.open(file_path) as img:
                if img.mode != 'RGB':
                    img = img.convert('RGB')

                jpeg_bytes = io.BytesIO()
                img.save(jpeg_bytes, format='JPEG', quality=95)
                jpeg_bytes.seek(0)
                base64_encoded = base64.b64encode(jpeg_bytes.getvalue()).decode('utf-8')

            return base64_encoded

        except Exception as e:
            print(f"Error processing image: {str(e)}")
            return None

class LLMProcessor:
    def __init__(self, api_url, api_password):
        self.api_url = api_url
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_password}",
        }

    def send_image_to_llm(self, base64_image):
        payload = {
            "prompt": "Describe this image in detail.",
            "max_length": 300,
            "images": [base64_image],
        }
        response = requests.post(f"{self.api_url}/api/v1/generate", json=payload, headers=self.headers)
        if response.status_code == 200:
            return response.json()["results"][0].get("text")
        else:
            print(f"Error: {response.status_code} - {response.text}")
            return None

def main():
    parser = argparse.ArgumentParser(description="Send an image to LLM API")
    parser.add_argument("image_path", help="Path to the image file")
    parser.add_argument("--api-url", default="http://localhost:5001", help="URL for the LLM API")
    parser.add_argument("--api-password", default="", help="Password for the LLM API")
    args = parser.parse_args()

    image_processor = ImageProcessor()
    llm_processor = LLMProcessor(args.api_url, args.api_password)

    base64_image = image_processor.process_image(args.image_path)
    if base64_image:
        result = llm_processor.send_image_to_llm(base64_image)
        if result:
            print("LLM Response:")
            print(result)
        else:
            print("Failed to get a response from the LLM.")
    else:
        print("Failed to process the image.")

if __name__ == "__main__":
    main()
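
(For reference, an example invocation of the script above; the filename describe_image.py is arbitrary and only the positional image path is required:)

python describe_image.py /path/to/image.jpg --api-url http://localhost:5001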

LostRuins commented 2 months ago

Could you try resizing the image to a power of 64? Try this image instead (test); does it work?
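
For reference, a minimal sketch of that kind of resize with Pillow, reading "a power of 64" as dimensions rounded up to multiples of 64 (the same interpretation the padding function later in this thread uses); resize_to_64 is a hypothetical helper, not part of the original script:

from PIL import Image

def resize_to_64(img: Image.Image) -> Image.Image:
    # Round each dimension up to the next multiple of 64 and resample.
    # (Treating "power of 64" as "multiple of 64" is an assumption.)
    new_w = max(64, ((img.width + 63) // 64) * 64)
    new_h = max(64, ((img.height + 63) // 64) * 64)
    return img.resize((new_w, new_h), Image.LANCZOS)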

jabberjabberjabber commented 2 months ago

That works.

jabberjabberjabber commented 2 months ago

Unfortunately it crashed again, even with padding:

kobold load

Welcome to KoboldCpp - Version 1.73.1
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(model='/mnt/Orlando/gguf//MiniCPM-V-2-6-Q6_K.gguf', model_param='/mnt/Orlando/gguf//MiniCPM-V-2-6-Q6_K.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=9, usecublas=['rowsplit'], usevulkan=None, useclblast=None, noblas=False, contextsize=8192, gpulayers=999, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=9, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='/mnt/Orlando/gguf/vision/projectors/minicpm-v-2-6.mmproj-model-f16.gguf', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=True, quantkv=0, forceversion=0, smartcontext=False, unpack='', hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)
==========
Loading model: /mnt/Orlando/gguf/MiniCPM-V-2-6-Q6_K.gguf

The reported GGUF Arch is: qwen2
Arch Category: 5

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from /mnt/Orlando/gguf/MiniCPM-V-2-6-Q6_K.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151666
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 5.82 GiB (6.56 BPW)
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 151644 '<|im_start|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 128244 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.52 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CUDA_Split buffer size =  5530.05 MiB
llm_load_tensors:        CPU buffer size =   425.24 MiB
llm_load_tensors:      CUDA0 buffer size =     1.27 MiB
........................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 8448
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   462.00 MiB
llama_new_context_with_model: KV self size  =  462.00 MiB, K (f16):  231.00 MiB, V (f16):  231.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   310.22 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.51 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2

Attempting to apply Multimodal Projector: /mnt/Orlando/gguf/vision/projectors/minicpm-v-2-6.mmproj-model-f16.gguf
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from /mnt/Orlando/gguf/vision/projectors/minicpm-v-2-6.mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                      clip.minicpmv_version i32              = 3
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     996.02 MB
clip_model_load: metadata size:  0.19 MB
clip_model_load: params backend buffer size =  996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

kobold error

uhd_slice_image: multiple 1
clip_image_preprocess: 1050 196
clip_image_build_graph: 1050 196
CUDA error: the function failed to launch on the GPU
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml/src/ggml-cuda.cu:1291
  cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
ggml/src/ggml-cuda.cu:103: CUDA error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

script error

Processing file: c:/users/user/pictures/2024-03-22_195319.jpg
{'SourceFile': 'c:/users/user/pictures/2024-03-22_195319.jpg', 'File:FileType': 'JPEG'}
new_height: 128, new_width: 704
Error calling API: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

padding function

    # Pads an HxWx3 uint8 numpy array up to the next multiple of 64 in each
    # dimension (requires numpy imported as np).
    def pad_to_power_of_64(self, image):
        height, width = image.shape[:2]
        new_width = ((width + 63) // 64) * 64
        new_height = ((height + 63) // 64) * 64

        if height == new_height and width == new_width:
            return image

        # Place the original image in the top-left corner of a black canvas.
        padded_image = np.zeros((new_height, new_width, 3), dtype=np.uint8)
        padded_image[:height, :width] = image

        print(f"new_height: {new_height}, new_width: {new_width}")

        return padded_image
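
(The function above works on a numpy array, while process_image in the posted script hands around PIL images; a sketch of one way to bridge the two, where pad_pil_image and the conversion steps are assumptions rather than the reporter's actual wiring:)

import numpy as np
from PIL import Image

def pad_pil_image(img: Image.Image) -> Image.Image:
    # Convert to an HxWx3 uint8 array, pad to multiples of 64, convert back.
    arr = np.array(img.convert('RGB'))
    h, w = arr.shape[:2]
    new_w = ((w + 63) // 64) * 64
    new_h = ((h + 63) // 64) * 64
    padded = np.zeros((new_h, new_w, 3), dtype=np.uint8)
    padded[:h, :w] = arr
    # Back to PIL so it can be JPEG-encoded and base64'd as in process_image.
    return Image.fromarray(padded)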

jabberjabberjabber commented 1 month ago

Seems fixed in 1.74