containers / podman-desktop-extension-ai-lab

Work with LLMs on a local environment using containers
https://podman-desktop.io/extensions/ai-lab
Apache License 2.0

While using GPU-enabled Podman Desktop, unable to get a response from the /v1/chat/completions endpoint via browser or CLI #1479

Closed · rrbanda closed this issue 1 month ago

rrbanda commented 1 month ago

Bug description

While using GPU-enabled Podman Desktop, the inference call to the model service is not responding. The browser shows a loading state indefinitely, and the same happens via the CLI.

[Six screenshots from 2024-08-05 attached, showing the hanging request in the browser and CLI]

Operating system

Apple M1

Installation Method

from ghcr.io/containers/podman-desktop-extension-ai-lab container image

Version

next (development version)

Steps to reproduce

➜  ~ curl --location 'http://localhost:49686/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "What is the capital of France?",
      "role": "user"
    }
  ]
}'
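
The request above never returns. A variant with an explicit timeout (a hypothetical diagnostic, not part of the original report) helps separate a hung connection from slow generation; curl exits with code 28 when the deadline passes with no response:

➜  ~ curl --location 'http://localhost:49686/v1/chat/completions' \
--header 'Content-Type: application/json' \
--max-time 60 \
--data '{"messages": [{"role": "user", "content": "What is the capital of France?"}]}'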

Relevant log output

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: llvmpipe (LLVM 16.0.6, 128 bits) | uma: 0 | fp16: 1 | warp size: 4
ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want.
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /models/granite-7b-lab-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = launch-day
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32008
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32008]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32008]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32008]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 32001
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 267/32008 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32008
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = launch-day
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32001 '<|pad|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.33 MiB
llm_load_tensors:    Vulkan0 buffer size =  3820.96 MiB
warning: failed to mlock 74489856-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =    70.52 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =   164.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    12.00 MiB
llama_new_context_with_model: graph nodes  = 1060
llama_new_context_with_model: graph splits = 2
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>'+ '\n' + message['content'] + '\n'}}{% elif message['role'] == 'user' %}{{'<|user|>' + '\n' + message['content'] + '\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>' + '\n' + message['content'] + '<|endoftext|>' + ('' if loop.last else '\n')}}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '32001', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '2048', 'general.name': 'launch-day', 'llama.vocab_size': '32008', 'general.file_type': '15', 'tokenizer.ggml.add_bos_token': 'false', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '32'}
INFO:     Started server process [2]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:33062 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:56268 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:40832 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:40846 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46936 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46942 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:39170 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33932 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33938 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:36358 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:36364 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:38066 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:45478 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:45486 - "GET /docs HTTP/1.1" 200 OK
INFO:     192.168.127.1:29758 - "GET /docs HTTP/1.1" 200 OK
INFO:     192.168.127.1:29758 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     127.0.0.1:39534 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:39544 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46702 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:45256 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:45266 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:34738 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:34742 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33400 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33172 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33184 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:38076 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:38082 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60184 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:55038 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:55048 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:39248 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:39250 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60950 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:50824 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:50832 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:41906 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:41914 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:33850 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:54576 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:54580 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:57946 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46892 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46906 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:52190 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:52194 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:58100 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:58108 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46058 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60862 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60874 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:43978 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:43988 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:46984 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:43916 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:43932 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:37306 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:37312 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:49826 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:37162 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:37164 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60790 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:60796 - "GET /docs HTTP/1.1" 200 OK
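
Note the warning at the top of the log: the only Vulkan device found is llvmpipe, Mesa's software rasterizer, so the 33 layers "offloaded to GPU" are actually executing on the CPU inside the podman machine VM. That would explain both the unresponsive endpoint and the vCPU spike reported below. The mlock failure is secondary and is governed by RLIMIT_MEMLOCK. A quick way to confirm which Vulkan device the container sees (a diagnostic sketch; assumes vulkaninfo is available in the image and <container> names the inference-server container):

podman exec <container> vulkaninfo --summary    # llvmpipe here means CPU fallback, not a real GPU
podman run --ulimit memlock=-1:-1 ...           # lifts the memlock limit flagged in the log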

Additional context

No response

rrbanda commented 1 month ago

~ podman --version
podman version 5.2.0

rrbanda commented 1 month ago

{
  "Id": "6682db1c0a967dab2c776059f9a3beaac2fa2604aba7800a779f0dbc4c073cf2",
  "Created": "2024-08-06T03:19:03.594571785Z",
  "Path": "sh",
  "Args": [
    "run.sh"
  ],
  "State": {
    "Status": "running",
    "Running": true,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 2298,
    "ExitCode": 0,
    "Error": "",
    "StartedAt": "2024-08-06T03:19:04.019449016Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Health": {
      "Status": "healthy",
      "FailingStreak": 0,
      "Log": [
        {
          "Start": "2024-08-05T23:33:19.947685157-04:00",
          "End": "2024-08-05T23:33:20.179546093-04:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2024-08-05T23:33:25.941785352-04:00",
          "End": "2024-08-05T23:33:26.190816714-04:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2024-08-05T23:33:31.950996882-04:00",
          "End": "2024-08-05T23:33:32.197648029-04:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2024-08-05T23:33:37.938927767-04:00",
          "End": "2024-08-05T23:33:38.208537981-04:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2024-08-05T23:33:43.958068827-04:00",
          "End": "2024-08-05T23:33:44.2248957-04:00",
          "ExitCode": 0,
          "Output": ""
        }
      ]
    }
  },
  "Image": "sha256:496fcef1d8856ef2bf37cd0928ae4f233f5bdbdf44c61571d1689a085cf2c2e5",
  "ResolvConfPath": "/run/containers/storage/overlay-containers/6682db1c0a967dab2c776059f9a3beaac2fa2604aba7800a779f0dbc4c073cf2/userdata/resolv.conf",
  "HostnamePath": "/run/containers/storage/overlay-containers/6682db1c0a967dab2c776059f9a3beaac2fa2604aba7800a779f0dbc4c073cf2/userdata/hostname",
  "HostsPath": "/run/containers/storage/overlay-containers/6682db1c0a967dab2c776059f9a3beaac2fa2604aba7800a779f0dbc4c073cf2/userdata/hosts",
  "LogPath": "",
  "Name": "/infallible_lamarr",
  "RestartCount": 0,
  "Driver": "overlay",
  "Platform": "linux",
  "MountLabel": "system_u:object_r:container_file_t:s0:c1022,c1023",
  "ProcessLabel": "",
  "AppArmorProfile": "",
  "ExecIDs": [
    "9ab8555aaad3f13835221d6bcd3a79c29cc4aed2fb73dbe68f3c8810d6d39df2"
  ],
  "HostConfig": {
    "Binds": [
      "/Users/raghurambanda/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.instructlab.granite-7b-lab-GGUF:/models:rw,rprivate,rbind"
    ],
    "ContainerIDFile": "",
    "LogConfig": {
      "Type": "journald",
      "Config": null
    },
    "NetworkMode": "bridge",
    "PortBindings": {
      "8000/tcp": [
        {
          "HostIp": "0.0.0.0",
          "HostPort": "49686"
        }
      ]
    },
    "RestartPolicy": {
      "Name": "no",
      "MaximumRetryCount": 0
    },
    "AutoRemove": false,
    "VolumeDriver": "",
    "VolumesFrom": null,
    "ConsoleSize": [
      0,
      0
    ],
    "Annotations": {
      "io.container.manager": "libpod",
      "io.podman.annotations.label": "disable",
      "org.opencontainers.image.stopSignal": "15",
      "org.systemd.property.KillSignal": "15",
      "org.systemd.property.TimeoutStopUSec": "uint64 10000000"
    },
    "CapAdd": [],
    "CapDrop": [],
    "CgroupnsMode": "",
    "Dns": [],
    "DnsOptions": [],
    "DnsSearch": [],
    "ExtraHosts": [],
    "GroupAdd": [],
    "IpcMode": "shareable",
    "Cgroup": "",
    "Links": null,
    "OomScoreAdj": 0,
    "PidMode": "private",
    "Privileged": false,
    "PublishAllPorts": false,
    "ReadonlyRootfs": false,
    "SecurityOpt": [
      "label=disable"
    ],
    "UTSMode": "private",
    "UsernsMode": "",
    "ShmSize": 65536000,
    "Runtime": "oci",
    "Isolation": "",
    "CpuShares": 0,
    "Memory": 0,
    "NanoCpus": 0,
    "CgroupParent": "",
    "BlkioWeight": 0,
    "BlkioWeightDevice": null,
    "BlkioDeviceReadBps": null,
    "BlkioDeviceWriteBps": null,
    "BlkioDeviceReadIOps": null,
    "BlkioDeviceWriteIOps": null,
    "CpuPeriod": 0,
    "CpuQuota": 0,
    "CpuRealtimePeriod": 0,
    "CpuRealtimeRuntime": 0,
    "CpusetCpus": "",
    "CpusetMems": "",
    "Devices": [
      {
        "PathOnHost": "/dev/dri/card0",
        "PathInContainer": "/dev/dri/card0",
        "CgroupPermissions": ""
      },
      {
        "PathOnHost": "/dev/dri/renderD128",
        "PathInContainer": "/dev/dri/renderD128",
        "CgroupPermissions": ""
      }
    ],
    "DeviceCgroupRules": null,
    "DeviceRequests": null,
    "MemoryReservation": 0,
    "MemorySwap": 0,
    "MemorySwappiness": 0,
    "OomKillDisable": false,
    "PidsLimit": 2048,
    "Ulimits": [
      {
        "Name": "RLIMIT_NPROC",
        "Hard": 4194304,
        "Soft": 4194304
      }
    ],
    "CpuCount": 0,
    "CpuPercent": 0,
    "IOMaximumIOps": 0,
    "IOMaximumBandwidth": 0,
    "MaskedPaths": null,
    "ReadonlyPaths": null
  },
  "GraphDriver": {
    "Data": {
      "LowerDir": "/var/lib/containers/storage/overlay/8a7473e15940de22d1d235ad293c0dbd124b3f6dfba2e0a185d099dee598eebb/diff:/var/lib/containers/storage/overlay/e15fbf9ea9a3f3a7040f8010ea4414179ea9b6bf740f009a0281e9f5f8d86700/diff:/var/lib/containers/storage/overlay/46009ea56de0d875916fddd521245691268c77f9b8087ea48651c57884a3724b/diff:/var/lib/containers/storage/overlay/7962580760defbe2b334758e9429ec74bf32edfad5725faa9180e9f483445797/diff:/var/lib/containers/storage/overlay/0d95c43a69d6f6bfdda3d564003300517fe99fe67ae33388a4feb601d1f3cd16/diff",
      "MergedDir": "/var/lib/containers/storage/overlay/779439d3223f809e6aae2a85a4ee4d75da466dbfebd5462380bca71db0570803/merged",
      "UpperDir": "/var/lib/containers/storage/overlay/779439d3223f809e6aae2a85a4ee4d75da466dbfebd5462380bca71db0570803/diff",
      "WorkDir": "/var/lib/containers/storage/overlay/779439d3223f809e6aae2a85a4ee4d75da466dbfebd5462380bca71db0570803/work"
    },
    "Name": "overlay"
  },
  "SizeRootFs": 0,
  "Mounts": [
    {
      "Type": "bind",
      "Source": "/Users/raghurambanda/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.instructlab.granite-7b-lab-GGUF",
      "Destination": "/models",
      "Mode": "",
      "RW": true,
      "Propagation": "rprivate"
    }
  ],
  "Config": {
    "Hostname": "6682db1c0a96",
    "Domainname": "",
    "User": "1001",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "ExposedPorts": {
      "49686/tcp": {},
      "8000/tcp": {},
      "8080/tcp": {}
    },
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "STI_SCRIPTS_PATH=/usr/libexec/s2i",
      "MODEL_PATH=/models/granite-7b-lab-Q4_K_M.gguf",
      "PROMPT_COMMAND=. /opt/app-root/bin/activate",
      "HOME=/opt/app-root/src",
      "HOST=0.0.0.0",
      "FORCE_CMAKE=1",
      "CNB_USER_ID=1001",
      "SUMMARY=Platform for building and running Python 3.11 applications",
      "PLATFORM=el9",
      "MODEL_CHAT_FORMAT=openchat",
      "ENV=/opt/app-root/bin/activate",
      "LC_ALL=en_US.UTF-8",
      "DESCRIPTION=Python 3.11 available as container is a base platform for building and running various Python 3.11 applications and frameworks. Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.",
      "CMAKE_ARGS=-DLLAMA_VULKAN=on",
      "PIP_NO_CACHE_DIR=off",
      "PYTHONIOENCODING=UTF-8",
      "CNB_GROUP_ID=0",
      "NODEJS_VER=20",
      "GPU_LAYERS=-1",
      "container=oci",
      "PYTHONUNBUFFERED=1",
      "BASH_ENV=/opt/app-root/bin/activate",
      "PORT=8000",
      "APP_ROOT=/opt/app-root",
      "PATH=/opt/app-root/src/.local/bin/:/opt/app-root/src/bin:/opt/app-root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "CNB_STACK_ID=com.redhat.stacks.ubi9-python-311",
      "STI_SCRIPTS_URL=image:///usr/libexec/s2i",
      "LANG=en_US.UTF-8",
      "PYTHON_VERSION=3.11",
      "HOSTNAME=6682db1c0a96"
    ],
    "Cmd": [],
    "Healthcheck": {
      "Test": [
        "CMD-SHELL",
        "curl -sSf localhost:8000/docs > /dev/null"
      ],
      "Interval": 5000000000,
      "Timeout": 30000000000,
      "Retries": 20
    },
    "Image": "quay.io/ai-lab/llamacpp-python-vulkan:latest",
    "Volumes": null,
    "WorkingDir": "/locallm",
    "Entrypoint": [
      "sh",
      "run.sh"
    ],
    "OnBuild": null,
    "Labels": {
      "ai-lab-inference-server": "[\"hf.instructlab.granite-7b-lab-GGUF\"]",
      "api": "http://localhost:49686/v1",
      "architecture": "aarch64",
      "build-date": "2024-02-29T16:28:59",
      "com.redhat.component": "python-311-container",
      "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
      "description": "Python 3.11 available as container is a base platform for building and running various Python 3.11 applications and frameworks. Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.",
      "distribution-scope": "public",
      "docs": "http://localhost:49686/docs",
      "gpu": "Apple M1 Pro",
      "io.buildah.version": "1.23.1",
      "io.buildpacks.stack.id": "com.redhat.stacks.ubi9-python-311",
      "io.k8s.description": "Python 3.11 available as container is a base platform for building and running various Python 3.11 applications and frameworks. Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.",
      "io.k8s.display-name": "Python 3.11",
      "io.openshift.expose-services": "8080:http",
      "io.openshift.s2i.scripts-url": "image:///usr/libexec/s2i",
      "io.openshift.tags": "builder,python,python311,python-311,rh-python311",
      "io.s2i.scripts-url": "image:///usr/libexec/s2i",
      "maintainer": "SoftwareCollections.org <sclorg@redhat.com>",
      "name": "ubi9/python-311",
      "release": "52",
      "summary": "Platform for building and running Python 3.11 applications",
      "trackingId": "znu9cg",
      "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/python-311/images/1-52",
      "usage": "s2i build https://github.com/sclorg/s2i-python-container.git --context-dir=3.11/test/setup-test-app/ ubi9/python-311 python-sample-app",
      "vcs-ref": "e62e3648c350ef90416ec6891e59758f1bdfe547",
      "vcs-type": "git",
      "vendor": "Red Hat, Inc.",
      "version": "1"
    },
    "StopSignal": "15",
    "StopTimeout": 10
  },
  "NetworkSettings": {
    "Bridge": "",
    "SandboxID": "",
    "SandboxKey": "/run/netns/netns-1b68ff30-5959-b71b-32a6-74508e6411f8",
    "Ports": {
      "49686/tcp": null,
      "8000/tcp": [
        {
          "HostIp": "0.0.0.0",
          "HostPort": "49686"
        }
      ],
      "8080/tcp": null
    },
    "HairpinMode": false,
    "LinkLocalIPv6Address": "",
    "LinkLocalIPv6PrefixLen": 0,
    "SecondaryIPAddresses": null,
    "SecondaryIPv6Addresses": null,
    "EndpointID": "",
    "Gateway": "10.88.0.1",
    "GlobalIPv6Address": "",
    "GlobalIPv6PrefixLen": 0,
    "IPAddress": "10.88.0.2",
    "IPPrefixLen": 16,
    "IPv6Gateway": "",
    "MacAddress": "66:d0:eb:f6:d0:a6",
    "Networks": {
      "podman": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": [
          "6682db1c0a96"
        ],
        "MacAddress": "66:d0:eb:f6:d0:a6",
        "DriverOpts": null,
        "NetworkID": "podman",
        "EndpointID": "",
        "Gateway": "10.88.0.1",
        "IPAddress": "10.88.0.2",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "DNSNames": null
      }
    }
  }
}
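
The relevant fields in this inspect output are HostConfig.Devices (the /dev/dri nodes passed through) and the GPU_LAYERS=-1 and CMAKE_ARGS=-DLLAMA_VULKAN=on environment variables. A jq filter such as the following (a sketch; assumes jq is installed and <container> names this container) pulls out just those pieces:

podman inspect <container> | jq '.[0]
  | {devices: .HostConfig.Devices,
     env: (.Config.Env | map(select(startswith("GPU") or startswith("CMAKE"))))}'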
rrbanda commented 1 month ago

Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro18,3
  Chip: Apple M1 Pro
  Total Number of Cores: 10 (8 performance and 2 efficiency)
  Memory: 32 GB
  System Firmware Version: 10151.121.1
  OS Loader Version: 10151.121.1
cbr7 commented 1 month ago

Also reproduced this on an M2 Pro; I've seen a huge increase in vCPU usage in the container:

[Screenshot of container vCPU usage attached]

Also observed that the container sometimes crashes after hanging for a while, with the error:

run.sh: line 10: 2 Killed python -m llama_cpp.server --model ${MODEL_PATH} --host ${HOST:=0.0.0.0} --port ${PORT:=8001} --n_gpu_layers ${GPU_LAYERS:=0} --clip_model_path ${CLIP_MODEL_PATH:=None} --chat_format ${MODEL_CHAT_FORMAT:="llama-2"}
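
"Killed" with no Python traceback is the signature of the kernel OOM killer terminating the server process inside the podman machine VM: per the log above, the Q4_K_M weights alone take 3.80 GiB and the KV cache another 1 GiB. A quick way to check memory headroom (a diagnostic sketch using standard podman commands, not part of the original report):

podman stats --no-stream                 # current memory use of running containers
podman machine inspect | grep -i memory  # memory allocated to the podman VM on macOS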

cbr7 commented 1 month ago

I managed to get a response from the system using the playground, but it took a very long time:

https://github.com/user-attachments/assets/dc142688-9ea3-4b57-810d-1f1f686262fd