NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
https://build.nvidia.com/
Apache License 2.0

Error deploying `llama-3.1-8b-instruct:1.1.1` using a downloaded model repository with modelcar and KServe #64

Open · xieshenzh opened 1 month ago

xieshenzh commented 1 month ago

I tried to deploy llama-3.1-8b-instruct:1.1.1 with KServe and modelcar on OpenShift AI.

**What I did:**

  1. Downloaded the model files:

    podman run --rm -e NGC_API_KEY=<API_KEY> -v /models:/opt/nim/.cache nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1 create-model-store --profile <PROFILE> --model-store /opt/nim/.cache
  2. Built a modelcar image by copying the model files, using this Dockerfile:
    FROM --platform=linux/amd64 busybox
    RUN mkdir /models && chmod 775 /models
    COPY /models/ /models/
  3. Set up the environment based on the guide.
  4. Deployed the ServingRuntime CR and set the `NIM_MODEL_NAME` environment variable to `/mnt/models/`, the path where the model files from the modelcar container are mounted.
    ---
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: nvidia-nim-llama-3.1-8b-instruct-1.1.1
    spec:
      annotations:
        prometheus.kserve.io/path: /metrics
        prometheus.kserve.io/port: '8000'
        serving.kserve.io/enable-metric-aggregation: 'true'
        serving.kserve.io/enable-prometheus-scraping: 'true'
      containers:
        - env:
            - name: NIM_MODEL_NAME
              value: /mnt/models/
            - name: NIM_SERVED_MODEL_NAME
              value: meta/llama3-8b-instruct
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: nvidia-nim-secrets
          image: 'nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1'
          name: kserve-container
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              cpu: '12'
              memory: 32Gi
            requests:
              cpu: '12'
              memory: 32Gi
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      imagePullSecrets:
        - name: ngc-secret
      protocolVersions:
        - v2
        - grpc-v2
      supportedModelFormats:
        - autoSelect: true
          name: nvidia-nim-llama-3.1-8b-instruct
          priority: 1
          version: 1.1.1
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 25Gi
          name: dshm
  5. Deployed the InferenceService CR and set the `storageUri` to the modelcar image created in step 2.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        autoscaling.knative.dev/target: '10'
      name: llama-3-1-8b-instruct-1xgpu
    spec:
      predictor:
        minReplicas: 1
        model:
          modelFormat:
            name: nvidia-nim-llama-3.1-8b-instruct
          name: ''
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          runtime: nvidia-nim-llama-3.1-8b-instruct-1.1.1
          storageUri: 'oci://<modelcar image registry and name>:<tag>'
  6. The Pod failed to start due to an error:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 702, in <module>
        engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
        engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 304, in from_engine_args
        return cls(
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 278, in __init__
        self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
      File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 505, in _init_engine
        return engine_class(*args, **kwargs)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 136, in __init__
        self._tllm_engine = TrtllmModelRunner(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 275, in __init__
        self._tllm_exec, self._cfg = self._create_engine(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
        return create_trt_executor(
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 283, in create_trt_executor
        engine_size_bytes = _get_rank_engine_file_size_bytes(profile_dir)
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 226, in _get_rank_engine_file_size_bytes
        engine_size_bytes = rank0_engine.stat().st_size
      File "/usr/lib/python3.10/pathlib.py", line 1097, in stat
        return self._accessor.stat(self, follow_symlinks=follow_symlinks)
    FileNotFoundError: [Errno 2] No such file or directory: '/models/trtllm_engine/rank0.engine'

**Issue:**
The directory containing the model files in the sidecar container is correctly mounted into the NIM container via a symlink:

(Commands executed in a terminal inside the NIM container)

$ ls -al /mnt/models
lrwxrwxrwx. 1 1001090000 1001090000 20 Aug 7 20:34 /mnt/models -> /proc/76/root/models
$ ls -al /proc/76/root/models/trtllm_engine/rank0.engine
-rw-r--r--. 1 root root 16218123260 Jul 30 18:18 /proc/76/root/models/trtllm_engine/rank0.engine

The code in the NIM container invokes the function `_get_rank_engine_file_size_bytes` in `vllm_nvext/trtllm/utils.py`, which calls [Path.resolve()](https://docs.python.org/3/library/pathlib.html#pathlib.Path.resolve) to resolve the symlink. Because `/proc/76/root` is a procfs magic link whose readlink target is `/`, the path of the rank engine file (`/proc/76/root/models/trtllm_engine/rank0.engine`) resolves to `/models/trtllm_engine/rank0.engine`, which does not exist in the NIM container.
The code therefore cannot stat the file to get its size, and raises the error above.
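
This failure can be reproduced with plain `pathlib`, independent of NIM. Below is a minimal sketch assuming the paths from the listing above, where PID 76 is the modelcar sidecar:

    from pathlib import Path

    # /mnt/models is a symlink into the sidecar's filesystem, exposed
    # through the procfs "magic" link /proc/76/root.
    mount = Path("/mnt/models")

    # stat() through the unresolved path works: the kernel follows
    # /proc/76/root into the sidecar's mount namespace on each access.
    (mount / "trtllm_engine" / "rank0.engine").stat()  # OK

    # resolve() walks the chain with readlink(): readlink("/proc/76/root")
    # returns "/", so /proc/76/root/models collapses to /models, a path
    # that only exists inside the sidecar container.
    resolved = mount.resolve()  # PosixPath('/models')
    (resolved / "trtllm_engine" / "rank0.engine").stat()
    # FileNotFoundError: [Errno 2] No such file or directory:
    #   '/models/trtllm_engine/rank0.engine'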

**What I expect:**
The NIM container should resolve the symlink to the directory containing the model files correctly.
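
For illustration, here is a hypothetical variant of `_get_rank_engine_file_size_bytes` that sidesteps the problem by statting the path as given instead of resolving it first. The actual NIM source is not reproduced here, so treat this as a sketch rather than the vendored code:

    from pathlib import Path

    def get_rank_engine_file_size_bytes(profile_dir: Path) -> int:
        # Hypothetical fix: skip Path.resolve(). stat() already follows
        # symlinks on access, so procfs magic links such as
        # /proc/<pid>/root/... keep working inside the NIM container.
        rank0_engine = profile_dir / "trtllm_engine" / "rank0.engine"
        return rank0_engine.stat().st_size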

mpaulgreen commented 1 month ago

@supertetelman, can you take a look at this issue?

mosfeets commented 2 days ago

@xieshenzh thanks for reporting this, I'm trying to do the exact same thing. I followed your procedure and got the same result with the nvidia-nim-llama-3.1-8b-instruct-1.1.2 image.

My overall thought is to pre-cache new NIM models with modelcars on each of my OpenShift nodes using an image puller, and let KServe do its thing for faster scale-up when necessary.