NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
https://build.nvidia.com/
Apache License 2.0

Error deploying `llama-3.1-8b-instruct:1.1.1` using a downloaded model repository with modelcar and KServe #64

xieshenzh commented 3 months ago

I tried to deploy `llama-3.1-8b-instruct:1.1.1` with KServe and a modelcar on OpenShift AI.

What I have done:

  1. Downloaded the model files: `podman run --rm -e NGC_API_KEY=<API_KEY> -v /models:/opt/nim/.cache nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1 create-model-store --profile <PROFILE> --model-store /opt/nim/.cache`.
  2. Built a modelcar image by copying the model files, using this Dockerfile:
    FROM --platform=linux/amd64 busybox
    RUN mkdir /models && chmod 775 /models
    COPY /models/ /models/
  3. Set up the environment based on the guide.
  4. Deployed the ServingRuntime CR and set the `NIM_MODEL_NAME` environment variable to `/mnt/models/`, which is the path where the model files from the modelcar container are mounted.
    ---
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: nvidia-nim-llama-3.1-8b-instruct-1.1.1
    spec:
      annotations:
        prometheus.kserve.io/path: /metrics
        prometheus.kserve.io/port: '8000'
        serving.kserve.io/enable-metric-aggregation: 'true'
        serving.kserve.io/enable-prometheus-scraping: 'true'
      containers:
        - env:
            - name: NIM_MODEL_NAME
              value: /mnt/models/
            - name: NIM_SERVED_MODEL_NAME
              value: meta/llama3-8b-instruct
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: nvidia-nim-secrets
          image: 'nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1'
          name: kserve-container
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              cpu: '12'
              memory: 32Gi
            requests:
              cpu: '12'
              memory: 32Gi
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      imagePullSecrets:
        - name: ngc-secret
      protocolVersions:
        - v2
        - grpc-v2
      supportedModelFormats:
        - autoSelect: true
          name: nvidia-nim-llama-3.1-8b-instruct
          priority: 1
          version: 1.1.1
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 25Gi
          name: dshm
  5. Deployed the InferenceService CR and set `storageUri` to use the modelcar image created in step 2.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        autoscaling.knative.dev/target: '10'
      name: llama-3-1-8b-instruct-1xgpu
    spec:
      predictor:
        minReplicas: 1
        model:
          modelFormat:
            name: nvidia-nim-llama-3.1-8b-instruct
          name: ''
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          runtime: nvidia-nim-llama-3.1-8b-instruct-1.1.1
          storageUri: 'oci://<modelcar image registry and name>:<tag>'
  6. The Pod failed to start due to an error:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 702, in <module>
        engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
        engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 304, in from_engine_args
        return cls(
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 278, in __init__
        self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
      File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 505, in _init_engine
        return engine_class(*args, **kwargs)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 136, in __init__
        self._tllm_engine = TrtllmModelRunner(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 275, in __init__
        self._tllm_exec, self._cfg = self._create_engine(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
        return create_trt_executor(
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 283, in create_trt_executor
        engine_size_bytes = _get_rank_engine_file_size_bytes(profile_dir)
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 226, in _get_rank_engine_file_size_bytes
        engine_size_bytes = rank0_engine.stat().st_size
      File "/usr/lib/python3.10/pathlib.py", line 1097, in stat
        return self._accessor.stat(self, follow_symlinks=follow_symlinks)
    FileNotFoundError: [Errno 2] No such file or directory: '/models/trtllm_engine/rank0.engine'

**Issue:**
The directory containing the model files in the sidecar container is correctly mounted into the NIM container via a symlink:

(Commands executed in the terminal of the NIM container)

    $ ls -al /mnt/models
    lrwxrwxrwx. 1 1001090000 1001090000 20 Aug 7 20:34 /mnt/models -> /proc/76/root/models
    $ ls -al /proc/76/root/models/trtllm_engine/rank0.engine
    -rw-r--r--. 1 root root 16218123260 Jul 30 18:18 /proc/76/root/models/trtllm_engine/rank0.engine

The NIM container code invokes the function `_get_rank_engine_file_size_bytes` in `vllm_nvext/trtllm/utils.py`, which calls [Path.resolve()](https://docs.python.org/3/library/pathlib.html#pathlib.Path.resolve) to resolve the symlink.
As a result, the path of the rank engine file (i.e. `/proc/76/root/models/trtllm_engine/rank0.engine`) is resolved to `/models/trtllm_engine/rank0.engine`, which does not exist inside the NIM container.
The code therefore cannot find `/models/trtllm_engine/rank0.engine` to read its size, and throws the error above.
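
For illustration, a rough sketch of this behavior that can be run from a terminal in the NIM container (the PID `76` and the paths are taken from the listing above and will differ per pod):

    # Illustration only: reproduce the resolve() behavior behind the traceback.
    import os
    from pathlib import Path

    engine = Path("/mnt/models/trtllm_engine/rank0.engine")

    # stat() through the symlink works: the kernel follows /proc/76/root into
    # the sidecar's mount namespace (same as the `ls -al` output above).
    print(engine.stat().st_size)

    # Path.resolve() instead walks the links in user space with readlink().
    # readlink() on the /proc/<pid>/root "magic" link returns the process's
    # root path ("/"), so the /proc/76/root prefix is dropped from the result.
    print(os.readlink("/mnt/models"))   # /proc/76/root/models
    print(engine.resolve())             # /models/trtllm_engine/rank0.engine

    # The resolved path does not exist in the NIM container's own filesystem,
    # which produces the FileNotFoundError shown in the traceback.
    engine.resolve().stat()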

**What I expect:**
The NIM container should properly resolve the symlink to the directory containing the model files.
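
For illustration, a minimal sketch (not the actual `vllm_nvext` code) of a size lookup that tolerates the modelcar symlink by calling `stat()` on the path as given instead of resolving it first:

    # Hypothetical helper for illustration only; the name and layout are assumptions.
    from pathlib import Path

    def rank0_engine_size_bytes(profile_dir: str) -> int:
        engine = Path(profile_dir) / "trtllm_engine" / "rank0.engine"
        # No resolve(): the kernel follows the /proc/<pid>/root link inside the
        # sidecar's namespace when stat() is called on the unresolved path.
        return engine.stat().st_size

    print(rank0_engine_size_bytes("/mnt/models"))  # e.g. 16218123260
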
mpaulgreen commented 3 months ago

@supertetelman can you take a look at this issue?

mosfeets commented 2 months ago

@xieshenzh thanks for reporting this; I'm trying to do the exact same thing. I followed your procedure and got the same result with the nvidia-nim-llama-3.1-8b-instruct-1.1.2 image.

My overall thought is to pre-cache new NIM models as modelcars on each of my OpenShift nodes using the image puller, and let KServe do its thing for faster scale-up when necessary.