NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
Apache License 2.0

Error to deploy `llama-3.1-8b-instruct:1.1.1` using downloaded model repository with modelcar and kserve #64

Open xieshenzh opened 1 month ago

xieshenzh commented 1 month ago

I tried to deploy `llama-3.1-8b-instruct:1.1.1` with KServe and a modelcar on OpenShift AI.

What I have done?

  1. Downloaded the model files: `podman run --rm -e NGC_API_KEY=<API_KEY> -v /models:/opt/nim/.cache nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1 create-model-store --profile <PROFILE> --model-store /opt/nim/.cache`
  2. Built a modelcar image by copying the model files, using this Dockerfile:
    FROM --platform=linux/amd64 busybox
    RUN mkdir /models && chmod 775 /models
    COPY /models/ /models/
  3. Set up the environment based on the guide.
  4. Deployed the ServingRuntime CR and set the `NIM_MODEL_NAME` environment variable to `/mnt/models/`, the path where the model files from the modelcar container are mounted:
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: nvidia-nim-llama-3.1-8b-instruct-1.1.1
    spec:
      annotations:
        prometheus.kserve.io/path: /metrics
        prometheus.kserve.io/port: '8000'
        serving.kserve.io/enable-metric-aggregation: 'true'
        serving.kserve.io/enable-prometheus-scraping: 'true'
      containers:
        - env:
            - name: NIM_MODEL_NAME
              value: /mnt/models/
            - name: NIM_SERVED_MODEL_NAME
              value: meta/llama3-8b-instruct
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: nvidia-nim-secrets
          image: 'nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1'
          name: kserve-container
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              cpu: '12'
              memory: 32Gi
            requests:
              cpu: '12'
              memory: 32Gi
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      imagePullSecrets:
        - name: ngc-secret
      protocolVersions:
        - v2
        - grpc-v2
      supportedModelFormats:
        - autoSelect: true
          name: nvidia-nim-llama-3.1-8b-instruct
          priority: 1
          version: 1.1.1
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 25Gi
          name: dshm
  5. Deployed the InferenceService CR and set the `storageUri` to the modelcar image built in step 2:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        autoscaling.knative.dev/target: '10'
      name: llama-3-1-8b-instruct-1xgpu
    spec:
      predictor:
        minReplicas: 1
        model:
          modelFormat:
            name: nvidia-nim-llama-3.1-8b-instruct
          name: ''
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          runtime: nvidia-nim-llama-3.1-8b-instruct-1.1.1
          storageUri: 'oci://<modelcar image registry and name>:<tag>'
  6. The Pod failed to start due to an error:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 702, in <module>
        engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
        engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 304, in from_engine_args
        return cls(
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 278, in __init__
        self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
      File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 505, in _init_engine
        return engine_class(*args, **kwargs)
      File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 136, in __init__
        self._tllm_engine = TrtllmModelRunner(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 275, in __init__
        self._tllm_exec, self._cfg = self._create_engine(
      File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
        return create_trt_executor(
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 283, in create_trt_executor
        engine_size_bytes = _get_rank_engine_file_size_bytes(profile_dir)
      File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 226, in _get_rank_engine_file_size_bytes
        engine_size_bytes = rank0_engine.stat().st_size
      File "/usr/lib/python3.10/pathlib.py", line 1097, in stat
        return self._accessor.stat(self, follow_symlinks=follow_symlinks)
    FileNotFoundError: [Errno 2] No such file or directory: '/models/trtllm_engine/rank0.engine'

The directory containing the model files in the sidecar container is correctly mounted into the NIM container through a symlink:

(Commands run in a terminal in the NIM container)

    $ ls -al /mnt/models
    lrwxrwxrwx. 1 1001090000 1001090000 20 Aug 7 20:34 /mnt/models -> /proc/76/root/models
    $ ls -al /proc/76/root/models/trtllm_engine/rank0.engine
    -rw-r--r--. 1 root root 16218123260 Jul 30 18:18 /proc/76/root/models/trtllm_engine/rank0.engine
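
The same check can be scripted from inside the NIM container. This is a minimal diagnostic sketch, not part of NIM: `diagnose` is a hypothetical helper that lists the symlinks along a path and reports whether the file is reachable as given and after canonicalization with `Path.resolve()`.

```python
import os
from pathlib import Path


def diagnose(path):
    """Report symlinks along `path` and whether it survives Path.resolve()."""
    p = Path(path)
    # Check every ancestor from the root down to the path itself.
    chain = list(reversed(p.parents)) + [p]
    links = [(str(c), os.readlink(c)) for c in chain if c.is_symlink()]
    resolved = p.resolve()
    return {
        "links": links,                             # (symlink, target text) pairs
        "resolved": str(resolved),                  # canonical path per resolve()
        "exists_as_given": p.exists(),              # kernel follows links on lookup
        "exists_after_resolve": resolved.exists(),  # what resolve()-then-stat sees
    }


if __name__ == "__main__":
    # Path taken from the traceback and terminal output in this issue.
    report = diagnose("/mnt/models/trtllm_engine/rank0.engine")
    for key, value in report.items():
        print(f"{key}: {value}")
```

In the setup described here, one would expect `exists_as_given` to be true and `exists_after_resolve` to be false, with `resolved` equal to the `/models/trtllm_engine/rank0.engine` path from the traceback. A likely cause is that the link text of `/proc/76/root` reads back as `/`, so canonicalization drops the `/proc/76/root` prefix.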

The NIM container code calls the function `_get_rank_engine_file_size_bytes` in `vllm_nvext/trtllm/utils.py`, which calls [Path.resolve()](https://docs.python.org/3/library/pathlib.html#pathlib.Path.resolve) to resolve the symlink.
As a result, the path to the rank engine file (i.e. `/proc/76/root/models/trtllm_engine/rank0.engine`) is resolved to `/models/trtllm_engine/rank0.engine`, which does not exist in the NIM container.
The code then fails to find `/models/trtllm_engine/rank0.engine` when reading its size and raises the error.
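
One possible direction for a fix, sketched under assumptions: read the file size from the path as given instead of canonicalizing it first, since `os.stat` already follows symlinks during lookup. The function names below are hypothetical and this is not NIM's actual code. The relative location `trtllm_engine/rank0.engine` is inferred from the path in the `FileNotFoundError` above, and whether NIM needs the resolved path for other reasons cannot be determined from this issue.

```python
import os
from pathlib import Path

# Assumption: location of the rank-0 engine relative to the model directory,
# inferred from the path reported in the FileNotFoundError.
RANK0_ENGINE = Path("trtllm_engine") / "rank0.engine"


def engine_size_resolved(model_dir):
    """Canonicalize first, then stat (the pattern described in this issue)."""
    return (Path(model_dir) / RANK0_ENGINE).resolve().stat().st_size


def engine_size_unresolved(model_dir):
    """Stat the path as given; the kernel follows symlinks during lookup."""
    return os.stat(Path(model_dir) / RANK0_ENGINE).st_size
```

With an ordinary symlink both functions return the same size. They would only differ for links such as `/proc/<pid>/root`, where the canonicalized path is not reachable from the calling container.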

**What I expect?**
The NIM container should correctly resolve the symlink to the directory containing the model files.

mpaulgreen commented 1 month ago

@supertetelman can you take a look at this issue?

mosfeets commented 2 days ago

@xieshenzh thanks for reporting this, I'm trying to do the exact same thing. I followed your procedure and got the same results with the nvidia-nim-llama-3.1-8b-instruct-1.1.2 image.

My overall thought is to pre-cache new NIM models with modelcars on each of my OpenShift nodes using image puller and let KServe do its thing for faster scale up when necessary.