NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

[QST] Status: CUDA driver version is insufficient for CUDA runtime version #1090

Open dking21st opened 6 months ago

dking21st commented 6 months ago

❓ Questions & Help

I'm using the Merlin TensorFlow container to build a Docker image, but the training job fails with this error:

[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return _run_code(code, main_globals, None,
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     exec(code, run_globals)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py", line 15, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     import merlin.models.tf as mm
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py", line 108, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.models.retrieval import (
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py", line 22, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 33, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     class ItemRetrievalTask(MultiClassClassificationTask):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 70, in ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [INFO]: sparse_operation_kit is imported
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Initialize finished, communication tool: horovod
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 491, in default_metrics
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 362, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 234, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py", line 144, in _wrap_function
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     init_method(instance, *args, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 613, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 430, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     self.total = self.add_weight("total", initializer="zeros")
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 366, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return super().add_weight(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 712, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     variable = self._add_variable_with_custom_getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py", line 489, in _add_variable_with_custom_getter
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     new_variable = getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py", line 134, in make_variable
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf1.Variable(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     raise e.with_traceback(filtered_tb) from None
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py", line 171, in __call__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf.zeros(shape, dtype)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
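
As a sanity check (just a sketch of something one could run inside the container, not part of my original job), the snippet below should show whether the driver is reachable and whether TensorFlow can see a GPU before merlin.models.tf is imported:

import subprocess

import tensorflow as tf

# If the NVIDIA driver is exposed to the container, nvidia-smi prints the
# driver and CUDA versions; if it is missing or fails, that already points
# to the problem behind the error above.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# An empty list here means TensorFlow cannot reach any GPU, so any CUDA
# initialization (like the metric variables in the traceback) will fail.
print(tf.config.list_physical_devices("GPU"))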

Here is my Dockerfile:

FROM --platform=linux/amd64 nvcr.io/nvidia/merlin/merlin-tensorflow:23.06 as prod

WORKDIR /ads_content

COPY ./data-airflow .
COPY ./ads/images/requirements.txt .

WORKDIR /root

RUN pip install tf2onnx==1.15.1 
RUN pip install -r /ads_content/requirements.txt
RUN pip install requests "urllib3<2"

WORKDIR /ads_content

ENTRYPOINT ["python3"]

I'm trying to deploy a Merlin TF model training and AWS S3 upload job using Airflow's KubernetesPodOperator and this Docker image. As I'm new to Docker and Airflow, I'm having a fair amount of trouble. I think I kept the Dockerfile pretty simple - what am I doing wrong? Should I reinstall cudf on that base image, or is it something else?
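
In case it helps, here is a minimal sketch of how I understand the KubernetesPodOperator task could request a GPU so the host driver gets exposed to the pod. The task id, image tag, and exact parameter names are illustrative assumptions on my part; depending on the cncf.kubernetes provider version the import path may be operators.kubernetes_pod and the parameter may be resources instead of container_resources.

from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

train_task = KubernetesPodOperator(
    task_id="train_merlin_model",              # hypothetical task id
    name="train-merlin-model",
    image="my-registry/merlin-train:latest",   # hypothetical tag for the image built above
    cmds=["python3"],
    arguments=["/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py"],
    # Request a GPU so the NVIDIA device plugin mounts the host driver into the
    # pod; without this (or with a host driver older than the container's CUDA
    # runtime) CUDA calls can fail with errors like the one above.
    container_resources=k8s.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    get_logs=True,
)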

rnyak commented 6 months ago

@dking21st hello. Can you please share the HW specs, CUDA version, and driver version of your AWS instance? Are you able to see nvidia-smi output on that instance?