I built a Docker image from the Merlin TensorFlow container, but the job fails with this error:
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return _run_code(code, main_globals, None,
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - exec(code, run_globals)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py", line 15, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - import merlin.models.tf as mm
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py", line 108, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - from merlin.models.tf.models.retrieval import (
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py", line 22, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 33, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - class ItemRetrievalTask(MultiClassClassificationTask):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 70, in ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [INFO]: sparse_operation_kit is imported
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Initialize finished, communication tool: horovod
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 491, in default_metrics
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 362, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 234, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py", line 144, in _wrap_function
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - init_method(instance, *args, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 613, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 430, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - self.total = self.add_weight("total", initializer="zeros")
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 366, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return super().add_weight(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 712, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - variable = self._add_variable_with_custom_getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py", line 489, in _add_variable_with_custom_getter
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - new_variable = getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py", line 134, in make_variable
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return tf1.Variable(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - raise e.with_traceback(filtered_tb) from None
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py", line 171, in __call__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return tf.zeros(shape, dtype)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Details
My Dockerfile:
FROM --platform=linux/amd64 nvcr.io/nvidia/merlin/merlin-tensorflow:23.06 as prod
WORKDIR /ads_content
COPY ./data-airflow .
COPY ./ads/images/requirements.txt .
WORKDIR /root
RUN pip install tf2onnx==1.15.1
RUN pip install -r /ads_content/requirements.txt
RUN pip install requests "urllib3<2"
WORKDIR /ads_content
ENTRYPOINT ["python3"]
I'm trying to deploy a Merlin TF model training and AWS S3 upload job using Airflow's KubernetesPodOperator and a Docker image. As I'm new to Docker and Airflow, I'm having a good amount of trouble.
I think I kept things pretty simple with my Dockerfile. What am I doing wrong?
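The error text mentions the CUDA driver, so one thing that may help narrow this down is checking what driver, if any, the pod can see before `merlin.models.tf` is imported. The following is only a minimal sketch under stated assumptions: the helper name `visible_driver_version` is hypothetical, and it assumes `nvidia-smi` is on the PATH whenever the host's NVIDIA driver is exposed to the container. It does not diagnose the cause on its own.

```python
import shutil
import subprocess
from typing import Optional


def visible_driver_version() -> Optional[str]:
    """Return the NVIDIA driver version reported by nvidia-smi, or None.

    Hypothetical helper: None means nvidia-smi is missing or failed, which
    suggests (but does not prove) that no GPU driver is exposed to the pod.
    """
    # Assumption: nvidia-smi is on PATH when the host driver is mounted.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # One line is printed per GPU; the driver version is the same for all.
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = visible_driver_version()
    if version is None:
        print("No NVIDIA driver visible in this container")
    else:
        print(f"NVIDIA driver version: {version}")
```

Running this as the first step of the pod's command would show whether the pod has a driver at all, and the reported version could then be compared against what the 23.06 container's CUDA runtime expects.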