NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[QST] best practice to build a docker image to train merlin tensorflow models? #1233

Open dking21st opened 9 months ago

❓ Questions & Help

I have tried several different Docker base images (merlin, tensorflow, rapidsai, nvidia), but I keep running into version and driver issues. I'm new to Docker and Airflow, and I keep getting confused.
I would appreciate it if anyone could look at my requirements and Dockerfile and tell me what I am doing wrong.

Details

The Dockerfile that has worked best so far, but still failed because of a cudf issue:

FROM --platform=linux/amd64 tensorflow/tensorflow:2.12.0-gpu as prod

WORKDIR /ads_content

COPY ./data-airflow .
COPY ./ads/images/requirements.txt .

RUN apt-get update && apt-get -y upgrade

# Add sudo
RUN apt-get -y install sudo

# Adding wget and bzip2
RUN apt-get install -y wget bzip2

RUN apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev gcc-x86-64-linux-gnu

WORKDIR /root

# Set requirements
RUN pip install --upgrade pip
RUN apt-get install -y git

#RAPIDs
RUN pip install --no-cache-dir \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.4.* dask-cudf-cu11==23.4.*

RUN pip install -U git+https://github.com/NVIDIA-Merlin/models.git@release-23.04
RUN pip install -U git+https://github.com/NVIDIA-Merlin/nvtabular.git@release-23.04
RUN pip install -U git+https://github.com/NVIDIA-Merlin/core.git@release-23.04
RUN pip install -U git+https://github.com/NVIDIA-Merlin/dataloader.git@release-23.04
RUN pip install merlin-systems==23.04
RUN pip install tf2onnx==1.15.1

RUN pip install -r /ads_content/requirements.txt

WORKDIR /ads_content

ENTRYPOINT ["python3"]

The problem with this image is that cudf is limited to 23.04, because the image uses Python 3.8, which cudf 23.12 does not support. This older cudf version prevented me from using the GPU for data processing.
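To see which versions actually end up inside a built image, a small check script can help. The following is only a minimal sketch: the file name `check_env.py` is hypothetical, and the package list assumes the distribution names used in the Dockerfile above (plus `merlin-models` and `nvtabular` as the names pip would register for the Merlin installs), so adjust it to whatever the image really installs.

```python
import sys
from importlib import metadata

# Assumed distribution names, taken from the Dockerfile above; edit as needed.
PACKAGES = ["tensorflow", "cudf-cu11", "dask-cudf-cu11", "merlin-models", "nvtabular"]


def collect_versions(packages=PACKAGES):
    """Return the interpreter version and {package: version or None}."""
    report = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


def visible_gpus():
    """List GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print("GPUs visible to TensorFlow:", visible_gpus())
```

Because the image's entrypoint is `python3`, the script could be run with something like `docker run --rm --gpus all <image> check_env.py`, assuming it was copied into `/ads_content`. An empty GPU list here would point to a driver or container runtime problem rather than a Python package version problem.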

With other base images, such as rapidsai or merlin, I always run into driver issues. I see that the recommended image for Merlin is nvcr.io/nvidia/merlin/merlin-tensorflow:23.06. Could someone share an example Dockerfile showing how to use an existing container, from Merlin or elsewhere, to train a TensorFlow model successfully on Airflow?