NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Add ability to build using RAPIDS nightly #231

Closed praateekmahajan closed 2 months ago

praateekmahajan commented 2 months ago

Description

We want to build a nightly version of Curator against the RAPIDS nightly builds (https://github.com/NVIDIA/NeMo-Curator/issues/133). This PR adds a simple conditional on an environment variable, IS_NIGHTLY, that selects the appropriate code path at install time. The user still needs to provide the correct index URL.
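To make the idea concrete, here is a minimal sketch of what such a conditional could look like. It is not the code from this PR: the function name, the package list, and the exact version specifiers are assumptions for illustration, and only the 24.8 / 24.10 split comes from the reviewer question below.

```python
import os

# Assumed values for illustration; 24.8 (stable) and 24.10 (nightly) are the
# versions mentioned in the reviewer question, not confirmed pins.
STABLE_RAPIDS_VERSION = "24.8"
NIGHTLY_RAPIDS_VERSION = "24.10"

# Illustrative subset of RAPIDS wheels; the real extra may list more packages.
RAPIDS_PACKAGES = ("cudf-cu12", "dask-cudf-cu12")


def rapids_requirements(is_nightly: bool) -> list[str]:
    """Build pip requirement strings for a hypothetical cuda12x extra."""
    if is_nightly:
        # Nightly wheels are published as alpha pre-releases, so the lower
        # bound names a pre-release (".0a0") to let pip consider them.
        spec = f">={NIGHTLY_RAPIDS_VERSION}.0a0,<={NIGHTLY_RAPIDS_VERSION}"
    else:
        spec = f"=={STABLE_RAPIDS_VERSION}.*"
    return [f"{pkg}{spec}" for pkg in RAPIDS_PACKAGES]


# Simplest possible reading of the flag: only the literal string "1" enables
# the nightly path, so IS_NIGHTLY=0 and an unset variable both mean stable.
IS_NIGHTLY = os.environ.get("IS_NIGHTLY", "0") == "1"
CUDA12X_REQUIRES = rapids_requirements(IS_NIGHTLY)
```

The specifiers only decide which versions pip accepts. Where pip finds the wheels is still controlled by the --extra-index-url passed on the command line.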

Usage

IS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple ".[cuda12x]"
IS_NIGHTLY=0 pip install --extra-index-url=https://pypi.nvidia.com --no-cache-dir ".[cuda12x]"

Dockerfile to test this

FROM rapidsai/ci-conda:cuda12.5.1-ubuntu22.04-py3.10

RUN conda create -y --name rapids -c conda-forge -c nvidia \
    python=3.10 \
    cuda-toolkit=12.5

RUN mamba run -n rapids pip install cython pytest --no-cache-dir
RUN mamba run -n rapids IS_NIGHTLY=0 pip install --extra-index-url=https://pypi.nvidia.com --no-cache-dir ".[cuda12x]"

Questions for reviewer

  1. Should our nightly RAPIDS_VERSION match the non-nightly one, or differ (i.e. 24.10 vs 24.8)?
  2. Should we use strtobool(..), plain os.getenv(..), or something else?
    • The problem with os.getenv(..) is that it might not be obvious to the user why IS_NIGHTLY=0 also installs the nightly version: the value arrives from the CLI as a string, and if "0" evaluates to True.
    • The con with strtobool is that it lives in distutils, which is deprecated and removed in Python 3.12, and it may be overkill.
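One possible answer to the second question is a small hand-written parser, sketched below. The helper name env_flag and the accepted spellings are assumptions, not part of this PR. It avoids the distutils dependency while still treating IS_NIGHTLY=0 as false.

```python
import os

# Accepted spellings are an assumption, loosely modelled on what
# distutils.util.strtobool used to accept.
_TRUE_VALUES = {"1", "true", "t", "yes", "y", "on"}
_FALSE_VALUES = {"0", "false", "f", "no", "n", "off", ""}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment without distutils."""
    raw = os.environ.get(name)
    if raw is None:
        # Variable not set at all: fall back to the caller's default.
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    # Fail loudly on typos instead of silently picking a build flavour.
    raise ValueError(f"{name}={raw!r} is not a recognised boolean value")
```

For comparison, bool(os.environ.get("IS_NIGHTLY")) is True whenever the variable holds any non-empty string, including "0", which is the pitfall described above.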

Checklist