Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!
https://metaflow.org
Apache License 2.0

@conda/@pypi appears to remove CUDA instructions for GPU #1904

Open EdIzaguirre opened 3 days ago

EdIzaguirre commented 3 days ago

Hello,

As the title mentions, I am trying to run a job on AWS Batch that uses a GPU. When I omit the @conda/@pypi decorator, I can see that a GPU is allocated. However, when I add an @conda/@pypi decorator, no GPU appears. Why is this? For reference, I am essentially using this CloudFormation template.

Here is some simple code to demonstrate the issue. Again, this occurs whether I use @pypi or @conda. I am using Metaflow version 2.12.5. Note that if I omit the tensorflow library in the @conda decorator, I get a ModuleNotFoundError: No module named 'tensorflow', so the @conda/@pypi decorators seem to wipe the TensorFlow library from the compute instance.

from metaflow import FlowSpec, step, batch, conda, conda_base

@conda_base(python='3.12')
class test_gpu(FlowSpec):
    @batch(gpu=1, image="docker.io/tensorflow/tensorflow:latest-gpu", queue="job-queue-gpu-metaflow",)
    @conda(libraries={'tensorflow': '2.16.1'})
    @step
    def start(self):

        import tensorflow as tf
        print("tensorflow" + tf.__version__)

        import sys
        print("Python version")
        print(sys.version)

        print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

        self.next(self.end)

    @step
    def end(self):

        print("All done. \n\n Congratulations!\n")
        return

if __name__ == '__main__':
    test_gpu()

For reference, this is the output:

2024-06-27 14:54:07.462 Creating local datastore in current directory (/Users/ed/Developer/metaflow-test-gpu/.metaflow)
2024-06-27 14:54:07.462 Bootstrapping virtual environment(s) ...
2024-06-27 15:17:46.520 Virtual environment(s) bootstrapped!
2024-06-27 15:17:47.734 Workflow starting (run-id 15):
2024-06-27 15:17:52.723 [15/start/56 (pid 5595)] Task is starting.
2024-06-27 15:17:54.545 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status SUBMITTED)...
2024-06-27 15:17:57.656 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status RUNNABLE)...
2024-06-27 15:18:27.686 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status RUNNABLE)...
2024-06-27 15:18:57.741 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status RUNNABLE)...
2024-06-27 15:19:27.845 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status RUNNABLE)...
2024-06-27 15:19:28.346 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status STARTING)...
2024-06-27 15:19:58.409 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status STARTING)...
2024-06-27 15:20:28.590 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status STARTING)...
2024-06-27 15:20:58.855 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status STARTING)...
2024-06-27 15:21:27.582 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting (status RUNNING)...
2024-06-27 15:21:26.689 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Setting up task environment.
2024-06-27 15:21:34.717 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Downloading code package...
2024-06-27 15:21:35.552 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Code package downloaded.
2024-06-27 15:21:35.590 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task is starting.
2024-06-27 15:21:35.983 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Bootstrapping virtual environment...
2024-06-27 15:22:00.642 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Environment bootstrapped.
2024-06-27 15:22:02.275 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] 2024-06-27 22:22:02.275781: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
2024-06-27 15:22:02.275 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-27 15:22:04.153 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] tensorflow2.16.1
2024-06-27 15:22:04.154 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Python version
2024-06-27 15:22:04.154 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
2024-06-27 15:22:04.154 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Num GPUs Available:  0
2024-06-27 15:22:31.518 [15/start/56 (pid 5595)] [226c27ff-1f24-431a-8afa-753aa1bbea82] Task finished with exit code 0.
2024-06-27 15:22:32.432 [15/start/56 (pid 5595)] Task finished successfully.
2024-06-27 15:22:33.633 [15/end/57 (pid 8378)] Task is starting.
2024-06-27 15:22:40.407 [15/end/57 (pid 8378)] All done.
2024-06-27 15:22:41.998 [15/end/57 (pid 8378)] 
2024-06-27 15:22:41.998 [15/end/57 (pid 8378)] Congratulations!
2024-06-27 15:22:41.998 [15/end/57 (pid 8378)] 
2024-06-27 15:22:42.793 [15/end/57 (pid 8378)] Task finished successfully.
2024-06-27 15:22:43.116 Done!

savingoyal commented 3 days ago

@EdIzaguirre - @conda and @pypi provide a clean virtual environment. The default tensorflow conda package is not GPU-compatible; have you tried the tensorflow-gpu package instead?
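
A minimal sketch of how one might tell a CPU-only TensorFlow build apart from a GPU that is simply not visible, to be run inside the step. The helper name `describe_tensorflow_build` is hypothetical; `tf.test.is_built_with_cuda()` and `tf.config.list_physical_devices()` are TensorFlow 2.x calls. If `built_with_cuda` comes back `False`, the package installed in the clean environment is likely the CPU variant, which would be consistent with the "Num GPUs Available: 0" line in the log above.

```python
def describe_tensorflow_build():
    """Summarize the TensorFlow build found in the current environment.

    Returns None when TensorFlow is not importable, which is what a clean
    @conda/@pypi environment without the package would produce.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    return {
        "version": tf.__version__,
        # False means the installed binary was compiled without CUDA support,
        # so no GPU can be used regardless of what the host provides.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Number of GPUs TensorFlow can actually see at runtime.
        "visible_gpus": len(tf.config.list_physical_devices("GPU")),
    }


if __name__ == "__main__":
    print(describe_tensorflow_build())
```

A CUDA-enabled build that still reports zero visible GPUs would point at the container or driver setup rather than at the package choice.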

EdIzaguirre commented 2 days ago

When I try using the tensorflow-gpu package instead of tensorflow in the @conda decorator, I get:

Micromamba ran into an error while setting up environment:
command '/Users/ed/.metaflowconfig/micromamba/bin/micromamba create --yes --quiet --dry-run --no-extra-safety-checks --repodata-ttl=86400 --retry-clean-cache --prefix=/var/folders/y6/vkmmb9lj41q4_0dq4q_h0xyh0000gn/T/tmprobsr8sm/prefix --channel=conda-forge --channel=Microsoft --channel=defaults requests==>=2.21.0 boto3==>=1.14.0 tensorflow-gpu==2.6.0 python==3.12' returned error (1)
    nothing provides __glibc >=2.17 needed by tensorflow-base-2.6.0-cuda110py37hb8f09f9_2

Trying to include the rmg::glibc==2.19 package from conda doesn't fix this. Notably, when I use the @pypi decorator with the tensorflow library, I do see a GPU. However, one of my packages doesn't work with @pypi, so I would like to use the @conda decorator.

romain-intel commented 1 day ago

You may be able to set CONDA_OVERRIDE_GLIBC=2.17 as an environment variable. You can also try the bleeding-edge decorators, which let you combine conda and pypi (see here: https://docs.metaflow.org/scaling/dependencies/libraries#bleeding-edge-versions-of-the-decorators) and handle the glibc requirement a little differently. +1 to @savingoyal's point about the package name and the "clean slate". Package names usually match between PyPI and conda, but not always; conda still distinguishes the GPU version (it basically adds additional dependencies). Feel free to come on Slack too for a more interactive conversation. A similar question was asked there in the last two weeks, iirc.
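
A minimal sketch of the environment-variable suggestion above, assuming the flow is launched from Python on the machine where the environment is resolved. The helper name `env_with_glibc_override` is hypothetical. Whether the micromamba call made by Metaflow honors the variable is an assumption based on the comment above, not something verified here.

```python
import os


def env_with_glibc_override(glibc_version="2.17", base_env=None):
    """Return a copy of the environment with CONDA_OVERRIDE_GLIBC set.

    The variable asks the conda solver to behave as if the given glibc
    version were present, so packages that require `__glibc >=2.17` can be
    resolved on a machine that does not report glibc (assumption: the solver
    in use reads this variable).
    """
    # Copy so that neither the caller's mapping nor os.environ is mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["CONDA_OVERRIDE_GLIBC"] = glibc_version
    return env


if __name__ == "__main__":
    print(env_with_glibc_override(base_env={})["CONDA_OVERRIDE_GLIBC"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting the flow. In a shell, the equivalent is prefixing the usual run command with `CONDA_OVERRIDE_GLIBC=2.17`.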