Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

pypi trouble - ModuleNotFoundError #1605

Closed liquidcarbon closed 1 year ago

liquidcarbon commented 1 year ago

I tried to run two @pypi-decorated flows: 1) FractalFlow from https://docs.metaflow.org/scaling/dependencies/libraries, and 2) a modified Playlist Plus tutorial flow, swapping @conda for @pypi.

In both cases, the flow errors out, saying the requested package is not found. What am I doing wrong?

thanks!

from metaflow import FlowSpec, step, Parameter, conda, conda_base, pypi

def get_python_version():
    """
    A convenience function to get the python version used to run this
    tutorial. This ensures that the conda environment is created with an
    available version of python.

    """
    import platform

    versions = {"2": "2.7.15", "3": "3.9.10"}
    return versions[platform.python_version_tuple()[0]]

class PlayListFlow(FlowSpec):
    """
    The next version of our playlist generator that adds a 'hint' parameter to
    choose a bonus movie closest to the 'hint'.

    The flow performs the following steps:

    1) Load the genre-specific statistics from the MovieStatsFlow.
    2) In parallel branches:
       - A) Build a playlist from the top films in the requested genre.
       - B) Choose a bonus movie that has the closest string edit distance to
         the user supplied hint.
    3) Join the two to create a movie playlist and display it.

    """

    genre = Parameter(
        "genre", help="Filter movies for a particular genre.", default="Sci-Fi"
    )

    hint = Parameter(
        "hint",
        help="Give a hint to the bonus movie algorithm.",
        default="Metaflow Release",
    )

    recommendations = Parameter(
        "recommendations",
        help="The number of movies recommended for the playlist.",
        default=5,
    )

    @step
    def start(self):
        """
        Use the Metaflow client to retrieve the latest successful run from our
        MovieStatsFlow and assign them as data artifacts in this flow.

        """
        # Load the analysis from the MovieStatsFlow.
        from metaflow import Flow, get_metadata

        # Print metadata provider
        print("Using metadata provider: %s" % get_metadata())

        # Load the analysis from the MovieStatsFlow.
        run = Flow("MovieStatsFlow").latest_successful_run
        print("Using analysis from '%s'" % str(run))

        # Get the dataframe from the start step before we sliced it into
        # genre-specific dataframes.
        self.dataframe = run["start"].task.data.dataframe

        # Also grab the summary statistics.
        self.genre_stats = run.data.genre_stats

        # Compute our two recommendation types in parallel.
        self.next(self.bonus_movie, self.genre_movies)

    @pypi(python='3.11.5', packages={"editdistance": "0.6.2"})  # tried 3.9.10 as well
    #@conda(libraries={"editdistance": "0.5.3"})
    @step
    def bonus_movie(self):
        """
        Use the user supplied 'hint' argument to choose a bonus movie that has
        the closest string edit distance to the hint.

        This step uses 'conda' to isolate the environment. Note that the
        package 'editdistance' need not be installed in your python
        environment.
        """

        import sys
        print(sys.executable)
        print(sys.path)

        import editdistance

        # Define a helper function to compute the similarity between two
        # strings.
        def _edit_distance(movie_title):
            return editdistance.eval(self.hint, movie_title)

        # Compute the distance and take the argmin to find the closest title.
        distance = [
            _edit_distance(movie_title) for movie_title in self.dataframe["movie_title"]
        ]
        index = distance.index(min(distance))
        self.bonus = (
            self.dataframe["movie_title"][index],
            self.dataframe["genres"][index],
        )

        self.next(self.join)

    @step
    def genre_movies(self):
        """
        Select the top performing movies from the user-specified genre.
        """

        from random import shuffle

        # For the genre of interest, generate a potential playlist using only
        # highest gross box office titles (i.e. those in the last quartile).
        genre = self.genre.lower()
        if genre not in self.genre_stats:
            self.movies = []
        else:
            df = self.genre_stats[genre]["dataframe"]
            quartiles = self.genre_stats[genre]["quartiles"]
            self.movies = [
                df["movie_title"][i]
                for i, g in enumerate(df["gross"])
                if g >= quartiles[-1]
            ]

        # Shuffle the content.
        shuffle(self.movies)

        self.next(self.join)

    @step
    def join(self, inputs):
        """
        Join our parallel branches and merge results.

        """
        self.playlist = inputs.genre_movies.movies
        self.bonus = inputs.bonus_movie.bonus

        self.next(self.end)

    @step
    def end(self):
        """
        This step simply prints out the playlist.

        """
        # Print the playlist.
        print("Playlist for movies in genre '%s'" % self.genre)
        for pick, movie in enumerate(self.playlist, start=1):
            print("Pick %d: '%s'" % (pick, movie))
            if pick >= self.recommendations:
                break

        print("Bonus Pick: '%s' from '%s'" % (self.bonus[0], self.bonus[1]))

if __name__ == "__main__":
    PlayListFlow()

Running it returns:

a@AK-Desktop-6600:~/code/testmetaflow/metaflow-tutorials$ python 04-playlist-plus/playlist.py --environment=pypi run
Metaflow 2.10.3 executing PlayListFlow for user:a
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
Bootstrapping virtual environment(s) ...
Virtual environment(s) bootstrapped!
2023-10-19 12:58:21.133 Workflow starting (run-id 1697741901127204):
2023-10-19 12:58:21.151 [1697741901127204/start/1 (pid 6589)] Task is starting.
2023-10-19 12:58:21.428 [1697741901127204/start/1 (pid 6589)] Using metadata provider: local@/home/a/code/testmetaflow
2023-10-19 12:58:21.430 [1697741901127204/start/1 (pid 6589)] Using analysis from 'Run('MovieStatsFlow/1697738002913286')'
2023-10-19 12:58:21.505 [1697741901127204/start/1 (pid 6589)] Task finished successfully.
2023-10-19 12:58:21.508 [1697741901127204/bonus_movie/2 (pid 6592)] Task is starting.
2023-10-19 12:58:21.533 [1697741901127204/genre_movies/3 (pid 6597)] Task is starting.
2023-10-19 12:58:21.860 [1697741901127204/genre_movies/3 (pid 6597)] Task finished successfully.
2023-10-19 12:58:22.059 [1697741901127204/bonus_movie/2 (pid 6592)] /home/a/.pyenv/versions/3.11.5/bin/python
2023-10-19 12:58:22.060 [1697741901127204/bonus_movie/2 (pid 6592)] <flow PlayListFlow step bonus_movie> failed:
2023-10-19 12:58:22.064 [1697741901127204/bonus_movie/2 (pid 6592)] Internal error
2023-10-19 12:58:22.065 [1697741901127204/bonus_movie/2 (pid 6592)] Traceback (most recent call last):
2023-10-19 12:58:22.066 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/cli.py", line 1172, in main
2023-10-19 12:58:22.066 [1697741901127204/bonus_movie/2 (pid 6592)] start(auto_envvar_prefix="METAFLOW", obj=state)
2023-10-19 12:58:22.066 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2023-10-19 12:58:22.066 [1697741901127204/bonus_movie/2 (pid 6592)] return self.main(*args, **kwargs)
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] rv = self.invoke(ctx)
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
2023-10-19 12:58:22.177 [1697741901127204/bonus_movie/2 (pid 6592)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] return ctx.invoke(self.callback, ctx.params)
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2023-10-19 12:58:22.178 [1697741901127204/bonus_movie/2 (pid 6592)] return callback(*args, **kwargs)
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] return f(get_current_context(), *args, **kwargs)
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/cli.py", line 581, in step
2023-10-19 12:58:22.179 [1697741901127204/bonus_movie/2 (pid 6592)] task.run_step(
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/task.py", line 583, in run_step
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] self._exec_step_function(step_func)
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages/metaflow/task.py", line 57, in _exec_step_function
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] step_function()
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] File "/home/a/code/testmetaflow/metaflow-tutorials/04-playlist-plus/playlist.py", line 96, in bonus_movie
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] import editdistance
2023-10-19 12:58:22.180 [1697741901127204/bonus_movie/2 (pid 6592)] ModuleNotFoundError: No module named 'editdistance'
2023-10-19 12:58:22.181 [1697741901127204/bonus_movie/2 (pid 6592)]
**2023-10-19 12:58:22.181 [1697741901127204/bonus_movie/2 (pid 6592)] ['/home/a/code/testmetaflow/metaflow-tutorials/04-playlist-plus', '/home/a/.pyenv/versions/3.11.5/lib/python311.zip', '/home/a/.pyenv/versions/3.11.5/lib/python3.11', '/home/a/.pyenv/versions/3.11.5/lib/python3.11/lib-dynload', '/home/a/.pyenv/versions/3.11.5/lib/python3.11/site-packages']**
2023-10-19 12:58:22.181 [1697741901127204/bonus_movie/2 (pid 6592)] Task failed.
2023-10-19 12:58:22.182 Workflow failed.
2023-10-19 12:58:22.182 Terminating 0 active tasks...
2023-10-19 12:58:22.182 Flushing logs...
    Step failure:
    Step bonus_movie (task-id 2) failed.

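Notice the printout for sys.path (marked with **): it contains only the pyenv 3.11.5 paths, and the micromamba environment in which Metaflow runs pip install is absent. That environment does show up during bootstrapping, as in this earlier attempt that pinned editdistance==0.5.3: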

Bootstrapping virtual environment(s) ...
    Pip ran into an error while setting up environment:
    command '/home/a/.metaflowconfig/micromamba/bin/micromamba run --prefix /home/a/micromamba/envs/metaflow/linux-64/91bcc23b46dd043 pip3 --disable-pip-version-check --no-input --no-color --isolated install --dry-run --only-binary=:all: --upgrade-strategy=only-if-needed --target=/tmp/tmpzpo2ku8f --report=/tmp/tmpzpo2ku8f/report.json --progress-bar=off --quiet --abi none --abi abi3 --abi cp39 --platform manylinux_2_27_x86_64 --platform manylinux_2_17_x86_64 --platform manylinux_2_20_x86_64 --platform manylinux2010_x86_64 --platform any --platform manylinux_2_26_x86_64 --platform manylinux_2_23_x86_64 --platform manylinux_2_24_x86_64 --platform manylinux2014_x86_64 --platform manylinux_2_18_x86_64 --platform manylinux1_x86_64 --platform manylinux_2_19_x86_64 --platform manylinux_2_25_x86_64 --platform manylinux_2_21_x86_64 --platform linux_x86_64 requests>=2.21.0 editdistance==0.5.3' returned error (1)
    ERROR: Could not find a version that satisfies the requirement editdistance==0.5.3 (from versions: 0.6.0, 0.6.1, 0.6.2)
    ERROR: No matching distribution found for editdistance==0.5.3

(that's why I changed to editdistance==0.6.2)
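The pip error above already lists the versions it could see for this platform. A minimal sketch for pulling that list out of the message, assuming the `(from versions: ...)` wording shown in the log; the helper name `available_versions` is made up for illustration. Note that the logged command passes `--only-binary=:all:`, so the list only covers releases that ship wheels matching the requested platform and ABI tags.

```python
import re


def available_versions(pip_error_line):
    """Extract the version list from pip's '(from versions: ...)' message.

    Hypothetical helper: it relies only on the message format seen in the log.
    """
    match = re.search(r"\(from versions: ([^)]*)\)", pip_error_line)
    if match is None:
        return []
    return [v.strip() for v in match.group(1).split(",") if v.strip()]


# The error line copied from the bootstrapping output above.
line = (
    "ERROR: Could not find a version that satisfies the requirement "
    "editdistance==0.5.3 (from versions: 0.6.0, 0.6.1, 0.6.2)"
)
versions = available_versions(line)
print(versions)             # ['0.6.0', '0.6.1', '0.6.2']
print("0.5.3" in versions)  # False
```

Since 0.5.3 is not in the list, pinning one of the listed versions (0.6.2 here) gets past the bootstrapping step.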

liquidcarbon commented 1 year ago

I just found that each execution attempt created a new environment in:

ll ~/micromamba/envs/metaflow/linux-64/
total 44
drwxr-xr-x 11 a a 4096 Oct 19 12:58 ./
drwxr-xr-x 3 a a 4096 Oct 19 12:34 ../
drwxr-xr-x 13 a a 4096 Oct 19 12:58 0d7472272e27b0c/
drwxr-xr-x 13 a a 4096 Oct 19 12:52 21a849a71a17fcd/
drwxr-xr-x 12 a a 4096 Oct 19 12:34 7264f22a78a4409/
drwxr-xr-x 13 a a 4096 Oct 19 12:35 7bd8def62c9d5d5/
drwxr-xr-x 12 a a 4096 Oct 19 12:57 91bcc23b46dd043/
drwxr-xr-x 13 a a 4096 Oct 19 12:39 9db3d6f40db3e76/
drwxr-xr-x 12 a a 4096 Oct 19 12:38 c0dca06c9139377/
drwxr-xr-x 12 a a 4096 Oct 19 12:34 d2bfadb458f041e/
drwxr-xr-x 13 a a 4096 Oct 19 12:48 db23aaf3108b082/

The requested editdistance is present in 0d7472272e27b0c/lib/python3.9/site-packages/, but Metaflow is not actually using that environment when it runs the step.
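A small stdlib-only check can confirm this from inside a step, in the same spirit as the sys.executable and sys.path prints in the flow. This is a sketch, not part of Metaflow: `describe_environment` and its `expected_prefix` argument are hypothetical names, and the prefix you pass would be one of the environment directories listed above.

```python
import importlib.util
import sys
from pathlib import Path


def describe_environment(module_name, expected_prefix=None):
    """Report the running interpreter and whether module_name is importable.

    Hypothetical diagnostic helper; module_name should be a top-level name.
    """
    spec = importlib.util.find_spec(module_name)
    report = {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "module_found": spec is not None,
        "module_origin": getattr(spec, "origin", None),
    }
    if expected_prefix is not None:
        expected = Path(expected_prefix).expanduser().resolve()
        actual = Path(sys.prefix).resolve()
        # True when the interpreter's prefix is the expected directory
        # or lives somewhere underneath it.
        report["in_expected_env"] = actual == expected or expected in actual.parents
    return report


if __name__ == "__main__":
    for key, value in describe_environment("editdistance").items():
        print(f"{key}: {value}")
```

Calling it at the top of bonus_movie, with expected_prefix set to the environment directory Metaflow bootstrapped, would show whether the step runs under that prefix or under the pyenv interpreter seen in the log.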

liquidcarbon commented 1 year ago

It's surprising that @pypi is using micromamba pip to install things. Why not regular pip from the same environment as python executable? Isn't it possible to steer clear of all serpents?

liquidcarbon commented 1 year ago

Is https://github.com/Netflix/metaflow/pull/1581 related? @savingoyal

savingoyal commented 1 year ago

@liquidcarbon we are triaging this