Netflix / metaflow-nflx-extensions

Metaflow Extensions from Netflix

This repository contains extensions for Metaflow that are in use at Netflix (or being tested at Netflix) and that are more cutting edge than what is included in the OSS Metaflow package.

You can find support for this extension on the usual Metaflow Slack.

NOTE: if you are within Netflix and are looking for the Netflix version of Metaflow, this is not it (this repository only contains a part of the Netflix internal extensions).

Netflix released Metaflow as OSS in 2019. Since then, internal development of Metaflow at Netflix has continued primarily around extensions that better support Netflix's infrastructure and integrate more seamlessly with the compute and orchestration platforms specific to Netflix. Netflix continues to improve Metaflow's OSS capabilities in collaboration with Outerbounds and, in doing so, sometimes develops functionality that is not yet ready for inclusion in the community-supported Metaflow, either because interest in the functionality is not yet clear or because the community does not have the time to properly integrate and fully test it.

This repository contains such functionality. While we do our best to ensure that it works, it does not come with the same level of support and backward-compatibility guarantees as Metaflow itself. Functionality in this package is likely to end up in the main Metaflow package, potentially with some modifications (in which case it will be removed from this package), but that is not a guarantee. If you find this functionality useful and would like to see it make its way into the main Metaflow package, let us know. Feedback is always welcome!

This extension is currently tested on Python 3.7+.

If you have any questions, feel free to open an issue here or contact us on the usual Metaflow Slack channels.

This extension currently contains:

Conda V2

Version 1.0.0 is considered stable. Some UX changes have occurred compared to previous versions; please see the docs for more information.

Version 0.2.0 of this extension is not fully backward compatible with previous versions due to a change in where packages are cached. If you are using a previous version of the extension, it is recommended that you change CONDA_MAGIC_FILE_V2, CONDA_PACKAGES_DIRNAME, and CONDA_ENVS_DIRNAME to new values so that both versions can be active at the same time.

It is likely to evolve, primarily in its implementation, as we do further testing. Feedback on what is and is not working is most welcome.

Main improvements over the standard Conda decorator in Metaflow

This decorator improves several aspects of the included Conda decorator:

Installation

To use, simply install this package alongside the metaflow package. This package requires Metaflow v2.8.3 or later.
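
For example, assuming the package is published on PyPI as metaflow-netflixext (an assumption; check the project's packaging metadata for the exact name), installation would look like:

pip install metaflow metaflow-netflixext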

Configuration

You have several configuration options that can be set in metaflow_extensions/netflix_ext/config/mfextinit_netflixext.py. Due to limitations in the OSS implementation of decorators such as batch and kubernetes prior to Metaflow v2.10, you should set these values directly in the mfextinit_netflixext.py configuration file and not in an external configuration file or through environment variables. This limitation is removed in Metaflow v2.10.
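
As an illustration only (the actual file defines its options through Metaflow's configuration helpers, and the authoritative list of values lives in the documentation), pinning a value directly in mfextinit_netflixext.py boils down to hard-coding the value you want, for example:

# In metaflow_extensions/netflix_ext/config/mfextinit_netflixext.py
# Illustrative edit: prefer one on-disk package format over the other.
CONDA_PREFERRED_FORMAT = ".conda"  # or ".tar.bz2"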

The useful configuration values are listed below:

Azure specific setup

For Azure, you need to do the following two steps once during setup:

Conda environment requirements

Your local conda environment or the cached environment (in CONDA_LOCAL_DIST_DIRNAME) needs to satisfy the following requirements:

Pure pypi package support

If you want support for environments containing only pip packages, you will also need:

Mixed (pypi + conda) package support

If you want support for environments containing both pip and conda packages, you will also need:

Support for .tar.bz2 and .conda packages

If you set CONDA_PREFERRED_FORMAT to either .tar.bz2 or .conda, some packages may need to be transmuted from one format to the other. For example, if a package is only available for download as a .tar.bz2 package but you request .conda packages, the system will transmute (convert) the .tar.bz2 package into one that ends in .conda. To do so, you need to have one of the following packages installed:

Also, due to a bug in conda and the way we use it, if your resolved environment contains .conda packages and you do not have micromamba installed, environment creation will fail.

Known issues

This plugin relies on conda, mamba, and micromamba. These technologies are being constantly improved and there are a few outstanding issues that we are aware of:

Uninstallation

Uninstalling this package will revert the behavior of the conda decorator to the one currently present in Metaflow. It is safe to switch back and forth, and there should be no conflict between the two implementations provided they do not share the same caching prefix in S3/Azure/GS and you do not use any of the new features.

Usage

Your current code with conda decorators will continue working as is. However, at this time, there is no way to "convert" previously resolved environments to this new implementation, so the first time you run Metaflow with this package, your previously resolved environments will be ignored and re-resolved.

Environments that can be resolved

The environments listed below are examples that can be resolved using Metaflow. They are given either in requirements.txt format or environment.yml format and can, for example, be passed to metaflow environment resolve using the -r or -f option respectively. They highlight some of the functionality present. Note that the same environments can also be specified directly using the @conda or @pip decorators.
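
For example, a minimal flow using the @conda step decorator might look like the sketch below (the flow and step names are purely illustrative, and the documentation describes the exact arguments accepted by this extension's decorators):

from metaflow import FlowSpec, conda, step

class CondaExampleFlow(FlowSpec):

    # Ask for an environment containing pandas for this step only.
    @conda(python="3.10", libraries={"pandas": ">=1.0.0"})
    @step
    def start(self):
        import pandas as pd
        print("pandas version:", pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    CondaExampleFlow()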

Pure "pypi" environment with non-python Conda packages
--conda-pkg ffmpeg
ffmpeg-python

The requirements.txt file above will create an environment with the pip package ffmpeg-python as well as the ffmpeg Conda package (which provides the ffmpeg executable). This is useful to have a pure pip environment (and therefore use the underlying pip ecosystem without conda-lock) while still having other non-Python packages installed.

Pure "pypi" environment with non wheel files
--conda-pkg git-lfs
# Needs LFS to build
transnetv2 @ git+https://github.com/soCzech/TransNetV2.git#main
# GIT repo
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1
# Source only distribution
outlier-detector==0.0.3
# Local package
foo @ file:///tmp/build_foo_pkg

The above requirements.txt file shows that it is possible to specify repositories directly. Note that this does not work cross-platform. Behind the scenes, Metaflow will build wheel packages and cache them.

Pypi + Conda packages
dependencies:
  - pandas>=1.0.0
  - pip:
    - tensorflow==2.7.4
    - apache-airflow[aiobotocore]

The above environment.yml file shows that it is possible to mix and match pip and conda packages. You can specify packages using "extras", but you cannot, in this form, specify pip packages that come from git repositories or from your local file system. PyPI packages that are available as wheels or source tarballs are acceptable.
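
For example, a file like the one above could be resolved ahead of time with the -f option mentioned earlier (additional options may be needed depending on your setup):

metaflow environment resolve -f environment.yml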

General environment restrictions

In general, the following restrictions are applicable:

Additional documentation

For more details, please refer to the documentation.

Technical details

This section dives a bit more into the technical aspects of this implementation.

General Concepts

Environments

An environment can be either un-resolved or resolved. An un-resolved environment is simply defined by the set of high-level user requirements that the environment must satisfy; typically, this is a list of Conda and/or PyPI packages with version constraints on them. In our case, we also include the set of channels (Conda) or sources (pip). A resolved environment contains the concrete list of packages to be installed to meet those requirements. In a resolved environment, all packages are pinned to a single unique version.

In Metaflow, two hashes identify environments and EnvID (from env_descr.py) encapsulates these hashes:

We also associate the architecture for which the environment was resolved to form the complete EnvID.

Environments are named metaflow_<req_id>_<full_id>. Note that environments that are resolved versions of the same un-resolved environment therefore share the same prefix.
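
Conceptually (this is a simplified sketch rather than the actual definitions in env_descr.py), the identifier and the naming scheme can be pictured as follows:

from typing import NamedTuple

class EnvID(NamedTuple):
    req_id: str   # hash of the user-level requirements (packages, constraints, channels/sources)
    full_id: str  # hash of the concrete, fully pinned set of packages
    arch: str     # architecture the environment was resolved for, e.g. "linux-64"

def env_name(env_id: EnvID) -> str:
    # Resolved versions of the same un-resolved environment share the
    # "metaflow_<req_id>" prefix and differ only in <full_id>.
    return "metaflow_%s_%s" % (env_id.req_id, env_id.full_id)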

Overview of the phases needed to execute a task in a Conda environment

This implementation of Conda clearly separates out the phases needed to execute a Metaflow task in a Conda environment:

The actual work is all handled in the conda.py file which contains the crux of the logic.

Detailed description of the phases
Resolving environments

All environments are resolved in parallel and independently. To do so, we use either conda-lock or mamba/conda with the --dry-run option. The processing for this takes place in resolve_environment in the conda.py file.

The input to this step is a set of user-level requirements and the output is a set of ResolvedEnvironment objects. At this point, no packages have been downloaded and the ResolvedEnvironment is most likely missing any information about caching.
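
As a rough illustration of the mamba/conda path (the real resolve_environment logic handles many more details, and the exact JSON output layout can vary across conda/mamba versions), a dry-run resolution boils down to something like:

import json
import subprocess

def dry_run_resolve(specs):
    # Ask the solver what it would install, without creating anything.
    cmd = ["mamba", "create", "--dry-run", "--json", "-n", "tmp-resolve"] + list(specs)
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    result = json.loads(proc.stdout)
    # The solved package list is typically found under actions/LINK.
    return [
        "%s==%s" % (pkg["name"], pkg["version"])
        for pkg in result.get("actions", {}).get("LINK", [])
    ]

# Hypothetical usage: dry_run_resolve(["python=3.10", "pandas>=1.0.0"])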

Caching environments

The cache_environments method in the conda.py file implements this.

There are several steps here. We perform them at once for all resolved environments that need their cache information updated, in order to exploit the fact that several environments may refer to the same package:

The ResolvedEnvironment, now with updated cache information, is also cached to S3/Azure/GS to promote sharing.

Creating environments

This is the easiest step of all. It simply consists of fetching all packages (again using the lazy_download_packages method, which will not download any package that is already present) and then using micromamba (or mamba/conda) to install them.

Detailed information about caching

There are two main things that are cached:

There are also two levels of caching:

Debug

This extension allows users to seamlessly debug their executed steps in an isolated Jupyter notebook instance with the appropriate dependencies by leveraging the Conda extension described above (note: this currently only works with the version of Conda support included in this package).

Executing the command

Let's say you have a step called fit_gbrt_for_given_param in your flow, and on executing it, the pathspec for this step/task is HousePricePredictionFlow/1199/fit_gbrt_for_given_param/150671013. To debug this step, you can run the command:

metaflow debug task HousePricePredictionFlow/1199/fit_gbrt_for_given_param/150671013 --metaflow-root-dir ~/notebooks/debug_task

Note that you can specify a partial pathspec as long as it can be resolved to a unique task:
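
For example (illustrative), if run 1199 contains only one task for this step, the following shorter form would resolve to the same task:

metaflow debug task HousePricePredictionFlow/1199/fit_gbrt_for_given_param --metaflow-root-dir ~/notebooks/debug_task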

Using the extension

Running the above command will:

It will additionally generate a notebook in the specified directory where you can debug the execution of your step line by line. For the given step definition:

@step
def fit_gbrt_for_given_param(self):
    """
    Fit GBRT with given parameters
    """

    from sklearn import ensemble
    from sklearn.model_selection import cross_val_score
    import numpy as np

    estimator = ensemble.GradientBoostingRegressor(
        n_estimators=self.input['n_estimators'],
        learning_rate=self.input['learning_rate'],
        max_depth=self.input['max_depth'],
        min_samples_split=2,
        loss='ls',
    )

    estimator.fit(self.features, self.labels)

    mses = cross_val_score(estimator, self.features, self.labels, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-mses).mean()

    self.fit = dict(
        index=int(self.index),
        params=self.input,
        rmse=rmse,
        estimator=estimator
    )

    self.next(self.select_best_model)

You will be able to access the artifacts/inputs in your generated notebook directly:

>>> print(self.input['n_estimators'])  # You can access objects using `self` as we imported a stub for it in the notebook
>>> print(self.input['learning_rate'])

You can also execute the whole function again:

>>> from sklearn import ensemble  # imports work seamlessly due to conda extension
>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> estimator = ensemble.GradientBoostingRegressor(
    n_estimators=self.input['n_estimators'],
    learning_rate=self.input['learning_rate'],
    max_depth=self.input['max_depth'],
    min_samples_split=2,
    loss='ls',
)
>>> estimator.fit(self.features, self.labels)
>>> mses = cross_val_score(estimator, self.features, self.labels, cv=5, scoring='neg_mean_squared_error')
>>> rmse = np.sqrt(-mses).mean()
>>> self.fit = dict(
    index=int(self.index),
    params=self.input,
    rmse=rmse,
    estimator=estimator
)

You can examine the effect of other hyper-parameters live, for example by setting min_samples_split = 3 and re-executing the step on the same data.