eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

Managing dependencies for different models #475

Open changhiskhan opened 2 years ago

changhiskhan commented 2 years ago

Motivation:

Different models are often developed with different dependencies/versions that may conflict with each other, so it would be difficult to create a single environment that satisfies all models.

One possible option is to extend the Spark SQL syntax to support creating environments.

  1. Add a SQL statement to create environments:

    CREATE ENVIRONMENT
    type='conda'
    name='env_for_fasterrcnn_resnet50_fpn'
    requirements='url-to-requirements.txt'
    preload=true
  2. Environments are registered in an EnvironmentRegistry which is accessible like:

    SHOW ENVIRONMENTS

    This can be configured in Spark conf.

  3. Then use the environment name in CREATE MODEL

    CREATE MODEL
    ...
    environment='env_for_fasterrcnn_resnet50_fpn'
    ...
  4. Conda is deployed on each Spark worker.

  5. Where Rikai looks for the right Python executable, we can use the environment name to identify the Python executable for that environment. If the environment doesn't exist yet, look it up in the registry and create it (see the sketch after this list).

  6. If preload=true was specified for the environment, create it automatically when the cluster starts up to save time later.

  7. We can then also think about supporting type='docker' for the environment: instead of a requirements file, give either a Dockerfile or an image URL. That said, conda does support non-Python dependencies, so before going the full Docker route we can check whether it's possible to create conda packages for the non-Python dependencies we'll need.
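
A rough Python sketch of how steps 2, 5, and 6 could fit together on a worker. All class, path, and function names here are hypothetical, not an existing Rikai API, and the requirements value is assumed to be a local requirements.txt path:

    import os
    import subprocess

    class EnvironmentRegistry:
        """Holds environment specs registered via CREATE ENVIRONMENT (step 2)."""

        def __init__(self, conda_root="/opt/conda/envs"):
            self.conda_root = conda_root
            self.specs = {}  # name -> {"type": "conda", "requirements": ..., "preload": bool}

        def register(self, name, spec):
            self.specs[name] = spec

        def python_executable(self, name):
            """Resolve the env's Python, creating the env on first use (step 5)."""
            prefix = os.path.join(self.conda_root, name)
            env_python = os.path.join(prefix, "bin", "python")
            if not os.path.exists(env_python):
                spec = self.specs[name]
                subprocess.check_call(["conda", "create", "-y", "-p", prefix, "python=3.8"])
                subprocess.check_call([env_python, "-m", "pip", "install", "-r", spec["requirements"]])
            return env_python

        def preload_all(self):
            """Create preload=true environments at cluster startup (step 6)."""
            for name, spec in self.specs.items():
                if spec.get("preload"):
                    self.python_executable(name)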

changhiskhan commented 2 years ago

MLflow supports additional logging of conda.yaml and requirements.txt (https://www.mlflow.org/docs/latest/models.html#additional-logged-files), so a first cut could just be to use the provided environment in the runner. This way we can have partial support for environment isolation without even needing to extend the SQL syntax.
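
A first-cut sketch of that idea, assuming a recent MLflow where `mlflow.pyfunc.get_model_dependencies` returns a local path to the logged requirements.txt (the model URI and env name below are made up):

    import subprocess
    import mlflow.pyfunc

    model_uri = "models:/fasterrcnn_resnet50_fpn/1"  # hypothetical registered model
    reqs_path = mlflow.pyfunc.get_model_dependencies(model_uri)

    # Build an isolated env for the runner from the model's own logged dependencies.
    env_name = "env_for_fasterrcnn_resnet50_fpn"
    subprocess.check_call(["conda", "create", "-y", "-n", env_name, "python=3.8"])
    subprocess.check_call(["conda", "run", "-n", env_name, "pip", "install", "-r", reqs_path])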

eddyxu commented 2 years ago

Where Rikai looks for the right Python executable, we can use the environment name to identify the Python executable for that environment. If the environment doesn't exist yet, look it up in the registry and create it.

Does this happen after the Spark session / executor is acquired, or before it? Will the Spark executor have a different Python than the "ML runner"?

I can see that starting another Python process in the model runner and using IPC between the executor and that "runner process" is probably feasible.
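
A rough sketch of that runner-process idea (the env path and runner module are hypothetical): the executor spawns the environment's own Python as a child process and exchanges JSON messages over its stdin/stdout.

    import json
    import subprocess

    env_python = "/opt/conda/envs/env_for_fasterrcnn_resnet50_fpn/bin/python"  # assumed layout
    proc = subprocess.Popen(
        [env_python, "-m", "rikai_runner"],  # hypothetical runner module
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    # One JSON request per line in, one JSON response per line out.
    request = {"model": "fasterrcnn_resnet50_fpn", "batch": ["s3://bucket/img_0.png"]}
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    response = json.loads(proc.stdout.readline())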

We can then also think about supporting type='docker' for the environment: instead of a requirements file, give either a Dockerfile or an image URL.

I thought about this before; it might have some restrictions w.r.t. where the Spark executor can run. If the Spark job is already running in Docker (e.g., YARN with Docker support or the k8s scheduler), this approach would then need nested Docker support.

eddyxu commented 2 years ago

Another open question would be how to support multiple models in the same query.

changhiskhan commented 2 years ago

Where Rikai looks for the right Python executable, we can use the environment name to identify the Python executable for that environment. If the environment doesn't exist yet, look it up in the registry and create it.

Does this happen after the Spark session / executor is acquired, or before it? Will the Spark executor have a different Python than the "ML runner"?

I can see that starting another Python process in the model runner and using IPC between the executor and that "runner process" is probably feasible.

I wonder if we can just use IPython's existing kernel IPC (over ZeroMQ) for this?
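
A minimal sketch of that approach using jupyter_client, assuming a kernelspec has already been installed for the environment (e.g. by running `python -m ipykernel install --name env_for_fasterrcnn_resnet50_fpn` from inside that env); the executor would then ship code and batch references to the kernel over the ZeroMQ channels:

    from jupyter_client import KernelManager

    km = KernelManager(kernel_name="env_for_fasterrcnn_resnet50_fpn")
    km.start_kernel()
    kc = km.client()
    kc.start_channels()
    kc.wait_for_ready()

    # Run something inside the isolated environment and wait for the reply.
    kc.execute("import torch; print(torch.__version__)")
    reply = kc.get_shell_msg(timeout=60)
    print(reply["content"]["status"])  # "ok" on success

    kc.stop_channels()
    km.shutdown_kernel()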

changhiskhan commented 2 years ago

We can then also think about supporting type='docker' for the environment: instead of a requirements file, give either a Dockerfile or an image URL.

I thought about this before; it might have some restrictions w.r.t. where the Spark executor can run. If the Spark job is already running in Docker (e.g., YARN with Docker support or the k8s scheduler), this approach would then need nested Docker support.

What if the executor container were able to ask YARN/k8s to start another container and configure some IPC channel between the two containers?
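
A hypothetical sketch for the k8s case, using the kubernetes Python client from inside the executor pod to launch a runner container (the image, namespace, and port are made up; the executor would then talk to the runner pod over something like gRPC or zmq):

    from kubernetes import client, config

    config.load_incluster_config()  # executor is already running inside a pod
    v1 = client.CoreV1Api()

    runner_pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="rikai-runner-fasterrcnn", labels={"app": "rikai-runner"}),
        spec=client.V1PodSpec(
            containers=[client.V1Container(
                name="runner",
                image="registry.example.com/rikai/fasterrcnn:latest",  # hypothetical model image
                ports=[client.V1ContainerPort(container_port=9999)],
            )],
            restart_policy="Never",
        ),
    )
    v1.create_namespaced_pod(namespace="rikai", body=runner_pod)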