SAME-Project / same-project

https://sameproject.ml/
Apache License 2.0

SAME Project: Initial Roadmap #7

Open rynowak opened 3 years ago

rynowak commented 3 years ago

SAME Project: Initial Roadmap

This is a roadmap document for SAME, tracking the investments we need to make during the early phases. The goal right now is to provide clarity for the execution order of features/tasks for the immediate future, and in some cases beyond. You won’t find detailed descriptions of each item here – this is a stakeholder view.

Scenarios

S1: If it works on my machine, it works in production

A data scientist downloads a notebook from a public GitHub repo. The notebook contains about 10 cells using standard Python ML and data-cleaning libraries, but, following common practice, the library imports are spread across different cells and expect the libraries to already be installed in the environment. She is able to run the notebook locally because she already has these libraries installed, but it fails in production due to missing dependencies.

She tries Project SAME to fix this. She runs `same program run` at the command line, and the CLI halts and shows a warning describing the problem with each `import <some library>` in the notebook. The CLI also outputs a line of code that uses `same.import(...)` to import the libraries. She then replaces all of the `import <some library>` lines with the single snippet of code that was recommended.
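
To make the before/after concrete, here is a minimal sketch of the kind of change the CLI is suggesting. The roadmap writes the call as `same.import(...)`; because `import` is a reserved word in Python, the sketch uses a placeholder function name, and the library names are just examples:

```python
# Before: imports scattered across notebook cells, assuming the libraries are
# already installed in the environment.
import pandas as pd
from sklearn.model_selection import train_test_split

# After: the scattered imports are replaced by the single snippet the CLI suggests.
# Placeholder shape only -- the roadmap writes the call as same.import(...), but
# "import" is a Python keyword, so the real SDK spelling and return values may differ.
import same
pd, train_test_split = same.import_packages("pandas", "sklearn.model_selection.train_test_split")
```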

Now that she's using `same.import(...)`, she can run `same program run` at the command line and the CLI will generate/update a `conda.yaml` that captures the set of imports and the required Python version. This is much easier because she can add additional references in `same.import(...)` and the `conda.yaml` will be automatically maintained for her. This works well because `conda.yaml` is a standard way of managing dependencies and it works natively with development environments like VSCode and Jupyter. Note: we are using the `conda.yaml` format, which is independent of the conda virtual environment provider.
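
For reference, this is the shape of file the scenario describes being generated and kept up to date; the concrete entries below (environment name, Python version, libraries) are illustrative, not something the roadmap specifies:

```yaml
# Illustrative conda.yaml -- the exact contents SAME generates are not specified here.
name: bird-notebook
channels:
  - conda-forge
dependencies:
  - python=3.9        # required Python version captured from the environment
  - pandas            # one entry per library referenced via same.import(...)
  - scikit-learn
```

Because this is the standard conda environment file format, the same file works directly with `conda env create -f conda.yaml` as well as with VSCode and Jupyter.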

Now she can also use SAME to run the notebook against a production system like Airflow, Azure Machine Learning, or Kubeflow, and SAME will automatically containerize her steps for her. She does not have to manually update a list of libraries or write a Dockerfile because SAME keeps the dependencies up to date for her.

S2: I can easily parameterize data-sets using SAME

A data scientist downloads a notebook from a public GitHub repo. This notebook uses a .csv file checked in to the repository as a data source and loads it using pandas. When she tries to run the notebook locally, it fails because the hardcoded path used to load the .csv file doesn't match her local setup. She could edit this path to point to the location of her data file, but she worries that the code will get messy when she needs to run this in production against a file in an S3 bucket.

She decides to use the SAME SDK to move the location of the data into configuration. She creates a `same.yaml` for the notebook and adds a named dataset (`bird_data`) to the configuration. For the local environment she configures the dataset to use the relative path `./bird_data.csv`. Then she replaces the hardcoded path in the notebook with a call to `same.data_set('bird_data')`. This is an easy change to make because she can use the result of this call with pandas, or any other standard Python data-science library. When she runs the notebook locally with `same program run`, the CLI can parse the `same.yaml` and check that the file is present - this would help if someone else were trying to run her notebook on another computer later.
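
A sketch of what this might look like: the `same.yaml` schema is not defined in this roadmap, so the field names below are assumptions used only to illustrate the idea of a named dataset with a per-environment location.

```yaml
# Hypothetical same.yaml -- field names are assumptions; only the concepts
# (a named dataset, a local relative path) come from the scenario above.
datasets:
  bird_data:
    environments:
      local:
        path: ./bird_data.csv
```

In the notebook, the hardcoded path is replaced by the configured dataset, and the result feeds straight into pandas:

```python
import pandas as pd
import same

# Resolves the 'bird_data' dataset for the active environment (a local path here).
df = pd.read_csv(same.data_set("bird_data"))
```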

Next she wants to do a dry run against Kubeflow using her test data to make sure the notebook will work there. To do this she adds a new environment to her `same.yaml` that describes how to connect to Kubeflow, and configures it to use the same local file. When she deploys the notebook to a production system with `same program run`, the local file (`bird_data.csv`) will be included in the container.
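
Continuing the hypothetical `same.yaml` sketch above, the dry-run step might add an environment entry like the following; again, every field name is an assumption, and how a Kubeflow connection is actually described is left open by the roadmap.

```yaml
# Hypothetical -- a second environment targeting Kubeflow that still uses the local
# test file, so bird_data.csv gets bundled into the container for the dry run.
environments:
  local: {}
  kubeflow:
    target: kubeflow          # connection details omitted; not specified in the roadmap
datasets:
  bird_data:
    environments:
      local:
        path: ./bird_data.csv
      kubeflow:
        path: ./bird_data.csv # same local file for the dry run
```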

Once she has done a dry run against Kubeflow with her test data, she wants to run against the production data stored in an S3 bucket and protected by an access key. To make this change she updates the Kubeflow environment in her `same.yaml` to refer to the S3 bucket. She is able to pass the access key to `same program run` as a command-line parameter to avoid checking in a secret value. Using SAME to load the dataset and parameterize it for development and production has been easy because she only had to make one code change to get started.
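
In the same hypothetical layout, the switch to production data would then be a configuration-only change; the bucket name is a placeholder.

```yaml
# Hypothetical -- the Kubeflow environment now points at production data in S3.
datasets:
  bird_data:
    environments:
      kubeflow:
        path: s3://example-bucket/bird_data.csv
```

The invocation might then look something like `same program run --environment kubeflow --access-key $S3_ACCESS_KEY`, where the flag names are placeholders rather than actual CLI options; the roadmap only says the key can be passed to `same program run` as a command-line parameter.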

Open Questions:

Roadmap

Items are listed in rough priority buckets. The team's execution is milestone-based, with milestones being roughly one month long in the normal case. We reserve the right to make an individual milestone differ from this schedule when necessary or convenient.

Milestone 1: Parity

Goal: Bootstrap the project repo (this repo), docs site, and build infrastructure.

Goal: Reach partial parity with the codebase at https://github.com/azure-octo/same-cli/ (but using Python this time)

Items:

Milestone 2: SAME import scenario

Goal: Build out the first e2e scenario that relies on same.import(...)

Items:

Milestone 3: SAME data_set scenario

Goal: Build out the second e2e scenario that relies on same.data_set(...)

Items:

gohar94 commented 3 years ago

My two cents:

  1. Having the user re-define imports could be a less-than-ideal user experience. Either we should just infer the imports (see the sketch after this list) and do the translation into same.import implicitly, never exposing that in the user code (i.e., keep it unmodified), or we should make those code changes, show them to the user (perhaps like git merge conflicts show up in VSCode), and let the user accept/reject the changes.
  2. For "external" resources touched by the user code (like files, etc.), can we infer them using some static analysis/parsing and try to upload them to cloud environments like Kubeflow without code changes by the user? In this case the user would just keep the notebook/code as it is, and the framework would take care of translations across environments.
  3. Can we add Azure Functions as an execution backend to our road map? (Of course, given we have resources to develop/test that; I can help with this effort.)
  4. Longer term goal: it might be interesting to decide what the appropriate computation backend for a given cell/part-of-cell/notebook is without having the user decide or maybe even know.
  5. Longer term goal: if we can do dependency analysis on the code and among cells, maybe we can run some cells in parallel with other cells (perhaps even on different backends), do out-of-order execution, and preemptively run some cells, throwing away the results if things change, etc.
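
On point 1, a rough illustration of what import inference could look like; this is not part of SAME today, just a generic static-analysis sketch using Python's standard ast module.

```python
# Illustrative only (not part of SAME): the kind of static analysis item 1 hints at.
# Walk a notebook's code cells and collect the top-level packages they import, so a
# tool could translate them to same.import(...) without the user editing any code.
import ast

def infer_imports(cell_sources):
    """Return the set of top-level package names imported across notebook cells."""
    packages = set()
    for source in cell_sources:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip cells containing magics or otherwise invalid Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages

# Example: two cells with scattered imports.
cells = ["import pandas as pd", "from sklearn.model_selection import train_test_split"]
print(infer_imports(cells))  # {'pandas', 'sklearn'}
```
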
aronchick commented 3 years ago
rynowak commented 3 years ago

Yes please to 5. The potential of that is HUGE - both to identify problems and to optimize workloads based on telemetry.

aronchick commented 2 years ago

Progress! We've reached parity with the old system - we now support: