SAME-Project / same-project

https://sameproject.ml/
Apache License 2.0

SAME Project: Initial Roadmap #7

Open rynowak opened 3 years ago

rynowak commented 3 years ago

SAME Project: Initial Roadmap

This is a roadmap document for SAME, tracking the investments we need to make during the early phases. The goal right now is to provide clarity for the execution order of features/tasks for the immediate future, and in some cases beyond. You won’t find detailed descriptions of each item here – this is a stakeholder view.

Scenarios

S1: If it works on my machine, it works in production

A data scientist downloads a notebook from a public GitHub repo. The notebook contains about 10 cells using standard Python ML and data-cleaning libraries, but, following common practice, the library imports are spread across different cells and expect the libraries to already be installed in the environment. She is able to run the notebook locally because she already has these libraries installed, but it fails in production due to missing dependencies.

She tries Project SAME to fix this. She runs `same program run` at the command line, and the CLI halts and shows a warning describing the problem with each `import <some library>` in the notebook. The CLI also outputs a line of code that uses `same.import(...)` to import the libraries. She then replaces all of the `import <some library>` lines with the single snippet of code that was recommended.
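
To make the before/after concrete, here is a minimal sketch of the kind of change the CLI is suggesting. The roadmap writes the call as `same.import(...)`; because `import` is a reserved word in Python, the sketch uses a placeholder function name, and the library names are just examples:

```python
# Before: imports scattered across notebook cells, assuming the libraries are
# already installed in the environment.
import pandas as pd
from sklearn.model_selection import train_test_split

# After: the scattered imports are replaced by the single snippet the CLI suggests.
# Placeholder shape only -- the roadmap writes the call as same.import(...), but
# "import" is a Python keyword, so the real SDK spelling and return values may differ.
import same
pd, train_test_split = same.import_packages("pandas", "sklearn.model_selection.train_test_split")
```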

Now that she's using `same.import(...)`, she can run `same program run` at the command line and the CLI will generate/update a `conda.yaml` that captures the set of imports and the required Python version. This is much easier because she can add additional references in `same.import(...)` and the `conda.yaml` will be automatically maintained for her. This works well because `conda.yaml` is a standard way of managing dependencies and it works natively with development environments like VSCode and Jupyter. Note: we are using the `conda.yaml` format, which is independent of the conda virtual environment provider.
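
For reference, this is the shape of file the scenario describes being generated and kept up to date; the concrete entries below (environment name, Python version, libraries) are illustrative, not something the roadmap specifies:

```yaml
# Illustrative conda.yaml -- the exact contents SAME generates are not specified here.
name: bird-notebook
channels:
  - conda-forge
dependencies:
  - python=3.9        # required Python version captured from the environment
  - pandas            # one entry per library referenced via same.import(...)
  - scikit-learn
```

Because this is the standard conda environment file format, the same file works directly with `conda env create -f conda.yaml` as well as with VSCode and Jupyter.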

Now she can also use SAME to run the notebook against a production system like Airflow, Azure Machine Learning, or Kubeflow, and SAME will automatically containerize her steps for her. She does not have to manually update a list of libraries or write a Dockerfile because SAME keeps the dependencies up to date for her.

S2: I can easily parameterize data-sets using SAME

A data scientist downloads a notebook from a public GitHub repo. This notebook uses a .csv file checked in to the repository as a data source and loads it using pandas. When she tries to run the notebook locally, it fails because the hardcoded path used to load the .csv file doesn't match her local setup. She could edit this path to point to the location of her data file, but she worries that the code will get messy when she needs to run this in production against a file in an S3 bucket.

She decides to use the SAME SDK to move the location of the data into configuration. She creates a `same.yaml` for the notebook and adds a named dataset (`bird_data`) to the configuration. For the local environment she configures the dataset to use the relative path `./bird_data.csv`. Then she replaces the hardcoded path in the notebook with a call to `same.data_set('bird_data')`. This is an easy change to make because she can use the result of this call with pandas, or any other standard Python data-science library. When she runs the notebook locally with `same program run`, the CLI can parse the `same.yaml` and check that the file is present - this would help if someone else were trying to run her notebook on another computer later.
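
A sketch of what this might look like: the `same.yaml` schema is not defined in this roadmap, so the field names below are assumptions used only to illustrate the idea of a named dataset with a per-environment location.

```yaml
# Hypothetical same.yaml -- field names are assumptions; only the concepts
# (a named dataset, a local relative path) come from the scenario above.
datasets:
  bird_data:
    environments:
      local:
        path: ./bird_data.csv
```

In the notebook, the hardcoded path is replaced by the configured dataset, and the result feeds straight into pandas:

```python
import pandas as pd
import same

# Resolves the 'bird_data' dataset for the active environment (a local path here).
df = pd.read_csv(same.data_set("bird_data"))
```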

Next she wants to do a dry run against Kubeflow using her test data to make sure the notebook will work there. To do this she adds a new environment to her `same.yaml` that describes how to connect to Kubeflow, and configures it to use the same local file. When she deploys the notebook to a production system with `same program run`, the local file (`bird_data.csv`) will be included in the container.
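
Continuing the hypothetical `same.yaml` sketch above, the dry-run step might add an environment entry like the following; again, every field name is an assumption, and how a Kubeflow connection is actually described is left open by the roadmap.

```yaml
# Hypothetical -- a second environment targeting Kubeflow that still uses the local
# test file, so bird_data.csv gets bundled into the container for the dry run.
environments:
  local: {}
  kubeflow:
    target: kubeflow          # connection details omitted; not specified in the roadmap
datasets:
  bird_data:
    environments:
      local:
        path: ./bird_data.csv
      kubeflow:
        path: ./bird_data.csv # same local file for the dry run
```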

Once she has done a dry run against Kubeflow with her test data, she wants to run against the production data stored in an S3 bucket and protected by an access key. To make this change she updates the Kubeflow environment in her `same.yaml` to refer to the S3 bucket. She is able to pass the access key to `same program run` as a command-line parameter to avoid checking in a secret value. Using SAME to load the dataset and parameterize it for development and production has been easy because she only had to make one code change to get started.
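
In the same hypothetical layout, the switch to production data would then be a configuration-only change; the bucket name is a placeholder.

```yaml
# Hypothetical -- the Kubeflow environment now points at production data in S3.
datasets:
  bird_data:
    environments:
      kubeflow:
        path: s3://example-bucket/bird_data.csv
```

The invocation might then look something like `same program run --environment kubeflow --access-key $S3_ACCESS_KEY`, where the flag names are placeholders rather than actual CLI options; the roadmap only says the key can be passed to `same program run` as a command-line parameter.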

Open Questions:

Roadmap

Items are listed in rough priority buckets. The team's execution is milestone-based, with milestones being roughly one month long in the normal case. We reserve the right to make an individual milestone differ from this schedule when necessary or convenient.

Milestone 1: Parity

Goal: Bootstrap the project repo (this repo), docs site, and build infrastructure.

Goal: Reach partial parity with the codebase at https://github.com/azure-octo/same-cli/ (but using Python this time)

Items:

Milestone 2: SAME import scenario

Goal: Build out the first e2e scenario that relies on same.import(...)

Items:

Milestone 3: SAME data_set scenario

Goal: Build out the second e2e scenario that relies on same.data_set(...)

Items:

gohar94 commented 3 years ago

My two cents:

  1. Having the user re-define imports could be a less-than-ideal user experience. Either we should just infer the imports (see the sketch after this list) and do the translation into same.import implicitly, never exposing that in the user code (i.e., keep it unmodified), or we should make those code changes, show them to the user (perhaps like git merge conflicts show up in VSCode), and let the user accept/reject the changes.
  2. For "external" resources touched by the user code (like files, etc.), can we infer them using some static analysis/parsing and try to upload them to cloud environments like Kubeflow without code changes by the user? In this case the user would just keep the notebook/code as it is, and the framework would take care of translations across environments.
  3. Can we add Azure Functions as an execution backend to our road map? (Of course, given we have resources to develop/test that; I can help with this effort.)
  4. Longer term goal: it might be interesting to decide what the appropriate computation backend for a given cell/part-of-cell/notebook is without having the user decide or maybe even know.
  5. Longer term goal: if we can do dependency analysis on the code and among cells, maybe we can run some cells in parallel with other cells (perhaps even on different backends), do out-of-order execution, and preemptively run some cells, throwing away the results if things change, etc.
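
On point 1, a rough illustration of what import inference could look like; this is not part of SAME today, just a generic static-analysis sketch using Python's standard ast module.

```python
# Illustrative only (not part of SAME): the kind of static analysis item 1 hints at.
# Walk a notebook's code cells and collect the top-level packages they import, so a
# tool could translate them to same.import(...) without the user editing any code.
import ast

def infer_imports(cell_sources):
    """Return the set of top-level package names imported across notebook cells."""
    packages = set()
    for source in cell_sources:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip cells containing magics or otherwise invalid Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages

# Example: two cells with scattered imports.
cells = ["import pandas as pd", "from sklearn.model_selection import train_test_split"]
print(infer_imports(cells))  # {'pandas', 'sklearn'}
```
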
aronchick commented 3 years ago
rynowak commented 3 years ago

Yes please to 5. The potential of that is HUGE - both to identify problems and to optimize workloads based on telemetry.

aronchick commented 2 years ago

Progress! We've reached parity with the old system - we now support: