Open · rynowak opened this issue 3 years ago
SAME Project: Initial Roadmap
This is a roadmap document for SAME, tracking the investments we need to make during the early phases. The goal right now is to provide clarity for the execution order of features/tasks for the immediate future, and in some cases beyond. You won’t find detailed descriptions of each item here – this is a stakeholder view.
Scenarios
S1: If it works on my machine, it works in production
A data scientist downloads a notebook from a public GitHub repo on the web. The notebook contains about 10 cells using standard Python ML and data-cleaning libraries, but, following common practice, the imports are spread across different cells and expect the libraries to already be installed in the environment. She is able to run the notebook locally because she already has these libraries installed, but it fails in production due to missing dependencies.
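The failure mode is the first unsatisfied import aborting the run. A minimal sketch, where `missing_ml_library` is a hypothetical stand-in for any dependency present on her machine but absent in production:

```python
# In a clean production environment, the first unsatisfied import raises
# ModuleNotFoundError and the whole pipeline run stops there.
try:
    import missing_ml_library  # hypothetical name; stands in for any missing dependency
except ModuleNotFoundError as err:
    print(f"Notebook fails in production: no module named {err.name!r}")
```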
She tries Project SAME to fix this. She runs `same program run` at the command line, and the CLI halts and shows a warning describing the problem with each `import <some library>` in the notebook. The CLI also outputs a line of code that uses `same.import(...)` to import the libraries. She then replaces all of the `import <some library>` lines with the single snippet of code that was recommended.

Now that she's using `same.import(...)`, she can run `same program run` at the command line and the CLI will generate/update a `conda.yaml` that captures the set of imports and the required Python version. This is much easier because she can add additional references in `same.import(...)` and the `conda.yaml` will be maintained for her automatically. This works well because `conda.yaml` is a standard way of managing dependencies and it works natively with development environments like VSCode and Jupyter. Note: we are using the `conda.yaml` format, which is independent of the conda virtual environment provider.

Now she can also use SAME to run the notebook against a production system like Airflow, Azure Machine Learning, or Kubeflow, and SAME will automatically containerize her steps. She does not have to manually update a list of libraries or write a Dockerfile because SAME keeps the dependencies up to date for her.
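`same.import(...)` does not exist yet, but the mechanics this scenario implies (dynamic import plus dependency recording, so a `conda.yaml` can be kept in sync) can be sketched in plain Python. Every name below is hypothetical, not the real SDK:

```python
import importlib
import sys

def same_import(*names):
    """Hypothetical sketch of same.import(...): import each named library
    dynamically and record it so a conda.yaml can be regenerated later."""
    recorded = []
    for name in names:
        importlib.import_module(name)  # fails loudly if the library is absent
        recorded.append(name)
    return recorded

def render_conda_yaml(env_name, packages):
    """Render a minimal conda.yaml-style document from the recorded imports,
    pinning the Python version that ran the notebook."""
    python_pin = f"python={sys.version_info.major}.{sys.version_info.minor}"
    lines = [f"name: {env_name}", "dependencies:", f"  - {python_pin}"]
    lines.extend(f"  - {pkg}" for pkg in packages)
    return "\n".join(lines) + "\n"

# Using stdlib modules so the sketch runs anywhere:
deps = same_import("json", "csv")
print(render_conda_yaml("notebook-env", deps))
```

A real implementation would also need to map import names to package names (e.g. `sklearn` vs. `scikit-learn`), which is one reason the linting/adoption story in Milestone 2 matters.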
S2: I can easily parameterize data-sets using SAME
A data scientist downloads a notebook from a public GitHub repo on the web. This notebook uses a `.csv` file checked in to the repository as a data source, loading it with pandas. When she tries to run the notebook locally, it fails because the hardcoded path used to load the `.csv` file doesn't match her local setup. She could edit this path to point to the location of her data file, but she worries that the code will get messy when she needs to run this in production against a file in an S3 bucket.

She decides to use the SAME SDK to move the location of the data into configuration. She creates a `same.yaml` for the notebook and adds a named dataset (`bird_data`) to the configuration. For the local environment she configures the dataset to use the relative path `./bird_data.csv`. Then she replaces the hardcoded path in the notebook with a call to `same.data_set('bird_data')`. This is an easy change to make because she can use the result of this call with pandas, or any other standard Python data-science library. When she runs the notebook locally with `same program run`, the CLI can parse the `same.yaml` and check that the file is present; this would help if someone else were trying to run her notebook on another computer later.

Next she wants to do a dry run against Kubeflow using her test data to make sure the notebook will work there. To do this she adds a new environment to her `same.yaml` that describes how to connect to Kubeflow, and configures it to use the same local file. When she deploys the notebook to a production system with `same program run`, the local file (`bird_data.csv`) will be included in the container.

Once she has done a dry run against Kubeflow with her test data, she wants to run against the production data stored in an S3 bucket and protected by an access key. To make this change she updates her `same.yaml` for the Kubeflow environment to refer to the S3 bucket. She's able to pass the access key in to `same program run`
as a command-line parameter to avoid checking in a secret value. Using SAME to load the dataset and to parameterize it for development and production has been easy because she only had to make one code change to get started.

Open Questions:

- Should `same program run` rewrite the imports for you? It might be tedious to replace all of the `import <library>` statements yourself. Can we do better?
- Will `same.import(...)` work gracefully with editor tooling? You will likely lose completion inside `same.import` that you would have in other contexts.
- Should we generate a starter `same.yaml`? I assume we should.

Roadmap
Items are listed in rough priority buckets. The team's execution is milestone-based, with milestones being roughly one month long in the normal case. We reserve the right to make an individual milestone differ from this schedule when necessary or convenient.
Milestone 1: Parity
Goal: Bootstrap project repo (this repo), docs site, and infrastructure for build.
Goal: Reach partial parity with the codebase at https://github.com/azure-octo/same-cli/ (but use Python this time)

Items:

- Repo setup
- Parity
- Convert an `.ipynb` file to plain Python

Milestone 2: SAME import scenario
Goal: Build out the first e2e scenario that relies on `same.import(...)`

- `same.import(...)` needs to work without significant drawbacks for the inner loop
- `same.import(...)` linting and adoption needs to be smooth

Items:

- Integrate `conda.yaml` into our containerization support
- Do we expect `same program run` to update `conda.yaml`? What triggers an update?
- What works with `same.import(...)`? Local Python modules? Python built-in packages?

Milestone 3: SAME data_set scenario
Goal: Build out the second e2e scenario that relies on `same.data_set(...)`

- `same.data_set(...)` should simplify parameterization of data sets, including remote and authenticated sources

Items:

- `SAME.yaml`
- `same.data_set(...)`
- `import <library>` as part of `same program run`
- `!pip ...` as part of `same program run`
- Generate `conda.yaml` from `same.import(...)`
- Integrate `conda.yaml` into our containerization support
- Do we expect `same program run` to update `conda.yaml`? What triggers an update?
- What works with `same.import(...)`? Local Python modules? Python built-in packages?
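Like `same.import(...)`, none of the `same.data_set(...)` machinery exists yet. A minimal sketch of the resolution logic the S2 scenario implies, with a plain dict standing in for the parsed `same.yaml` (all names, keys, and paths below are hypothetical):

```python
# Hypothetical parsed same.yaml: one named dataset, per-environment locations.
SAME_CONFIG = {
    "datasets": {
        "bird_data": {
            "local": {"path": "./bird_data.csv"},
            "kubeflow": {"path": "s3://prod-bucket/bird_data.csv"},
        }
    }
}

def data_set(name, environment="local", config=SAME_CONFIG):
    """Hypothetical same.data_set(...): resolve a named dataset to a concrete
    path for the active environment, so notebook code never hardcodes paths."""
    try:
        return config["datasets"][name][environment]["path"]
    except KeyError:
        raise KeyError(f"dataset {name!r} is not configured for {environment!r}")

print(data_set("bird_data"))              # local dry-run path
print(data_set("bird_data", "kubeflow"))  # production S3 location
```

The result is a plain string the notebook can hand directly to pandas (e.g. `pd.read_csv(data_set('bird_data'))`); the S3 access key itself would be supplied at run time, such as on the `same program run` command line, rather than stored in `same.yaml`.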