It's a nice idea, but what about other cloud providers? For example, the company I'm at uses Azure.
Will this be specific to AWS?
Doesn't BinderHub already solve this?
Another similar product you may want to check for inspiration is Google's Colaboratory. I'm not sure what options they have in terms of instance types or provisioning (though I just checked that you can run `!pip install` to install pip packages; see the example below). Naturally, their integration with Google Drive appears to be pretty baked in, which could be a pro or a con depending on your organization!
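For reference, a runtime install in a Colab (or any Jupyter) cell looks like the sketch below; the package name is just an illustration, not something the proposal depends on.

```python
# Run inside a Colab/Jupyter notebook cell: the leading "!" shells out
# to pip in the container backing the notebook session.
!pip install pandas

# The installed package is then importable in the same session.
import pandas as pd
print(pd.__version__)
```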
Google Colab integrates with GitHub (as well as Drive), and that's how it's used on tensorflow.org. For example, this notebook lives in GitHub; you pass the GitHub path as part of the URL directly to Colab: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/basic_classification.ipynb You must log in to run the notebook, and your account basically gets a container where you can install packages, etc.
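Generalizing from that example, the Colab URL appears to simply mirror the GitHub blob path (the placeholder segments below are inferred from the single example above):

```
https://colab.research.google.com/github/<owner>/<repo>/blob/<branch>/<path/to/notebook>.ipynb
```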
@thatneat @lamberta Thanks for the note about Google Colab. From what I've heard, there are a couple of limitations:
Azure Notebooks may be worth considering; leveraging Azure DevOps for build pipelines is also really nice. I understand you're planning to look at other cloud providers later, but IMO Azure might be able to get you to an MVP faster.
Problem
Reproducibility is a sore problem in data science. Simply put: given an experiment/analysis, how can someone else quickly rerun it? Reruns can be for verification, to inspect intermediate states, to modify some parameters, or simply to rerun the analysis with up-to-date data. Reruns should be easy, right? Not quite. Here are the challenges:
One needs to set up the exact same environment again. This includes dependent packages, Python versions, environment variables, data files, etc. Notebooks don't capture environment information anywhere.
Resource-intensive, long-running scripts require powerful machines in the cloud (e.g. GPU). It's time-consuming to manually set up the environment on these machines every time you want to run something. Often data scientists have to coordinate with DevOps folks for help with infra, resulting in coordination delays and time cost.
Solution
What if users could codify environment information just once and then launch any experiment/analysis with one click?
What you do
Commit a Dockerfile or /requirements.txt in the repo (a sketch follows below).
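For concreteness, here is a minimal sketch of what that environment config could look like. Everything in it (base image, package names, versions) is illustrative, not part of the proposal itself.

```
# /requirements.txt -- pin exact versions so reruns resolve the same packages
numpy==1.16.4
pandas==0.24.2
```

```dockerfile
# Dockerfile -- pin the Python runtime and install the pinned dependencies
FROM python:3.7-slim
WORKDIR /workspace
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
```

Checking either file into the repo would give the service everything it needs to rebuild the same environment on any machine.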
What you get
Use cases
Hypothetical scenarios to give you a flavour of what's possible with this feature. Not a comprehensive list.
All of the above are possible just by clicking around the UI once the notebook repositories are set up with the environment config.
Please see the FAQ below.
Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature. I also welcome additional questions/comments/discussion on the issue.