It's a nice idea, but what about other cloud providers? For example, the company I'm at uses Azure.
Will this be specific to AWS?
Doesn't BinderHub already solve this?
Another similar product you may want to check for inspiration is Google's Colaboratory. I'm not sure what options they have in terms of instance types or provisioning (though I just checked that you can run `!pip install` to install pip packages; see the example below). Naturally, their integration with Google Drive appears to be pretty baked in, which could be a pro or a con depending on your organization!
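For reference, a runtime install in a Colab (or any Jupyter) cell looks like the sketch below; the package name is just an illustration, not something the proposal depends on.

```python
# Run inside a Colab/Jupyter notebook cell: the leading "!" shells out
# to pip in the container backing the notebook session.
!pip install pandas

# The installed package is then importable in the same session.
import pandas as pd
print(pd.__version__)
```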
Google Colab integrates with GitHub (as well as Drive), and that's how it's used on tensorflow.org. For example, this notebook lives in GitHub; you pass the GitHub path as part of the URL directly to Colab: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/basic_classification.ipynb You must log in to run the notebook, and your account basically gets a container where you can install packages, etc.
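Generalizing from that example, the Colab URL appears to simply mirror the GitHub blob path (the placeholder segments below are inferred from the single example above):

```
https://colab.research.google.com/github/<owner>/<repo>/blob/<branch>/<path/to/notebook>.ipynb
```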
@thatneat @lamberta Thanks for the note about Google Colab. From what I've heard, there are a couple of limitations:
Azure Notebooks may be worth considering; leveraging Azure DevOps for build pipelines is also really nice. I understand you're planning to look at other cloud providers later, but IMO Azure might be able to get you to an MVP faster.
Problem
Reproducibility is a sore problem in data science. Simply put: given an experiment/analysis, how can someone else quickly rerun it? Reruns can be for verification, to inspect intermediate states, to modify some parameters, or simply to rerun the analysis with up-to-date data. Reruns should be easy, right? Not quite. Here are the challenges:
One needs to set up the exact same environment again. This includes dependent packages, Python versions, environment variables, data files, etc. Notebooks don't capture environment information anywhere.
Resource-intensive, long-running scripts require powerful machines in the cloud (e.g. GPU). It's time-consuming to manually set up the environment on these machines every time you want to run something. Often data scientists have to coordinate with DevOps folks for help with infra, resulting in coordination delays and time cost.
Solution
What if users could codify environment information just once and then launch any experiment/analysis with one click?
What you do
Commit a Dockerfile or /requirements.txt in the repo (a sketch follows below).
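For concreteness, here is a minimal sketch of what that environment config could look like. Everything in it (base image, package names, versions) is illustrative, not part of the proposal itself.

```
# /requirements.txt -- pin exact versions so reruns resolve the same packages
numpy==1.16.4
pandas==0.24.2
```

```dockerfile
# Dockerfile -- pin the Python runtime and install the pinned dependencies
FROM python:3.7-slim
WORKDIR /workspace
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
```

Checking either file into the repo would give the service everything it needs to rebuild the same environment on any machine.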
What you get
Use cases
Hypothetical scenarios to give you a flavour of what's possible with this feature. Not a comprehensive list.
All of the above are possible just by clicking around the UI once the notebook repositories are set up with the environment config.
Please see the FAQ below.
Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature. I also welcome additional questions/comments/discussion on the issue.