elyra-ai / elyra

Elyra extends JupyterLab with an AI centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

Handle notebook dependencies separately for reproducibility before creating an AI pipeline #1001

Open pacospace opened 3 years ago

pacospace commented 3 years ago

As Data Scientist/Developer,

I want to use GitOps and have a Git repository where all my work is stored and can be reused. I would like to have a structured project so that everyone can immediately find all the material used in my project. Therefore I would start my project by reusing a template like [1].

Once I have my project, I can clone my repo with Elyra and start working on my ML project. I would like reproducibility and traceability of my notebooks and materials, and I want a way to handle dependencies in my notebook; therefore I could use a Jupyter extension for dependencies [2], which could also rely on AI support [3] for enhancements.

Having such an extension would allow the Data Scientist to work on the notebook and run all the experiments there before pushing everything and before creating AI pipelines.

I could use context directories for my notebooks, so that I can build images out of these context directories, each with different requirements and tags, and each becoming a step in my AI pipeline with possibly different resource requirements. Those images/contexts could also be created as template notebooks/images that can be reused and made available.

References:

lresende commented 3 years ago

Just to clarify, this would only be in the context of executing a notebook as part of an AI pipeline, where, if a requirements.txt is detected alongside the notebook, it would be used to install the dependencies listed there.
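That detection step could be sketched as follows; the function name and behavior are illustrative, not Elyra's actual implementation. If a requirements.txt sits next to the notebook, it is installed into the execution environment before the notebook runs:

```python
import subprocess
import sys
from pathlib import Path

def install_notebook_requirements(notebook_path):
    """If a requirements.txt sits next to the notebook, install it into
    the current environment before the notebook is executed."""
    req = Path(notebook_path).parent / "requirements.txt"
    if not req.is_file():
        return False  # no dependency file found; nothing to do
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", str(req)],
        check=True,
    )
    return True
```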

Supporting this on a local env would require a lot more thought. We would probably want to introduce the notion of projects to Elyra and make this a feature that requires the user to enable and configure things like which env manager to use, where the binary is located, etc...

pacospace commented 3 years ago

Just to clarify, this would only be in the context of executing a notebook as part of an AI pipeline, where, if a requirements.txt is detected alongside the notebook, it would be used to install the dependencies listed there.

In the AI pipeline, you need to select the image to be used to run the specific step in the pipeline, right?

I was thinking: if I start a notebook through Elyra that I don't yet want to run in a pipeline because it is not ready, I want a requirements.txt or Pipfile/Pipfile.lock to be created when I install something in the notebook, so that I can push to Git all the material that can be reused by someone else cloning the repo.

And if someone else restarts the notebook, they can detect the dependencies used to run that specific notebook; or each notebook would have its own dependencies, and images could be created from those notebook dependencies to be used as different steps in the AI pipeline.

Supporting this on a local env would require a lot more thought. We would probably want to introduce the notion of projects to Elyra and make this a feature that requires the user to enable and configure things like which env manager to use, where the binary is located, etc...

By projects, do you mean instantiating some standard template for an AI project?

Is this a feature that would apply per notebook? Each notebook doing different tasks might have different dependency requirements.

What about a library that can handle all env managers? [4]

[4] https://github.com/thoth-station/micropipenv

akchinSTC commented 3 years ago

Is this a feature that would apply per notebook? Each notebook doing different tasks might have different dependency requirements.

When running pipelines in Elyra in 'local' mode, the Python dependencies will be installed on the host system, so we will need some way to isolate requirements on a per-notebook basis. Ideally we would create an ephemeral Python environment per notebook node, either with pip or conda.
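A per-node ephemeral environment could be sketched with the stdlib `venv` module (the helper name and temp-dir prefix are hypothetical; a conda backend would need a different code path):

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path

def make_ephemeral_env(requirements=None):
    """Create a throwaway virtual environment for one notebook node and
    return the path to its python executable; optionally install a
    requirements file into it first."""
    env_dir = Path(tempfile.mkdtemp(prefix="elyra-node-"))
    venv.create(env_dir, with_pip=True)
    bindir = "Scripts" if sys.platform == "win32" else "bin"
    exe = "python.exe" if sys.platform == "win32" else "python"
    python = env_dir / bindir / exe
    if requirements:
        subprocess.run(
            [str(python), "-m", "pip", "install", "-r", requirements],
            check=True,
        )
    return python
```

Each pipeline node would then launch its notebook with the returned interpreter, keeping its dependencies off the host environment.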

kevin-bates commented 3 years ago

Ideally we would create an ephemeral Python environment per notebook node, either with pip or conda.

Heads up. This will break the Enterprise Gateway support added in #983, since it's predicated on the fact that the node is running within the server's process - similar to how each notebook runs outside of the pipeline context today.

akchinSTC commented 3 years ago

Ideally we would create an ephemeral Python environment per notebook node, either with pip or conda.

Heads up. This will break the Enterprise Gateway support added in #983, since it's predicated on the fact that the node is running within the server's process - similar to how each notebook runs outside of the pipeline context today.

Dang, forgot about the EG enhancements. Will need to think on this more.

kevin-bates commented 3 years ago

We could probably remove that restriction, and I'd view that as the long-range plan, but it would mean a heck of a lot more to configure besides just the gateway URL. I suppose the configuration could be extracted and made available to the new env, but just wanted to point out there's some sensitivity there.

pacospace commented 3 years ago

Is this a feature that would apply per notebook? Each notebook doing different tasks might have different dependency requirements.

When running pipelines in Elyra in 'local' mode, the Python dependencies will be installed on the host system, so we will need some way to isolate requirements on a per-notebook basis. Ideally we would create an ephemeral Python environment per notebook node, either with pip or conda.

So every notebook (or set of notebooks) should be shipped with a requirements.txt or Pipfile/Pipfile.lock so anyone can install the same environment and rerun the notebook(s).

Having in a notebook cell:

```python
! pip install tensorflow
! pip install boto3
! pip install matplotlib
```

does not guarantee reproducibility.
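One simple way to get something reproducible to commit is to snapshot the exact pinned versions with `pip freeze` (the helper name here is illustrative):

```python
import subprocess
import sys

def freeze_environment(outfile="requirements.txt"):
    """Write the exact, pinned versions of every installed package so the
    notebook can be rerun later in an identical environment."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        check=True, capture_output=True, text=True,
    ).stdout
    with open(outfile, "w") as f:
        f.write(frozen)
    return outfile
```

Unlike bare `! pip install tensorflow` cells, the resulting file pins every transitive dependency to a specific version.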

Could we create a virtualenv from the dependency files, so that the notebook's kernel uses those specific requirements?

Or use a Jupyter notebook extension that manages dependencies, creates a requirements file from the notebook, and uses it for the kernel:


These requirements would be created by the Jupyter nbrequirements extension: [screenshot omitted]

If you restart the notebook, you can use the detect button to find the required dependencies, or use notebook magic commands like %requirements install, ensure, clear, etc.: [screenshot omitted]

akchinSTC commented 3 years ago

@pacospace - Would you be able to attend our weekly dev meeting tomorrow 10/29 @ 9am PDT? It would be great to hash out all the details of this issue in a conf call. If not, could you let us know what timezone you are in and we can get a group chat going in our gitter channel?

Weekly dev meeting details: https://hackmd.io/SgvSqrWWR2248mCw2BZ5gg?both
Gitter channel: https://gitter.im/elyra-ai/community

pacospace commented 3 years ago

@pacospace - Would you be able to attend our weekly dev meeting tomorrow 10/29 @ 9am PDT? It would be great to hash out all the details of this issue in a conf call. If not, could you let us know what timezone you are in and we can get a group chat going in our gitter channel?

Weekly dev meeting details: https://hackmd.io/SgvSqrWWR2248mCw2BZ5gg?both
Gitter channel: https://gitter.im/elyra-ai/community

@akchinSTC I'm in CET and the meeting time should be feasible! Very happy to join!

lresende commented 3 years ago

So, assuming we have nbrequirements on the pipeline execution runtime where the notebook is running (https://github.com/elyra-ai/kfp-notebook/pull/62) and that the user, at least for now, includes files like requirements.txt or Pipfile/Pipfile.lock, are we OK? We could subsequently add support for including these extra files semi-automatically.
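Gathering those files alongside the notebook could be as simple as the following sketch (the constant and helper are hypothetical, with file names taken from this discussion):

```python
from pathlib import Path

# Dependency files that, if present next to a notebook, should be
# shipped with it as pipeline node inputs.
DEPENDENCY_FILES = ("requirements.txt", "Pipfile", "Pipfile.lock")

def collect_dependency_files(notebook_path):
    """Return the dependency files found in the notebook's directory."""
    parent = Path(notebook_path).parent
    found = []
    for name in DEPENDENCY_FILES:
        candidate = parent / name
        if candidate.is_file():
            found.append(candidate)
    return found
```

The returned paths could then be added to the node's file dependencies so the runtime image can install from them before execution.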

pacospace commented 3 years ago

@akchinSTC @lresende https://www.youtube.com/watch?v=HUp2JARu6fw here is the video/demo. Thanks for the nice chat!

akchinSTC commented 3 years ago

@akchinSTC @lresende https://www.youtube.com/watch?v=HUp2JARu6fw here is the video/demo. Thanks for the nice chat!

Thank you for joining us and giving that demo on such short notice. It definitely helped clear up a few things. I think we now have some talking points to work from based on the chat.

pacospace commented 3 years ago

@akchinSTC @lresende https://www.youtube.com/watch?v=HUp2JARu6fw here is the video/demo. Thanks for the nice chat!

Thank you for joining us and giving that demo on such short notice. It definitely helped clear up a few things. I think we now have some talking points to work from based on the chat.

Thanks @akchinSTC

  • nbrequirements extension Integration with JupyterLab

We started working on that!

  • prototype a pipeline with s2i, nbrequirements/pip requirements and elyra

As part of our roadmap in Project Thoth: https://github.com/thoth-station/core/blob/master/docs/ROADMAP.md#pipelines-for-reproducible-builds. @harshad16 can show you what we have and use already. Let us know when you want to have another demo :)

pacospace commented 3 years ago

https://github.com/thoth-station/core/blob/master/docs/ROADMAP.md#jupyter-requirements-management

pacospace commented 3 years ago

@akchinSTC @lresende https://www.youtube.com/watch?v=IBzTOP4TCdA

akchinSTC commented 3 years ago

@pacospace - thanks paco, checking it out now!

pacospace commented 3 years ago

@pacospace - thanks paco, checking it out now!

We also created a public channel to interact with the Thoth team: https://github.com/thoth-station/core#interact-with-thoth-team

pacospace commented 3 years ago

[screenshot omitted]

@akchinSTC

pacospace commented 3 years ago

@akchinSTC @lresende https://www.youtube.com/watch?v=-_dtDAAyMlU&t=190s release of v0.3.6

akchinSTC commented 3 years ago

@pacospace - thanks for all the input and work, and most importantly the patience around this, Francesco! I've been wrapping up work around the new Airflow feature and will be catching up with this issue today.

pacospace commented 3 years ago

@pacospace - thanks for all the input and work, and most importantly the patience around this, Francesco! I've been wrapping up work around the new Airflow feature and will be catching up with this issue today.

No problem @akchinSTC :) Whenever you want, we can talk about the jupyterlab-requirements [5] extension for JupyterLab. Happy to join a call, show you live what it does, and discuss it.

akchinSTC commented 3 years ago

@pacospace, I'm trying out the lab requirements extension as we speak. Nice! Avoiding the subject of local execution for a minute: @lresende, I think we will need to formalize our file and directory structure when using Elyra pipelines, given that the overlays directory is generated in the root directory and needs to be scoped properly. https://github.com/aicoe-aiops/project-template is a really good start, but I think we would want to make the default template a little more generic? Or maybe provide the user with a list of template structures to choose from when starting.

pacospace commented 3 years ago

Since v0.10.0 we have introduced %horus magic commands to handle dependencies directly from notebook cells. Check out the guide here or the video here. cc @akchinSTC