aicoe-aiops / project-template

This is a template to use for new data science projects in the AIOps group.

Is it possible to have template notebooks for reusable logic? #25

Open pacospace opened 3 years ago

pacospace commented 3 years ago

Related-To: https://github.com/AICoE/idh-manifests/issues/9

MichaelClifford commented 3 years ago

@pacospace :+1: What types of reusable logic should we include in this notebook(s)?

Let's make a comprehensive list here, and we can start to add what we want to the template.

We did start this repo with example notebooks a while back, but it hasn't seen much use; it's probably better to shift these "example" notebooks into the template, as you've suggested.

pacospace commented 3 years ago

> @pacospace What types of reusable logic should we include in this notebook(s)?

  • Interacting with Ceph
  • Connecting to Spark
  • Connecting to Prometheus/ Thanos
  • Using GPU's on JupyterHub
  • Plotting style and best practices
  • Managing environments on JupyterHub
  • ?
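For the Prometheus/Thanos item, for example, a template notebook could wrap the instant-query HTTP API. The sketch below only builds the query URL using the standard library; the Thanos endpoint and metric are placeholders, and actually issuing the request is left to the notebook:

```python
from urllib.parse import urlencode

def build_prometheus_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    # /api/v1/query is the stable Prometheus instant-query endpoint;
    # urlencode takes care of escaping the PromQL expression.
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# Placeholder endpoint; in a real template notebook this would come from
# an environment variable or a shared connection helper.
url = build_prometheus_query_url(
    "https://thanos.example.com", 'up{job="node-exporter"}'
)
print(url)
```

A notebook cell would then fetch `url` with `requests` (or `urllib.request`) and unpack the JSON response.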

> Let's make a comprehensive list here, and we can start to add what we want to the template.
>
> We did start this repo with example notebooks a while back, but it hasn't seen much use; it's probably better to shift these "example" notebooks into the template as you've suggested.

What about a naming convention for notebooks, e.g. `{MLstep}-{distributed-or-not}-{Hardware}-{version}`? One concern I have is dependencies, because the software stack might become quite large. In an ML project, different steps will have different requirements, and you might want to keep them separate, because some steps may require different hardware or technology to run on. WDYT?

durandom commented 3 years ago

Naming conventions or annotations are great. Can we get those template notebooks published on https://github.com/operate-first/operate-first.github.io please? Let's start with those that we have in the template repo.

And may I suggest creating new issues in the template repo for the missing templates?

MichaelClifford commented 3 years ago

> what about the naming convention for notebooks? {MLstep}-{distributed-or-not}-{Hardware}-{version}.

@pacospace I think defined naming conventions are great. But can this be enforced by GitHub in any way, or would it just exist through our own example notebooks following this convention?

Also, what do you mean by the {Hardware} label: GPU or CPU? And would {version} be the current version of the notebook, or the version of some CUDA driver the notebook needs?

> One of the concerns I have is dependencies because it might become quite a large software stack. As in an ML project, different steps will have different requirements and you might want to keep them separate because maybe some step will require different hardware or technology to run on.

Can you clarify the point above? Are you suggesting we keep the notebooks separate? If so, in what way: in separate repos, or in separate directories with different Pipfiles?
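One way the convention could be enforced on GitHub is a CI check (e.g. a GitHub Actions or pre-commit step) that validates notebook filenames against a regex. The pattern below is a hypothetical sketch of the `{MLstep}-{distributed-or-not}-{Hardware}-{version}` scheme; the allowed ML steps and hardware labels are assumptions for illustration, not an agreed-upon list:

```python
import re

# Hypothetical regex for the proposed
# {MLstep}-{distributed-or-not}-{Hardware}-{version} convention.
# The allowed step names and hardware labels are illustrative assumptions.
NOTEBOOK_NAME = re.compile(
    r"^(?P<mlstep>eda|feature-engineering|training|inference)"
    r"-(?P<distributed>distributed|local)"
    r"-(?P<hardware>cpu|gpu)"
    r"-(?P<version>\d+\.\d+\.\d+)\.ipynb$"
)

def check_notebook_name(filename: str) -> bool:
    """Return True if a notebook filename follows the naming convention."""
    return NOTEBOOK_NAME.match(filename) is not None

print(check_notebook_name("training-distributed-gpu-0.1.0.ipynb"))  # True
print(check_notebook_name("my_notebook.ipynb"))                     # False
```

A CI job could run this over `git ls-files '*.ipynb'` and fail the build on any non-conforming name.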

pacospace commented 3 years ago

> what about the naming convention for notebooks? {MLstep}-{distributed-or-not}-{Hardware}-{version}.
>
> @pacospace I think defined naming conventions are great. But can this be enforced by GitHub in any way, or would it just exist through our own example notebooks following this convention?

My thoughts were related to the different images that would be created. Imagine the different steps in an AI pipeline: they would each require a different image, so the idea could be to have, inside the notebooks repo, a separate context directory per ML context (EDA, etc., as already in https://github.com/aicoe-aiops/data-science-workflow-examples/tree/master/notebooks; we can find more classes and subclasses to store notebooks). Each notebook would then essentially produce an image that you can use as one step in your AI pipeline (thinking about Elyra on ODH). Each context directory would have its own Pipfile and Pipfile.lock, so that small services are created. We cannot have a single Pipfile and Pipfile.lock covering all notebook requirements, and it is also not good to have an image with a million dependencies. @goern @harshad16 tagged, as I think this is what is done for the different base notebook images for ODH: https://github.com/thoth-station/jupyter-notebooks
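The per-step layout described above might look something like this; the directory and step names are illustrative, loosely following the classes already in data-science-workflow-examples:

```
notebooks/
├── eda/
│   ├── eda.ipynb
│   ├── Pipfile
│   └── Pipfile.lock
├── training/
│   ├── training.ipynb
│   ├── Pipfile
│   └── Pipfile.lock
└── inference/
    ├── inference.ipynb
    ├── Pipfile
    └── Pipfile.lock
```

Each context directory then serves as the build context for its own s2i image, keeping every pipeline step's dependencies isolated.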

Moreover, it would be possible to use the nbrequirements extension (https://github.com/thoth-station/jupyter-nbrequirements) for each notebook to manage requirements and store the Pipfile/Pipfile.lock in the context directory where the user is using the notebook.

> Also what do you mean by the {Hardware} label, like GPU or CPU? And would {version} be the current version of the notebook or like the version of some CUDA driver the notebook needs?

A version for the notebook is actually not required if we create tags out of each context directory; a separate file stating the version would be contained in the context directory. For hardware, I mean CPU or GPU, if something specific needs to be stated to use particular hardware. All CUDA requirements would be handled by Thoth logic, and you can actually state that in the .thoth.yaml file.

> One of the concerns I have is dependencies because it might become quite a large software stack. As in an ML project, different steps will have different requirements and you might want to keep them separate because maybe some step will require different hardware or technology to run on.
>
> Can you clarify this point above? Are you suggesting we keep the notebooks separate? If so, in what way? Do you mean in separate repos? or in separate directories with different pip files?

We can use context directories handled by s2i builds, as in https://github.com/thoth-station/jupyter-notebooks for example. (This would also work well with Elyra when selecting an image for a step in the AI pipeline.)

@sophwats @vpavlin @nakfour @goern @harshad16