VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

Optionally create a pyenv or similar associated with a vizierdb #58

Closed okennedy closed 1 year ago

okennedy commented 3 years ago

A few ways to pull this off, but the simplest would be to have some sort of settings pane in the UI that the user can toggle to spin up a pyenv in the vizier.db folder. From here, each notebook could have an associated set of python packages.

okennedy commented 3 years ago

Another, slightly simpler version would be to create a Vizier-wide pyenv instance.

okennedy commented 3 years ago

Does pip have some sort of local packages directory?

okennedy commented 3 years ago

It does: https://docs.python.org/3/library/venv.html

okennedy commented 2 years ago

This project: https://github.com/pathbird/poetry-kernel ... is using this tool: https://python-poetry.org/docs/#installation ... to do something similar

okennedy commented 2 years ago

Also see #13

okennedy commented 2 years ago

Possibly related: https://man.sr.ht/~cnx/ipwhl/

Looks like a tool for registering Python configurations with IPFS. Probably overkill here, but there might be some ideas to learn from.

okennedy commented 2 years ago

Blocked on #164

okennedy commented 2 years ago

Starting to work out the infrastructure for this...

Flagging @mrb24 @lordpretzel for discussion

Pyenv vs Venv

Pyenv provides an easy way to switch between different python versions. We definitely want to lock in a specific python version, but at the same time, the big problem that this issue is meant to solve is dependency management. For that, it might be easier to use python's venv library to create a directory, and create virtual environments in there.
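
As a rough illustration of the venv route, here is a minimal sketch using only the standard-library `venv` module. The function name `create_environment`, the environment name `notebook-env`, and the temporary root directory are all hypothetical; the thread does not settle where environments should live or how they are named.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(root: Path, name: str) -> Path:
    """Create a bare virtual environment at root/name and return its path."""
    env_dir = root / name
    # with_pip=False keeps this sketch fast; a real setup would likely
    # use with_pip=True so that packages can be installed into the env.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Hypothetical location: a throwaway temp directory stands in for
# wherever Vizier would eventually keep its environments.
venv_root = Path(tempfile.mkdtemp(prefix="vizier-venvs-"))
env_dir = create_environment(venv_root, "notebook-env")

# Every venv carries a pyvenv.cfg describing the interpreter it was built from.
print((env_dir / "pyvenv.cfg").read_text())
```

Note that this builds the environment from whichever interpreter runs the script, so it does not by itself pin a Python version the way pyenv would.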

Playing nice with ScalaPy (#195)

This is another consideration: pyenv apparently defaults to using statically linked python libraries, which breaks compatibility with ScalaPy. There's a workaround here, although the authors note that pyenv isn't really meant for server use. It might be easier to just rely on creating virtual environments using the available system binaries?

Import

venv stores host-specific metadata (e.g., a symbolic link to the python executable). This means we probably shouldn't try to store it in the vizier.db folder, which in principle (modulo user-provided paths in e.g., load/unload dataset cells) should be host-agnostic. This suggests to me that we shouldn't store the virtual environments in vizier.db:

A deeper question is what happens if the virtual environment isn't initialized properly or doesn't exist. The obvious idea is to store metadata about the venvs in the catalog:

  1. Which python version / distro (e.g., 3.10.4 / CPython),
  2. Which packages were installed / pinned

This allows us to re-create the virtual environment if necessary:
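
One way to picture the catalog metadata is as a small serializable record. The sketch below is only an assumption about what such a record could look like: `EnvironmentSpec`, its field names, and the package versions are hypothetical and do not come from the Vizier codebase.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EnvironmentSpec:
    """Hypothetical catalog record describing one virtual environment."""
    python_version: str                 # e.g. "3.10.4"
    distribution: str                   # e.g. "cpython"
    packages: Dict[str, str] = field(default_factory=dict)  # name -> pinned version

    def requirements(self) -> List[str]:
        """Render the pinned packages as pip requirement specifiers."""
        return [f"{name}=={version}" for name, version in sorted(self.packages.items())]

    def to_json(self) -> str:
        """Serialize the record so it can be stored in the catalog."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "EnvironmentSpec":
        """Rebuild the record from its stored form."""
        return EnvironmentSpec(**json.loads(text))


# Illustrative values only.
spec = EnvironmentSpec("3.10.4", "cpython", {"pandas": "1.4.3", "numpy": "1.23.1"})
restored = EnvironmentSpec.from_json(spec.to_json())
print("\n".join(restored.requirements()))
```

Writing the output of `requirements()` to a file and passing it to `pip install -r` inside a freshly created venv would be one way to rebuild a missing environment on a new host.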

Forcing virtual environments?

On that note, there's a question of how forceful we want to be with virtual environments. If we want Vizier to support reproducibility, we really want everything to be handled through them... but on the other hand this leads to more steps to get python cells working (which would create frustration for users). A few ideas:

PySpark

PySpark is ginormous (a few hundred MB)... it includes a full spark distro and we don't want to pollute the user's HDD with multiple copies of this dependency. That said, there's not a ton out of pyspark that we really need, since the core functionality already exists in Vizier. One idea would be to use ScalaPy to invoke spark via Scala objects... I think the only other place that ScalaPy shows up is in generating cloudpickles of Python functions for use as Spark UDFs, but we could probably rip the cloudpickle dependency out.

okennedy commented 2 years ago

Another point of discussion last week was that changes to pyenv could break existing workflows that rely on it. Concretely

The point is moot if the environment being modified is unused (and we can track this). However, we need to be able to handle situations where someone wants to modify an environment without going through the trouble of duplicating it, etc... After a lengthy discussion with @lordpretzel and @mrb24:

  1. Warn the user before they take an unsafe action on an environment in active use.
  2. Maintain versioned "snapshots" of environments. Only the most recent snapshot need be materialized, but we can retain a history of old versions still in use.
  3. When a python cell runs, record the "version" of each environment in which it runs.
  4. When a new snapshot is created, duplicate the workflow with all existing references to the stale environment flagged as INCONSISTENT.
  5. If a notebook contains INCONSISTENT cells, have the UI:
    • Offer to re-run INCONSISTENT cells
    • Offer to fix the issue (e.g., by forking the stale environment)
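
Steps 2 through 4 above can be sketched as a small in-memory model. This is a sketch under stated assumptions, not Vizier's implementation: the classes `Environment` and `Cell`, the function `flag_stale_cells`, and the `DONE` state are hypothetical names; only `INCONSISTENT` comes from the discussion.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass
class Environment:
    """Hypothetical environment with a history of versioned snapshots."""
    name: str
    snapshots: List[Dict[str, str]] = field(default_factory=list)

    @property
    def current_version(self) -> int:
        return len(self.snapshots) - 1

    def snapshot(self, packages: Dict[str, str]) -> int:
        """Record a new package set and return its version number."""
        self.snapshots.append(dict(packages))
        return self.current_version


@dataclass
class Cell:
    """Hypothetical python cell that remembers which snapshot it ran in."""
    cell_id: str
    env_name: str
    env_version: int
    state: str = "DONE"  # assumed name for a cell with up-to-date results


def flag_stale_cells(cells: List[Cell], env: Environment) -> List[Cell]:
    """Return a copy of the workflow with stale cells marked INCONSISTENT."""
    return [
        replace(cell, state="INCONSISTENT")
        if cell.env_name == env.name and cell.env_version != env.current_version
        else replace(cell)
        for cell in cells
    ]


env = Environment("analysis")
v0 = env.snapshot({"numpy": "1.23.1"})
workflow = [Cell("c1", "analysis", v0), Cell("c2", "other", 0)]

# Modifying the environment creates a new snapshot; cells that ran
# against the old one are flagged in a duplicated workflow.
env.snapshot({"numpy": "1.24.0"})
updated = flag_stale_cells(workflow, env)
print([(c.cell_id, c.state) for c in updated])
```

The original workflow is left untouched, which matches the idea of duplicating it rather than mutating cells in place.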

okennedy commented 1 year ago

#244 will close this