okennedy closed 1 year ago
Another slightly simpler version would be to create a Vizier-wide pyenv instance.
Does pip have some sort of local packages directory?
This project: https://github.com/pathbird/poetry-kernel ... is using this tool: https://python-poetry.org/docs/#installation ... to do something similar
Also see #13
Possibly related: https://man.sr.ht/~cnx/ipwhl/
Looks like a tool for registering Python configurations with IPFS. Probably overkill here, but there might be some ideas there to learn from.
Blocked on #164
Starting to work out the infrastructure for this...
Flagging @mrb24 @lordpretzel for discussion
Pyenv provides an easy way to switch between different Python versions. We definitely want to lock in a specific Python version, but at the same time, the big problem that this issue is meant to solve is dependency management. For that, it might be easier to use Python's venv library to create a directory, and create virtual environments in there.
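A minimal sketch of that idea, using only the standard-library venv module. The function name create_environment and the root/name layout are hypothetical, not anything Vizier defines:

```python
import venv
from pathlib import Path


def create_environment(root: Path, name: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at root/name and return its path.

    `root` and `name` are illustrative; Vizier's real layout may differ.
    """
    root.mkdir(parents=True, exist_ok=True)
    env_dir = root / name
    # with_pip=True bootstraps pip into the environment via ensurepip;
    # clear=False leaves an already existing directory's contents in place.
    builder = venv.EnvBuilder(with_pip=with_pip, clear=False)
    builder.create(env_dir)
    return env_dir
```

The environment is built from whichever interpreter runs this code, so locking in a specific Python version would still have to happen outside this function.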
This is another consideration: pyenv apparently defaults to using statically linked Python libraries, which breaks compatibility with ScalaPy. There's a workaround here, although the authors note that pyenv isn't really meant for server use. It might be easier to just rely on creating virtual environments using the available system binaries?
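Whether a given interpreter was built with a shared libpython can be checked from Python itself. This is a rough sketch: the helper name is made up, and the build variable it reads is only a hint (it may be unset on some platforms, such as Windows), so treat a False result as "check further" rather than a definitive answer:

```python
import sysconfig


def has_shared_libpython() -> bool:
    """Best-effort check for a shared-library build of the running interpreter."""
    # Py_ENABLE_SHARED is 1 for builds configured with --enable-shared and
    # 0 for static builds; on platforms where it is unset we get None.
    return bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))


if __name__ == "__main__":
    print("shared libpython:", has_shared_libpython())
    # LDLIBRARY names the library file the interpreter was linked against.
    print("library:", sysconfig.get_config_var("LDLIBRARY"))
```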
venv stores host-specific metadata (e.g., a symbolic link to the python executable). This means we probably shouldn't try to store the virtual environments in the vizier.db folder, which in principle (modulo user-provided paths in, e.g., load/unload dataset cells) should be host-agnostic. Two alternative locations:
vizier.db/../.vizier-cache/python <- specific project
~/.vizier-cache/python <- user-wide
I think tying it to the specific project is better. This prevents naming collisions across projects.
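The two candidate locations can be resolved like this. The function names are hypothetical; only the directory names come from the paths listed above:

```python
from pathlib import Path

CACHE_DIR_NAME = ".vizier-cache"


def project_env_root(vizier_db: Path) -> Path:
    """Per-project location: a sibling of the vizier.db folder."""
    return vizier_db.parent / CACHE_DIR_NAME / "python"


def user_env_root() -> Path:
    """User-wide location under the home directory."""
    return Path.home() / CACHE_DIR_NAME / "python"
```

With the per-project variant, two projects can each have an environment with the same name, because each resolves to a different root.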
A deeper question is what happens if the virtual environment isn't initialized properly or doesn't exist. The obvious idea is to store metadata about the venvs in the catalog:
This allows us to re-create the virtual environment if necessary:
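One hypothetical shape for that metadata and the re-creation step, as a minimal sketch. EnvironmentRecord, its fields, and ensure_environment are assumptions made for illustration and are not Vizier's actual catalog schema:

```python
import subprocess
import sys
import venv
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class EnvironmentRecord:
    """Hypothetical catalog entry describing one virtual environment."""
    name: str
    python_version: str  # recorded for reference; this sketch does not enforce it
    packages: List[str] = field(default_factory=list)  # pip requirement strings


def env_python(env_dir: Path) -> Path:
    """Interpreter path inside a venv (the layout differs on Windows)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def ensure_environment(record: EnvironmentRecord, root: Path) -> Path:
    """Return the environment's directory, rebuilding it from the record if needed."""
    env_dir = root / record.name
    # venv writes pyvenv.cfg on creation, so its absence is treated here
    # as "missing or not initialized properly".
    if not (env_dir / "pyvenv.cfg").exists():
        # clear=True wipes any partial contents before rebuilding.
        venv.EnvBuilder(with_pip=bool(record.packages), clear=True).create(env_dir)
        if record.packages:
            subprocess.run(
                [str(env_python(env_dir)), "-m", "pip", "install", *record.packages],
                check=True,
            )
    return env_dir
```

Because only the record lives in the catalog, the host-specific environment directory can be deleted or moved between machines and rebuilt on demand.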
On that note, there's a question of how forceful we want to be with virtual environments. If we want Vizier to support reproducibility, we really want everything to be handled through them, but on the other hand this leads to more steps to get Python cells working (which would create frustration for users). A few ideas:
Use system as the default virtual environment, but put some sort of prominent warning on Python cells that use this environment, noting that they may not be reproducible.
PySpark is ginormous (a few hundred MB). It includes a full Spark distro, and we don't want to pollute the user's HDD with multiple copies of this dependency. That said, there's not a ton in PySpark that we really need, since the core functionality already exists in Vizier. One idea would be to use ScalaPy to invoke Spark via Scala objects. I think the only other place that ScalaPy shows up is in generating cloudpickles of Python functions for use as Spark UDFs, but we could probably rip the cloudpickle dependency out.
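For the idea above of a system default environment that carries a warning, a rough sketch of the check might look like the following. The environment name "system" comes from the discussion; the function names and the warning text are made up for illustration:

```python
import sys
from typing import Optional

SYSTEM_ENV = "system"


def cell_warning(env_name: str) -> Optional[str]:
    """Return warning text for cells that run in the system environment."""
    if env_name == SYSTEM_ENV:
        return (
            "This cell uses the system Python environment; "
            "its results may not be reproducible on another machine."
        )
    return None


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix
```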
Another point of discussion last week was that changes to pyenv could break existing workflows that rely on it. Concretely
The point is moot if the environment being modified is unused (and we can track this). However, we need to be able to handle situations where someone wants to modify an environment without going through the trouble of duplicating it, etc... After a lengthy discussion with @lordpretzel and @mrb24 :
A few ways to pull this off, but the simplest would be to have some sort of settings pane in the UI that the user can toggle to spin up a pyenv in the vizier.db folder. From here, each notebook could have an associated set of python packages.