Enable custom conda virtual environment visible in containerization-argo stage during workflow building

stanleesocca commented 1 month ago

One issue with using a custom conda environment is that it doesn't work well in containerization mode. For example, with a conda environment called lter-life-wadden and the following listing:

import sys, os
homepath = os.path.expanduser("~")
sys.path.append(homepath)

# cell-1
path_sys = sys.path

# cell-2
more_path = [path_sys, "/app/more"]
print(more_path)

I see the following system path where all libraries and binaries are located:

[['/opt/conda/envs/lter-life-wadden/lib/python3.11/lib-dynload', '', '/opt/conda/envs/lter-life-wadden/lib/python3.11/site-packages', '/home/jovyan', '/app/', '/app/'], '/app/more']

Now that works fine for interactive usage. But as soon as the cells are containerized, even after specifying the exact base image lter-life-wadden to use during containerization, the resulting list of library/binary and system paths in the Argo logs is the following:

Namespace(id='e319017', more_path='[["/app", "/venv/lib/python311.zip", "/venv/lib/python3.11", "/venv/lib/python3.11/lib-dynload", "/venv/lib/python3.11/site-packages"], "/app/more"]')
[['/app', '/venv/lib/python311.zip', '/venv/lib/python3.11', '/venv/lib/python3.11/lib-dynload', '/venv/lib/python3.11/site-packages'], '/app/more']

This looks like there are two different system search paths in place here: one for the interactive work and another for the containerized work. This is not an issue if I only want to use libraries already installed in /venv/lib/, but it is an issue if I need packages not installed there. Ideally, we would want both system paths to be the same (and, more importantly, to match the user-defined sys.path). This would make user-specific environments possible in the workflow-building pipeline.
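To check which environment a given cell actually runs in, one could print a few interpreter details in both the notebook and the containerized cell and compare them. The following is a minimal sketch using only the standard library; the function name describe_environment is hypothetical and not part of NaaVRE.

```python
import sys


def describe_environment():
    """Summarize where the running interpreter looks for packages."""
    return {
        # Root of the active environment, e.g. /opt/conda/envs/<name> or /venv
        "prefix": sys.prefix,
        # Path of the Python binary executing this cell
        "executable": sys.executable,
        # Only the sys.path entries that hold installed packages
        "site_packages": [
            p for p in sys.path if p.rstrip("/").endswith("site-packages")
        ],
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running the same cell interactively and in Argo should make the difference between the two locations visible without inspecting the whole sys.path.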

gpelouze commented 3 weeks ago

Hi @stanleesocca, can you give a concrete example of the issue you are trying to solve by manipulating sys.path? Perhaps you are trying to add modules that cannot be installed with pip?

The conda environments are indeed found at different locations when you run the notebook (/opt/conda/envs/<conda environment name>/lib/python3.11/) and in the containerized cell (/venv/lib/python3.11). This allows us to have multiple environments for notebooks, but keep a single one in containerized cells (which reduces the image size).

However, this implementation detail should not matter.

If you use packages from the environment, or even install new ones using pip or conda install, they will automatically go to the right location. In this case, you do not need to manipulate Python's sys.path or the system's $PATH.
If you manually install Python modules to a custom location (e.g. /my/python/modules/), they will be completely independent from the conda environment. In this case, you have to manage the path yourself, as you found out (e.g. with sys.path.append('/my/python/modules/')). Keep in mind that this needs to be added to each cell, because cells run as independent Python scripts in the containerized context. I suspect that is why the appended path doesn’t show up in your Argo example.

The following should work, with cell 2 printing hello, world!:

# cell 1
import os

custom_path = '/tmp/data/my/python/modules/'

os.makedirs(custom_path, exist_ok=True)

with open(os.path.join(custom_path, 'something_custom.py'), 'w') as f:
    f.write('print("hello, world!")\n')

# cell 2
import sys
sys.path.append(custom_path)

import something_custom
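Because the path has to be added in every containerized cell, a small guard can keep the append idempotent and report whether a module can be found. This is a minimal sketch; ensure_on_path and is_importable are hypothetical helpers (not part of NaaVRE), and they would themselves have to be defined in each cell that uses them.

```python
import importlib
import importlib.util
import sys


def ensure_on_path(directory):
    """Append directory to sys.path once; return True if it was added."""
    if directory in sys.path:
        return False
    sys.path.append(directory)
    return True


def is_importable(module_name):
    """Return True if a top-level module can be located on the current sys.path."""
    # Refresh the import system's caches so newly written files are picked up
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None
```

With the example above, calling ensure_on_path(custom_path) and then is_importable('something_custom') in cell 2 would show whether the import is expected to succeed before attempting it.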

Here is a general note for people stumbling upon this issue: the second approach should be avoided if possible. Try to rely on packages available on PyPI or conda-forge (these are detected during containerization and automatically added to the environment), or run pip install git+... from within the cell.

stanleesocca commented 3 weeks ago

Hi @gpelouze, thanks for the commentary on this issue. The problem I am facing has to do with the containerization of any cell that calls or depends on a Python or R module/package that is not in the official repositories (pip, CRAN, etc.).

For example, I have a set of Python functions bundled into a module/package. When I try to use these functions by importing them into a cell, this introduces dependencies in the subsequent cell. Looking at the corresponding environment.yaml in the NaaVRE parent folder, one can see the module listed as a requirement to be retrieved via conda-forge into the venv environment.

name: venv
channels: 
   - conda-forge
dependencies:
   - pip
   - python>=3.8
   - nbconvert
   - dtSat  # this is the package I need to use in another cell
   - papermill
   - ipykernel

Of course, the package/module is currently in development and not on pip or any public platform. So, I have an issue here.

How can we overcome this problem for the time being? How can one properly containerize a cell that depends on a custom package/module? I tried some hacks (e.g. cloning to /tmp/data or pip installing in a cell before computation), but they somehow fail or seem to loop indefinitely in Argo.

gpelouze commented 3 weeks ago

Is this the generated environment: https://github.com/QCDIS/NaaVRE-cells-test-2/blob/0a3c8e4544387da8958b5bdf34b52ae15204269f/cell-2-stanley-nmor-nioz-nl/environment.yaml, and is this the repo containing dtSat: https://github.com/stanleesocca/dtRemoteSensing?

If that is the case, it can already be installed with pip install git+https://github.com/stanleesocca/dtRemoteSensing. At this stage, it’s not needed to publish it to PyPI. :-)

The following should work (but there is a bug, see below):

# (do not containerize)
# install packages in the notebook environment
!pip install git+https://github.com/stanleesocca/dtRemoteSensing.git

# test-dtsat-1
date = '2024-08-19'

# test-dtsat-2
from dtSat import dtSat
dtSat.get_date(date)

How this should work:

For the notebook environment, the dtRemoteSensing package is installed manually with pip in the first cell.
For the containerized cell environment, the dtSat package is detected by the code analyzer, which adds git+https://github.com/stanleesocca/dtRemoteSensing to the environment.yaml (this is thanks to the module name mapping added on Friday). This in turn installs dtSat in the conda environment, which can be imported with no further effort.
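The detection step can be illustrated roughly as follows. This is only a sketch of the idea, not NaaVRE's actual code analyzer: the function names and the MODULE_NAME_MAPPING dictionary are assumptions made for illustration, and a real analyzer would also need to skip standard-library modules.

```python
import ast

# Hypothetical mapping from import name to pip requirement (assumed for illustration)
MODULE_NAME_MAPPING = {
    "dtSat": "git+https://github.com/stanleesocca/dtRemoteSensing",
}


def imported_modules(source):
    """Return the top-level module names imported by a cell's source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y" statements; relative imports are skipped
            names.add(node.module.split(".")[0])
    return names


def pip_requirements(source):
    """Map imported module names to pip requirements, defaulting to the module name."""
    return sorted(
        MODULE_NAME_MAPPING.get(name, name) for name in imported_modules(source)
    )


# The source of cell test-dtsat-2 is only parsed here, never executed
cell = "from dtSat import dtSat\ndtSat.get_date(date)\n"
print(pip_requirements(cell))
# prints ['git+https://github.com/stanleesocca/dtRemoteSensing']
```

The point of the mapping is that the import name (dtSat) and the installable source (the dtRemoteSensing repository) differ, so the import name alone is not enough to build the environment.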

What doesn’t work:

The installation of packages from git sources with pip does not work with most base images, because they do not contain the git package in the conda environment. Therefore, the containerization of cell test-dtsat-2 will fail with the following error:

#12 23.07 Installing pip packages: git+https://github.com/stanleesocca/dtRemoteSensing.git
#12 23.72 Collecting git+https://github.com/stanleesocca/dtRemoteSensing.git (from -r /tmp/mambafLLMJROqlZM (line 1))
#12 23.72   Cloning https://github.com/stanleesocca/dtRemoteSensing.git to ./pip-req-build-0ajqy1b_
#12 23.72   ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
#12 23.72 ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
#12 23.87 critical libmamba pip failed to install packages
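One way to confirm this diagnosis inside a given image is to check whether a git executable is on the PATH before pip tries to install from a git source. A minimal sketch using only the standard library; the helper names are hypothetical and not part of NaaVRE or pip.

```python
import shutil


def needs_git(requirement):
    """Return True if a pip requirement is fetched from a git repository."""
    return requirement.strip().startswith("git+")


def missing_git_requirements(requirements):
    """List the git-based requirements that cannot be installed because git is absent."""
    if shutil.which("git") is not None:
        # git is available, so nothing is blocked
        return []
    return [r for r in requirements if needs_git(r)]
```

In an image without git, missing_git_requirements(['git+https://github.com/stanleesocca/dtRemoteSensing.git']) would return the requirement that triggers the error above.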

I have created the following issue to resolve the problem: https://github.com/QCDIS/NaaVRE-flavors/issues/33. I will try to resolve it in the following days.

stanleesocca commented 3 weeks ago

I see what you mean. When I tried your suggestion, I got this from the containerization stage:

This happens for cell-1, where all modules are imported and the initial variable assignments occur. This is the stage I can't get past.

I feel issue https://github.com/QCDIS/NaaVRE-flavors/issues/33 will fix some of these issues (at least on the Python side; for R, we will need to think about how we can plug in CRAN as well).

gpelouze commented 3 weeks ago

Do you get the same error if you remove the comments from the first cell (ie. only have the line starting with !pip install)?

stanleesocca commented 3 weeks ago

Without the !pip install code, I was able to containerize the notebook. However, on Argo, the computation seems to be stuck in cell-2 (i.e. the heart of this example). Can you deduce why this is so? The logs are currently empty, and I can't seem to debug or understand why the computation isn't running correctly at this stage.

gpelouze commented 3 weeks ago

Yes, this is because of https://github.com/QCDIS/NaaVRE-flavors/issues/33. I'm working on a fix.

stanleesocca commented 3 weeks ago

Great. Keep me posted on that. I can cancel that run on Argo to save compute time.

gpelouze commented 3 weeks ago

It looks like the fix to https://github.com/QCDIS/NaaVRE-flavors/issues/33 does the trick.

You can try the new version in our dev deployment: https://lifewatch.lab.uvalight.net/vreapp, then select “NaaVRE dev”, “Launch my instance”, “Waddenzee proto DT” and “Start”.

In this environment, I can run the above example as a notebook (removing the comments above !pip install ...), containerize the cells, and run the workflow.

gpelouze commented 1 week ago

Stanley confirmed that it worked.

QCDIS / NaaVRE

Enable custom conda virtual environment visible in containerization-argo stage during workflow building #1556