KhiopsML / khiops-python

The Python library of the Khiops AutoML suite
https://khiops.org
BSD 3-Clause Clear License

Vendor Khiops with pyKhiops #83

Closed popescu-v closed 1 year ago

popescu-v commented 1 year ago

Description

Starting with Khiops 10.1.0, no license will be needed to install it. In theory, this would allow us to vendor Khiops within the pyKhiops package.

Ideas / Questions

popescu-v commented 1 year ago

Further questions:

popescu-v commented 1 year ago

First idea: override install, so that the Khiops package is retrieved when pip install-ing pyKhiops. However, as authenticated access to the artifactory is needed, a more feasible way seems to be to override build_ext, which can retrieve and embed Khiops within the pyKhiops wheel. Indeed, python setup.py bdist is run in a controlled setting (the Gitlab CI/CD) where authentication to the artifactory is not an issue.

popescu-v commented 1 year ago

Second idea: as we vendor Khiops at build time, we can build the Python package wheel within the pykhiops-cicd Docker image, where we have access to Khiops Debian repos, hence we can do apt-get download khiops-core=<relevant_version>.
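A minimal sketch of the two commands this second idea implies, assuming the build runs inside an image where the Khiops Debian repository is configured. The function names, the `.deb` file name and the `vendor` directory are hypothetical; `apt-get download` and `dpkg-deb -x` are the standard Debian tools.

```python
import shlex


def download_command(version):
    """Command that fetches the khiops-core .deb without installing it."""
    return ["apt-get", "download", f"khiops-core={version}"]


def extract_command(deb_file, dest_dir):
    """Command that unpacks the file tree of a .deb into dest_dir."""
    return ["dpkg-deb", "-x", str(deb_file), str(dest_dir)]


# Hypothetical version and file name, for illustration only.
for command in (
    download_command("10.1.1"),
    extract_command("khiops-core.deb", "vendor"),
):
    print(shlex.join(command))
```

In a real build these lists would be passed to `subprocess.run(..., check=True)` so that a failed download aborts the packaging step.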

popescu-v commented 1 year ago

To sum up between these two ideas:

1/ We override build_py's run method in setup.py to:

2/ We override the py-package job of the build CICD stage to run: setup.py sdist.

The bdist is done in the py-custom-release job of the publish stage. Thus, a wheel package is generated, which contains the Khiops files (driver scripts, MODL binaries and jar files).

We need to add two entry points that have the same names as the khiops and khiops_coclustering launch scripts and that call them. And perhaps another two entry points for the MODL and MODL_Coclustering binaries. The reason is that these launch scripts and binaries are installed in the package install directory (inside the virtualenv's lib/site-packages), but they need to be available in the path (inside the virtualenv's bin directory). The same goes for the two JARs: they need to be available in the virtualenv's share directory.

These actions take place within the pyKhiops CICD Docker image, hence the Khiops repositories are available.

3/ We override install's run method in setup.py to copy the Khiops scripts, binaries and jar files packaged during build, to the installation directory. Thus, we do not need to override the install setup.py stage.
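A rough sketch of what a build_py override could look like, under the assumption that the Debian package has already been unpacked somewhere on disk. The names `VENDORED_FILES`, `copy_vendored_files`, `make_build_py` and the `bin` subdirectory are hypothetical; only `setuptools.command.build_py.build_py` and its `build_lib` attribute are real setuptools API.

```python
import shutil
from pathlib import Path

# Hypothetical list of files to embed, relative to the unpacked package root.
VENDORED_FILES = ("usr/bin/MODL", "usr/bin/MODL_Coclustering")


def copy_vendored_files(unpacked_root, target_dir, files=VENDORED_FILES):
    """Copy the Khiops binaries into the package tree being built."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for relative in files:
        source = Path(unpacked_root) / relative
        destination = target_dir / source.name
        shutil.copy2(source, destination)  # keeps the file mode (executable bit)
        copied.append(destination)
    return copied


def make_build_py(unpacked_root, package="pykhiops"):
    """Return a build_py subclass that embeds the vendored files."""
    # Imported lazily so that the helper above stays usable without setuptools.
    from setuptools.command.build_py import build_py

    class BuildPyWithKhiops(build_py):
        def run(self):
            super().run()
            copy_vendored_files(
                unpacked_root, Path(self.build_lib) / package / "bin"
            )

    return BuildPyWithKhiops
```

The returned class would be registered in setup.py with `cmdclass={"build_py": make_build_py("vendor")}`, so the wheel built from it contains the binaries next to the Python sources.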

popescu-v commented 1 year ago

Question: should we also support other Linux packaging options than dpkg/apt (e.g. yum, apk, pacman...)?

popescu-v commented 1 year ago

One important point: the Bash scripts khiops-env, khiops and khiops_coclustering assume that the Khiops package has been installed so that the binaries are available in /usr/bin and the Java libraries in /usr/share (note the absolute paths in both cases). However, this does not work with Pip-based vendored installs (usually inside virtualenvs).

Hence, the logic in these three scripts should be replicated in two Python entry points (for setuptools), so that the appropriate environment is set up, irrespective of the absolute paths where the MODL* binaries and Java JARs are located.

popescu-v commented 1 year ago

After further analysis and discussions, it might be better to replicate the logic of khiops-env into a Python entry-point which would:

Thus, the actual paths to the Khiops binaries can remain in the package's installation directory. The khiops and khiops_coclustering scripts are not needed, and neither are the two Khiops JAR libraries (as they are not required by MODL when used in batch mode, as is the case with pyKhiops).

popescu-v commented 1 year ago

The entry point is not necessary, and I don't think we need to replicate exactly the script.

This is because khiops-env is used in "dump" mode: it writes the environment variables to stdout, and the pyKhiops runner parses this output.
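To illustrate the "dump" mode, here is a small sketch of how such output could be parsed. It assumes one `NAME=value` pair per line, which is an assumption about the output format; the variable names in the sample are hypothetical.

```python
def parse_env_dump(text):
    """Turn 'NAME=value' lines into a dict, ignoring anything else."""
    environment = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        environment[name.strip()] = value.strip()
    return environment


# Hypothetical dump, for illustration only.
sample = "KHIOPS_PATH=/usr/bin/MODL\nKHIOPS_PROC_NUMBER=4\n"
print(parse_env_dump(sample))
```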

So instead of an entry point, which would set the variables and then output them, a special method, such as _initialize_environment_from_vendored_khiops could do this job. Note that the relevant environment variables are all stored in the runner.
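A possible shape for such a method, as a sketch only. The method name comes from the comment above; the class name, its attributes and the `bin` subdirectory are hypothetical and do not describe the actual pyKhiops runner.

```python
import os
from pathlib import Path


class VendoredRunnerSketch:
    """Hypothetical stand-in for the runner, keeping its settings as attributes."""

    def __init__(self, package_dir):
        self.package_dir = Path(package_dir)
        self.khiops_path = None
        self.khiops_coclustering_path = None
        self.max_cores = None

    def _initialize_environment_from_vendored_khiops(self):
        # Assumed layout: binaries shipped in <package_dir>/bin.
        bin_dir = self.package_dir / "bin"
        self.khiops_path = bin_dir / "MODL"
        self.khiops_coclustering_path = bin_dir / "MODL_Coclustering"
        self.max_cores = os.cpu_count() or 1
        missing = [
            str(path)
            for path in (self.khiops_path, self.khiops_coclustering_path)
            if not path.is_file()
        ]
        if missing:
            raise FileNotFoundError(f"Vendored binaries not found: {missing}")
```

Because the paths are derived from the package directory, nothing depends on /usr/bin or /usr/share.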

popescu-v commented 1 year ago

Yes, indeed, if there is consensus on changing the pykhiops (runner) code; and we thus can also set the appropriate path to the MODL and MODL_Coclustering binaries.

My suggestion of the entry point was assuming that the pykhiops proper code base would not be changed in any way (at least while in POC mode, pending the validation of the vendoring approach itself, including on other target operating systems).

As for the replication, IMHO the CPU and MPI-related logic and the path setting are all that needs to be replicated (either in the special runner method or in the entry point).
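The CPU and MPI-related part could be as small as the sketch below. The rule used here (go through `mpiexec` only when more than one process is requested) is an assumption for illustration, not the actual khiops-env logic; `build_launch_command` is a hypothetical name.

```python
import os


def build_launch_command(modl_path, n_procs=None, mpiexec="mpiexec"):
    """Build the argv used to start MODL, with or without MPI."""
    if n_procs is None:
        n_procs = os.cpu_count() or 1
    if n_procs <= 1:
        # Single process: run the binary directly.
        return [str(modl_path)]
    # Several processes: let mpiexec spawn them.
    return [mpiexec, "-n", str(n_procs), str(modl_path)]
```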

popescu-v commented 1 year ago

The runner is the place where these changes must go because it is its responsibility to set the execution environment. If you want to do it with an entry point as a prototype, that is OK, but that code should go into the runner in the final version.

popescu-v commented 1 year ago

Yes.

popescu-v commented 1 year ago

Concerning replicating khiops-env functionality:

Anyways, we could also build two packages:

The advantage of this is that the version of the python-khiops PIP package can exactly match the Khiops version itself, and that pykhiops can depend on specific versions (or version minimum values) of python-khiops. Thus, no hack would be necessary to determine the compatible Khiops version for a specific version of pykhiops.

popescu-v commented 1 year ago

I still would name it pykhiops-bin.

popescu-v commented 1 year ago

Ok for pykhiops-bin

popescu-v commented 1 year ago

Roadmap of a potential solution for windows:

popescu-v commented 1 year ago

Meeting with LAG: Defined the roadmap for a first version of the solution to this issue

popescu-v commented 1 year ago

Technically, for the Linux packaging story, this is what remains to be done:

popescu-v commented 1 year ago

Meeting with LAG, BG, SG, VP: Redefined again the short-term roadmap

popescu-v commented 1 year ago

IMHO, we should also check that we can do pip install inside a Conda environment, to make sure we can smoothly mix Conda packages with Pip packages.

popescu-v commented 1 year ago

On the PyKhiops side, IMHO the short-term roadmap could be:

  1. Build and deliver Conda package for vanilla (Python only) PyKhiops
  2. [Temporary step, pending the availability of a Khiops Conda package] Vendor Khiops binaries by surgery on the Debian package (as explored until now for Pip-based vendoring); the purpose of this is to test 3. below. In the longer run, the vanilla PyKhiops Conda package should depend on the Khiops Conda package
  3. Install the correct OpenMPI version through Conda, by using the existing OpenMPI Conda package. The correct version of the OpenMPI package would be, at first, extracted from the control file of the Debian package (see step 2. above). In the longer run, the correct version of the OpenMPI package would be specified by the Khiops Conda package.

popescu-v commented 1 year ago

  1. Also, the (py)Khiops Pip package can be installed inside the Conda environment. The Pip package can vendor the Khiops binaries. The right OpenMPI can be installed inside the Conda environment. The right version of OpenMPI can be stated in the installation documentation.

popescu-v commented 1 year ago

Apparently, OpenMPI can be statically built (although the default is to build it dynamically): https://www.open-mpi.org/faq/?category=building#static-build.

popescu-v commented 1 year ago

What is the impact of this?

popescu-v commented 1 year ago

I expect this to facilitate vendoring MPI on Linux as well: AFAIU, Khiops would have its own copy of statically-linked OpenMPI which could be vendored without needing a system-wide install.

popescu-v commented 1 year ago

So with this solution we could vendor it with pip as well, no?

And should the compilation of MPI be included in that of Khiops?

popescu-v commented 1 year ago

AFAIU, this is a possibility, but IMHO we would need to change the Khiops release (and compilation) process itself:

  1. Compile OpenMPI as a static library
  2. Compile Khiops and statically link it to the compiled OpenMPI library
  3. Thus the Khiops binary would be bigger, but it would be self-contained as far as MPI support is concerned.

However, we should check if there is any obstacle to statically linking Khiops to OpenMPI (some code changes would be needed AFAIK, as DLL calling is more involved than statically-linked library calling which is seamless).

popescu-v commented 1 year ago

Note that statically compiling MPI has drawbacks and is discouraged: https://docs.open-mpi.org/en/v5.0.x/building-apps/building-static-apps.html

Notably, it would make the application larger, and so would be each spawned process (but maybe this is not an issue).

popescu-v commented 1 year ago

Yep. I am also considering keeping the dynamic link to libmpich and, at build time:

popescu-v commented 1 year ago

Using conda, we can have:

0/ Activate conda environment: conda activate <environment_name>

1/ conda install mpich: this installs ~/miniconda3/bin/mpiexec and ~/miniconda3/lib/libmpich.so (among other MPI-related DLLs that AFAIU are not needed, e.g. for Fortran etc.)

2/ Inside ~/miniconda3/lib/ we need to create a symlink named libmpich.so.12 that points to libmpich.so: ln -s libmpich.so libmpich.so.12, because MODL expects libmpich.so.12

3/ Launch MODL with the LD_LIBRARY_PATH set to /absolute/path/to/miniconda3/lib. Otherwise, unless libmpich.so.12 exists in /lib/<arch>/ system-wide, dynamic linking fails. This is true whether we are "inside" the conda environment or not. What being inside the conda environment brings us is that mpiexec is readily available in the path.
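Steps 2/ and 3/ can be sketched in Python as follows. The function names are hypothetical, and the library directory is passed in explicitly rather than guessed.

```python
import os
from pathlib import Path


def ensure_mpich_soname(lib_dir, soname="libmpich.so.12"):
    """Create <lib_dir>/libmpich.so.12 pointing to libmpich.so if it is absent."""
    link = Path(lib_dir) / soname
    if not link.exists() and not link.is_symlink():
        # Relative target, equivalent to: ln -s libmpich.so libmpich.so.12
        link.symlink_to("libmpich.so")
    return link


def modl_environment(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = str(lib_dir) + (os.pathsep + current if current else "")
    return env
```

The returned dictionary would be passed as `env=` to `subprocess.run` when launching MODL, so the calling process's own environment is left untouched.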

popescu-v commented 1 year ago

Aren't the libraries stored in the conda environment?

popescu-v commented 1 year ago

Yes, they are stored in the conda environment, but the MODL binary doesn't look them up in there by default. It's like the conda environment sets the PATH but not the LD_LIBRARY_PATH.

What the conda environment certainly brings us though, is the mpiexec executable, which is the one in the conda environment.
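As a sketch of how the runner could locate both pieces, the snippet below relies on the `CONDA_PREFIX` variable that conda sets on activation. The function name and the returned dictionary keys are hypothetical.

```python
import os
import shutil
from pathlib import Path


def conda_paths(environ=None):
    """Locate mpiexec and the library directory of the active conda environment."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("CONDA_PREFIX")
    if not prefix:
        return None  # no conda environment is active
    prefix = Path(prefix)
    return {
        # Search only the environment's own bin directory.
        "mpiexec": shutil.which("mpiexec", path=str(prefix / "bin")),
        # Directory that LD_LIBRARY_PATH would have to include for MODL.
        "lib_dir": prefix / "lib",
    }
```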

popescu-v commented 1 year ago

OK, so is it possible to set LD_LIBRARY_PATH via the conda script?

popescu-v commented 1 year ago

Or at least modify khiops-env at install time so that everything is OK?

popescu-v commented 1 year ago

First off, conda does not set the LD_LIBRARY_PATH by default upon activating the environment. However, there are two possibilities:

1/ [not applicable in our case on its own] tweaking the environment itself, by adding activate.d/env_vars.sh (for variable set-up) and deactivate.d/env_vars.sh (for variable unset) to the environment (see https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux)

2/ [applicable in our case, IMHO, by also leveraging 1/] customizing the installation process, by adding pre-link / post-link scripts for variable setup upon package installation, and post-unlink scripts for variable unset after uninstall (see https://docs.conda.io/projects/conda-build/en/latest/resources/link-scripts.html).

I would go for using:

For a discussion, see also https://stackoverflow.com/q/46826497.
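For possibility 1/, the two scripts could be generated as in the sketch below. The `etc/conda/activate.d` and `etc/conda/deactivate.d` locations follow the conda documentation linked above; the script contents, the `_KHIOPS_OLD_LD_LIBRARY_PATH` variable and the function name are assumptions made for illustration.

```python
from pathlib import Path

# Prepend the environment's lib directory, remembering the previous value.
ACTIVATE_SCRIPT = """#!/bin/sh
export _KHIOPS_OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
"""

# Restore the previous value when the environment is deactivated.
DEACTIVATE_SCRIPT = """#!/bin/sh
export LD_LIBRARY_PATH="${_KHIOPS_OLD_LD_LIBRARY_PATH}"
unset _KHIOPS_OLD_LD_LIBRARY_PATH
"""


def write_env_scripts(prefix):
    """Write env_vars.sh into the activate.d and deactivate.d hook directories."""
    written = []
    hooks = (("activate.d", ACTIVATE_SCRIPT), ("deactivate.d", DEACTIVATE_SCRIPT))
    for hook, body in hooks:
        hook_dir = Path(prefix) / "etc" / "conda" / hook
        hook_dir.mkdir(parents=True, exist_ok=True)
        script = hook_dir / "env_vars.sh"
        script.write_text(body)
        written.append(script)
    return written
```

A post-link script could call the same logic at install time, which is how possibilities 1/ and 2/ would combine.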

popescu-v commented 1 year ago

Regarding the khiops-env, khiops and khiops_coclustering scripts, IMHO there are two options:

1/ tweaking the three scripts from the post-link script (which is run on conda install so that the LD_LIBRARY_PATH is set); this is the most conservative approach with respect to the current status, but:

2/ [as written in the previous comment] we only tweak the conda environment, make do without the three khiops* scripts and replicate their logic in Python as needed (this had already been started previously); this requires adding more logic to the local runner, but IMHO it is more portable between OSes and less involved from the Conda scripts perspective.

popescu-v commented 1 year ago

First trial:

popescu-v commented 1 year ago

As of now, Conda packaging has been implemented for PyKhiops + vendored Khiops, which supports Ubuntu Linux.

Two Conda packages have been implemented for this:

Note 1 The current setup of having the Conda packages based on the Pip/setuptools mechanics has the advantage that it allows for three installation scenarios:

1/ the user already has Khiops and MPI support installed system-wide via the Ubuntu / Debian package; in this case pip install pykhiops just performs a source-install of the PyKhiops Python library in the current Python virtualenv; this is the current officially supported scenario, which will continue to be supported

2/ the user already has MPI support installed system-wide via the Ubuntu / Debian mpich package; in this case, pip install pykhiops-bin installs vendored Khiops binaries in the current Python virtualenv, and pip install pykhiops installs the PyKhiops Python library as in scenario 1/. Please note that pykhiops-bin is not a dependency of the pykhiops Pip package, so that scenario 1/ is also supported with the same pykhiops Pip package; vendored Khiops support must be explicitly installed

3/ the user has neither Khiops, nor MPI installed system-wide; in this case, conda install pykhiops installs the mpich Conda package that provides MPI support within the Conda environment, pykhiops-bin which provides vendored Khiops support within the Conda environment, as well as PyKhiops Conda dependencies pandas and scikit-learn.

Note 2 The vendored Khiops installed in a Python virtualenv or in a Conda environment always takes precedence over any system-wide Khiops installation. Likewise, the MPI support installed in a Conda environment always takes precedence over any system-wide MPICH installation.
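The precedence rule in Note 2 amounts to a lookup order, sketched below. The function name and the idea of a dedicated vendored directory are assumptions; the sketch only shows "vendored first, system-wide second".

```python
import shutil
from pathlib import Path


def find_modl(vendored_bin_dir, name="MODL", system_path=None):
    """Return the vendored binary if present, else the one found on the PATH."""
    vendored = Path(vendored_bin_dir) / name
    if vendored.is_file():
        return str(vendored)
    # system_path=None makes shutil.which use the PATH environment variable.
    return shutil.which(name, path=system_path)
```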

popescu-v commented 1 year ago

Meeting with Bruno G.:

popescu-v commented 1 year ago

With respect to optional dependencies, if no mechanism as convenient as pip's extras is available, I would just leave the khiops conda package as the only option and document which of those options would be available by default.

popescu-v commented 1 year ago

What does "by default" mean here? In the Conda world, normally, a dependency is either required or it is not at all; AFAIK, there is no default behavior vs. customizable behavior. What we can do however, AFAIU, is to specify a package, plus metapackages on top of it, which aggregate it with other Conda packages.

popescu-v commented 1 year ago

Experiment attempted today: installing the pykhiops (with the pykhiops-bin Conda dependency) Conda packages on CentOS Stream 8.

Conclusions:

popescu-v commented 1 year ago

Isn't it easier to have a CentOS image to make the CentOS build?

popescu-v commented 1 year ago

It should be, AFAIU. And it should be doable in a Docker container IMHO (if all we need is cmake, libc, mpich, plus conda of course).

popescu-v commented 1 year ago

But the cross-compilation path could be potentially interesting for other, less accessible targets, like Windows or MacOS.

popescu-v commented 1 year ago

Cross-compilation seems easier for unix-likes. For Windows it seems very difficult.

popescu-v commented 1 year ago

Right, apparently even in the article referenced above, the Windows installer needs to be built on Windows. The advantage of Conda constructor, however, seems to be that it abstracts away many of the nitty-gritty details of building the Conda installer.

popescu-v commented 1 year ago

The PyKhiops Conda package seems to work with the native (Ubuntu) khiops-bin Conda package provided here by @bruno.guerraz : https://repos.tech.orange/ui/native/khiops-virt-conda-stable/

The executables are in place, pk-status works; all pykhiops sklearn end-to-end tests are green, with fixture-based mocks disabled and native Khiops calls enabled.

Thus, with the native khiops-bin package we obtain the same behaviour, on Ubuntu, as with the pykhiops-bin package (which extracts the binaries from the Debian package). The only difference is in the Khiops version (10.1.1 for pykhiops-bin because of the version-finding heuristic, 10.1.3 for khiops-bin). This is OK for the pykhiops package, as it only needs to specify khiops-bin>=10.1.1 (that is, at least the version of pykhiops itself).
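The constraint khiops-bin>=10.1.1 boils down to a comparison of version tuples. The sketch below handles only plain dotted numeric versions, which is a simplification; the function names are hypothetical, and real tooling would normally rely on the package manager's own version handling.

```python
def parse_version(text):
    """Convert '10.1.3' into (10, 1, 3); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """True when the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(satisfies_minimum("10.1.3", "10.1.1"))
```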

popescu-v commented 1 year ago

The pykhiops Conda package works, along with the khiops-bin Conda package, on both Ubuntu 22.04 and CentOS Stream 7 Docker containers: python -m tests.test_samples works as expected in both containers, both with the two Conda packages installed in a Conda environment.

Hence, the khiops-bin package (as built by @bruno.guerraz on Ubuntu) transparently works in both the Ubuntu and CentOS Docker containers. AFAIU, this is because Conda does some patchelfs on the MODL and MODL_Coclustering binaries, so that they work with the Conda-provided dynamic library dependencies, like GLIBC, MPICH, etc.

Now we need to industrialize all this:

1/ Create relevant Khiops repository on the artifactory; perhaps https://repos.tech.orange/ui/native/khiops-virt-conda-{stable,unstable}/ will do?

2/ Create Conda channel in this repository; AFAIU, this is supported by JFrog.

3/ Push, via cURL, the *.bz2 packages to this channel; this should allow regenerating the Conda channel index here.
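For step 3/, a Python equivalent of a `curl -T` upload could look like the sketch below. It assumes the repository accepts HTTP PUT uploads and bearer-token authentication; the host, the `linux-64` subdirectory, the file name and the function name are all hypothetical. The request is only built here, not sent.

```python
import urllib.request


def build_upload_request(channel_url, subdir, filename, payload, token):
    """Prepare (without sending) a PUT request that uploads one package file."""
    url = f"{channel_url.rstrip('/')}/{subdir}/{filename}"
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Authorization": f"Bearer {token}"},
    )


# Hypothetical values, for illustration only.
request = build_upload_request(
    "https://artifactory.example.org/khiops-conda",
    "linux-64",
    "khiops-bin-10.1.3-0.tar.bz2",
    b"package bytes",
    "SECRET_TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of calling `urllib.request.urlopen(request)` from the CI job, with the token read from a protected CI variable.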

popescu-v commented 1 year ago

Following this week's discussions, the following steps are to be done:

1/ Precisely list the supported target platforms:

3/ Push the current khiops-bin Conda package to khiops-labs-virt-conda-unstable, so that we can CI/CD-ize the building of the current pykhiops, pykhiops.s3 and pykhiops.gcs Conda packages and their pushing to the khiops-labs-virt-conda-unstable repository (the khiops-bin package is needed even for building the pykhiops* packages)

4/ Add vendored Khiops support for Windows and Mac OS in the PyKhiops Python runner; to this end:

5/ Sync-up with @bruno.guerraz on Khiops releases: should both khiops-bin and pykhiops Conda packages be hosted on the same channel, khiops-virt-conda-{stable,unstable} (not khiops-labs-virt-conda-*)?

popescu-v commented 1 year ago

Regarding architecture, should it be compatible with both x86 and ARM on macOS? ARM is "obvious", but x86 Macs were still manufactured last year...
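Whichever architectures end up supported, the runner or the build scripts can tell them apart with the standard `platform` module. The sketch below maps a few system and machine pairs to the corresponding Conda platform subdirectories; the selection of platforms is an assumption, not a decision recorded in this thread.

```python
import platform

# Conda platform subdirectories for a few (system, machine) pairs.
CONDA_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def conda_subdir(system=None, machine=None):
    """Return the Conda subdir for a platform, or None if it is not listed."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return CONDA_SUBDIRS.get((system, machine))
```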