Closed: popescu-v closed this issue 1 year ago.
Further questions: for `pykhiops`, why do we need both source and wheel distributions? Wheel for consumers (potentially via `pykhiops-full`, but also by using a preexisting Khiops installation), and source for developers?

First idea: override `install`, so that the Khiops package is retrieved when `pip install`-ing pyKhiops. However, as authenticated access to the artifactory is needed, a more doable way seems to be to override `build_ext`, which can retrieve and embed Khiops within the pyKhiops wheel. Indeed, `python setup.py bdist` is done in controlled settings (in the GitLab CI/CD) where authentication to the artifactory is not an issue.

Second idea: as we vendor Khiops at build time, we can build the Python package wheel within the `pykhiops-cicd` Docker image, where we have access to the Khiops Debian repos, hence we can do `apt-get download khiops-core=<relevant_version>`.
To sum up the two ideas:

1/ We override `build_py`'s `run` method in `setup.py` to (see the sketch below):

- `apt-get download` the determined Khiops version, according to the pyKhiops version and to the OS (platform, version), into the build directory; if this fails, then fail the build;
- `dpkg-deb -R` the contents of the Khiops package and copy the scripts (`khiops` and `khiops_coclustering`), the binaries (`MODL` and `MODL_Coclustering`) and the jar files to the build directory, so that they are included in the binary distribution wheel.

2/ We override the `py-package` job of the `build` CI/CD stage to do `setup.py sdist`. The bdist is done in the `py-custom-release` job of the `publish` stage.
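A minimal sketch of such a `build_py` override, assuming a hypothetical helper `_find_khiops_version()` that maps the pyKhiops version and OS to a `khiops-core` version, and an assumed `pykhiops/_vendor` layout (none of this is the actual implementation):

```python
# setup.py (sketch): vendor khiops-core into the wheel at build time.
import shutil
import subprocess
from pathlib import Path

from setuptools import setup
from setuptools.command.build_py import build_py


def _find_khiops_version():
    # Hypothetical helper: map the pyKhiops version and the OS to a khiops-core version.
    return "10.1.1-1"


class VendorKhiopsBuildPy(build_py):
    def run(self):
        super().run()
        vendor_dir = Path(self.build_lib) / "pykhiops" / "_vendor"
        vendor_dir.mkdir(parents=True, exist_ok=True)

        # Download the matching khiops-core Debian package; fail the build otherwise.
        subprocess.run(
            ["apt-get", "download", f"khiops-core={_find_khiops_version()}"],
            cwd=vendor_dir, check=True,
        )
        deb_file = next(vendor_dir.glob("khiops-core_*.deb"))

        # Extract the package, then copy scripts, binaries and jars into the build tree.
        extract_dir = vendor_dir / "khiops-core"
        subprocess.run(["dpkg-deb", "-R", str(deb_file), str(extract_dir)], check=True)
        for name in ("khiops", "khiops_coclustering", "MODL", "MODL_Coclustering"):
            for found in extract_dir.rglob(name):
                if found.is_file():
                    shutil.copy2(found, vendor_dir / found.name)
        for jar in extract_dir.rglob("*.jar"):
            shutil.copy2(jar, vendor_dir / jar.name)


setup(cmdclass={"build_py": VendorKhiopsBuildPy})
```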
Thus, a wheel package is generated, which contains the Khiops files (driver scripts, `MODL` binaries and jar files). We need to add two entry points which have the same names as, and call, the `khiops` and `khiops_coclustering` launch scripts, and perhaps another two entry points for the `MODL` and `MODL_Coclustering` binaries. The reason is that these launch scripts and binaries are installed in the package install directory (inside the virtualenv's `lib/site-packages`), but they need to be available on the path (inside the virtualenv's `bin` directory). The same goes for the two JARs: they need to be available in the virtualenv's `share` directory.
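For illustration, such entry points could be declared as follows in `setup.py`; the `pykhiops.vendor` module and its `run_khiops*` wrapper functions are assumptions, standing for code that would locate the vendored scripts under `site-packages` and delegate to them:

```python
# setup.py excerpt (sketch): expose the vendored launchers on the virtualenv's PATH.
from setuptools import setup

setup(
    entry_points={
        "console_scripts": [
            # Hypothetical wrappers around the vendored launch scripts.
            "khiops=pykhiops.vendor:run_khiops",
            "khiops_coclustering=pykhiops.vendor:run_khiops_coclustering",
        ]
    }
)
```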
These actions take place within the pyKhiops CICD Docker image, hence the Khiops repositories are available.
Thus, we do not need to override the `install` command's `run` method in `setup.py` to copy the Khiops scripts, binaries and jar files packaged during build to the installation directory.
Question: should we also support Linux packaging options other than dpkg/apt (e.g. yum, apk, pacman...)?
One important point: the Bash scripts `khiops-env`, `khiops` and `khiops_coclustering` assume that the Khiops package has been installed so that the binaries are available in `/usr/bin` and the Java libraries in `/usr/share` (note the absolute paths in both cases). However, this does not work with pip-based vendored installs (usually inside virtualenvs).

Hence, the logic in these three scripts should be replicated in two Python entry points (for setuptools), so that the appropriate environment is set up, irrespective of the absolute paths where the `MODL*` binaries and Java JARs are located.
After further analysis and discussions, it might be better to replicate the logic of `khiops-env` into a Python entry point. Thus, the actual paths to the Khiops binaries can remain in the package's installation directory. The `khiops` and `khiops_coclustering` scripts are not needed, and neither are the two Khiops JAR libraries (as they are not required by `MODL` when used in batch mode, as is the case with pyKhiops).
The entry point is not necessary, and I don't think we need to replicate the script exactly. This is because `khiops-env` is used in "dump" mode: it writes the environment variables to stdout and the pyKhiops runner parses this output. So instead of an entry point, which would set the variables and then output them, a special method such as `_initialize_environment_from_vendored_khiops` could do this job. Note that the relevant environment variables are all stored in the runner.
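To make the contrast concrete, here is a rough sketch of both approaches; the exact `khiops-env` invocation, its `KEY=VALUE` dump format and the runner attribute names are assumptions, not the actual pyKhiops interfaces:

```python
# Sketch only: contrast parsing the khiops-env dump with setting variables directly.
import subprocess


def parse_khiops_env_dump(khiops_env_path):
    """Assumed current mechanism: khiops-env prints KEY=VALUE lines that get parsed."""
    output = subprocess.run(
        [khiops_env_path], capture_output=True, text=True, check=True
    ).stdout
    env = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env


class RunnerSketch:
    """Assumed alternative: the runner computes and stores the variables itself."""

    def __init__(self):
        self.khiops_env = {}

    def _initialize_environment_from_vendored_khiops(self, vendored_bin_dir):
        # No shell script involved: the values are computed in Python.
        self.khiops_env["KHIOPS_PATH"] = vendored_bin_dir
```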
Yes, indeed, if there is consensus on changing the pykhiops (runner) code; we can then also set the appropriate paths to the `MODL` and `MODL_Coclustering` binaries.

My suggestion of the entry point assumed that the pykhiops code base proper would not be changed in any way (at least while in POC mode, pending the validation of the vendoring approach itself, including on other target operating systems).
As for the replication, IMHO the CPU and MPI-related logic and the path setting are all that needs to be replicated (either in the special runner method or in the entry point).
The runner is the place where these changes must go, because it is its responsibility to set the execution environment. If you want to do it with an entry point as a prototype that is OK, but that code should go into the runner in the final version.
Yes.
Concerning the replication of the `khiops-env` functionality (a sketch follows this list):

- `KHIOPS_PROC_NUMBER` can be determined (as the number of physical CPUs) via `psutil.cpu_count(logical=False)`; but for this, we need to install `psutil` as an extra dependency! Otherwise, we need to resort to a `subprocess.Popen` of `lscpu`.
- `KHIOPS_PATH` should be set to the correct `f"{os.path.dirname(sys.executable)}/lib/pythonx.y/site-packages/pykhiops/usr/bin"` path.
- Similarly for `KHIOPS_CLASSPATH` and the JAR files; we need to replicate this in order to support non-batch Khiops launches from within `pykhiops.core`.
- The MPI-related functionality is trickier, as we need to make sure the appropriate MPI implementation library (the same version Khiops has been compiled with) is made available; just extracting the relevant parts from the OS mpich package will not do, as this package also has dependencies which would need to be handled. A potential solution is to rely on the Intel MPI library and its pip packages: https://pypi.org/project/impi/, for Linux (https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/pip.html#pip) and for Windows (https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-windows/top/installation/install-using-package-managers/pip.html).
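A minimal sketch of how the first three variables could be computed in pure Python for a vendored install; the `pykhiops/usr` layout, the `share/khiops` jar location and the psutil/lscpu fallback are assumptions for illustration:

```python
# Sketch (assumed layout and names): compute khiops-env variables without the shell script.
import os
import subprocess
import sysconfig


def _physical_cpu_count():
    """KHIOPS_PROC_NUMBER: physical CPUs via psutil if available, else via lscpu."""
    try:
        import psutil
        return psutil.cpu_count(logical=False)
    except ImportError:
        out = subprocess.run(
            ["lscpu", "--parse=Core,Socket"], capture_output=True, text=True, check=True
        ).stdout
        # Unique (core, socket) pairs = number of physical cores.
        return len({line for line in out.splitlines() if not line.startswith("#")})


def _vendored_khiops_env():
    """Environment a vendored install would need, rooted at site-packages (assumed layout)."""
    site_packages = sysconfig.get_paths()["purelib"]
    vendor_root = os.path.join(site_packages, "pykhiops", "usr")
    return {
        "KHIOPS_PROC_NUMBER": str(_physical_cpu_count()),
        "KHIOPS_PATH": os.path.join(vendor_root, "bin"),
        # Only needed for non-batch launches that require the Java jars.
        "KHIOPS_CLASSPATH": os.path.join(vendor_root, "share", "khiops"),
    }
```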
Anyway, we could also build two packages:

- `python-khiops`, which would make the Khiops binaries pip-installable from a wheel;
- `pykhiops`, which would depend on `python-khiops` and install pykhiops itself; the pykhiops runner could then detect whether a vendored Khiops exists in the current environment (virtualenv, venv, conda, ...) and, if so, properly set the variables indicated above.

The advantage of this is that the version of the `python-khiops` pip package can exactly match the Khiops version itself, and that `pykhiops` can depend on specific versions (or minimum versions) of `python-khiops`. Thus, no hack would be necessary to determine the compatible Khiops version for a specific version of `pykhiops`.

I would still name it `pykhiops-bin`.

OK for `pykhiops-bin`.
Roadmap of a potential solution for Windows:

- a `bin` subdirectory which will be copied as data by `setuptools`;
- `MODL` executables.

Meeting with LAG: Defined the roadmap for a first version of the solution to this issue:

- `pk-status` to show this info;
- `pykhiops-cicd` docker.

Technically, for the Linux packaging story, this is what remains to be done:

- `khiops-env` script functionality replication in the "environment initialisation" code of PyKhiops (when a vendored Khiops installation is detected);
- `pk-status` entry point script;
- launch the `MODL` and `MODL_Coclustering` executables sequentially and issue a warning to the user that this is so (and why).

Meeting with LAG, BG, SG, VP: Redefined again the short-term roadmap:

- drop `pip` in favor of a `conda` packaging (VP leads the exploration).
Questions for this exploration:

- `.deb` Linux packages?
- do `conda` and `conda-forge` have enough dependencies for our purpose (MPI and Java)?
- `conda-forge`, or should we have our own repository (does GitHub have this option)?
- `conda install khiops` or `conda install pykhiops`?
IMHO, we should also check that we can do `pip install` inside a Conda environment, to make sure we can smoothly mix Conda packages with Pip packages.
On the PyKhiops side, IMHO the short-term roadmap could be:
Apparently, OpenMPI can be statically built (although the default is to build it dynamically): https://www.open-mpi.org/faq/?category=building#static-build.
What is the impact of this?
I expect this to facilitate vendoring MPI on Linux as well: AFAIU, Khiops would have its own copy of statically-linked OpenMPI which could be vendored without needing a system-wide install.
So with this solution we could vendor it with pip as well, no?

And should the compilation of MPI be included in that of Khiops?
AFAIU, this is a possibility, but IMHO we would need to change the Khiops release (and compilation) process itself. However, we should check whether there is any obstacle to statically linking Khiops to OpenMPI (some code changes would be needed AFAIK, as DLL calling is more involved than statically-linked library calling, which is seamless).
Note that compiling MPI statically has drawbacks and is discouraged (see https://docs.open-mpi.org/en/v5.0.x/building-apps/building-static-apps.html). Notably, it would make the application fatter, and so would each spawned process (but maybe this is not an issue).
Yep. I'm also considering keeping the dynamic link to libmpich and, at build time, doing the following (a sketch of the launch step follows the list):
- in the `DEBIAN/control` file, check the minimal `libmpich12` version required, as well as the `mpich` version required (this last one is required for doing `mpiexec` in `khiops-env` or an equivalent Python port in the `LocalRunner`);
- `apt-cache show libmpich`, grep `Filename` and extract the actual full name of the `libmpich12` Debian package;
- download the `.deb` file of the `libmpich12` package (as done for the `khiops-core` package);
- put the `libmpich.so.12.x.y` shared library in the virtualenv and create a symlink `libmpich.so.12` to the actual DLL;
- in the `PyKhiopsLocalRunner`, set `LD_LIBRARY_PATH` to point to the virtualenv location where the `libmpich12` DLL has been put, so that the `MODL` executable can find the DLL;
- do the same for the `mpiexec` binary extracted from the `mpich` Debian package, with the difference that this time the `PATH` variable would be set to point to the location of the `mpiexec` binary (e.g. through the `env` argument passed to `Popen`).
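A minimal sketch of the launch step, i.e. running the vendored binaries with `LD_LIBRARY_PATH` and `PATH` pointing into the virtualenv via the `env` argument of `Popen`; the `pykhiops-bin` directory layout below is an assumption for illustration:

```python
# Sketch: run the vendored MODL through the vendored mpiexec with a modified environment.
import os
import subprocess
import sysconfig

site_packages = sysconfig.get_paths()["purelib"]
vendor_root = os.path.join(site_packages, "pykhiops-bin")   # assumed layout
vendor_bin = os.path.join(vendor_root, "usr", "bin")        # mpiexec and MODL
vendor_lib = os.path.join(vendor_root, "usr", "lib")        # libmpich.so.12 symlink

env = os.environ.copy()
env["LD_LIBRARY_PATH"] = vendor_lib + os.pathsep + env.get("LD_LIBRARY_PATH", "")
env["PATH"] = vendor_bin + os.pathsep + env.get("PATH", "")

# Both mpiexec and the shared library are resolved through the modified environment.
proc = subprocess.Popen(
    ["mpiexec", "-n", "4", os.path.join(vendor_bin, "MODL"), "-v"], env=env
)
proc.wait()
```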
Using conda, we can have:
0/ Activate the conda environment: `conda activate <environment_name>`.

1/ `conda install mpich`. This installs `~/miniconda3/bin/mpiexec` and `~/miniconda3/lib/libmpich.so` (among other MPI-related DLLs that AFAIU are not needed, e.g. for Fortran etc.).

2/ Inside `~/miniconda3/lib/` we need to create a symlink from `libmpich.so` to `libmpich.so.12`: `ln -s libmpich.so libmpich.so.12`, because `MODL` expects `libmpich.so.12`.

3/ Launch `MODL` with `LD_LIBRARY_PATH` set to `/absolute/path/to/miniconda3/lib`. Otherwise, unless `libmpich.so.12` exists system-wide in `/lib/<arch>/`, the dynamic linking fails. This is true whether we are "inside" the conda environment or not. What being inside the conda environment brings us is that `mpiexec` is readily on the path. (A sketch of steps 2/ and 3/ follows.)
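For illustration, steps 2/ and 3/ done from Python inside the active conda environment, using `CONDA_PREFIX` to locate the lib directory (this is a sketch, not the pyKhiops implementation):

```python
# Sketch: make libmpich.so.12 resolvable for MODL inside a conda environment.
import os

conda_prefix = os.environ["CONDA_PREFIX"]        # set by `conda activate`
lib_dir = os.path.join(conda_prefix, "lib")
target = os.path.join(lib_dir, "libmpich.so")    # installed by `conda install mpich`
link = os.path.join(lib_dir, "libmpich.so.12")   # soname expected by MODL

# Step 2/: create the missing symlink once.
if os.path.exists(target) and not os.path.exists(link):
    os.symlink(target, link)

# Step 3/: MODL only finds the library if LD_LIBRARY_PATH points at the conda lib directory.
env = os.environ.copy()
env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + env.get("LD_LIBRARY_PATH", "")
```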
Aren't the libraries stored in the conda environment?

Yes, they are stored in the conda environment, but the `MODL` binary doesn't look them up there by default. It's as if the conda environment sets the `PATH` but not the `LD_LIBRARY_PATH`.

What the conda environment certainly brings us, though, is the `mpiexec` executable, which is the one in the conda environment.

OK, so is it possible to set `LD_LIBRARY_PATH` via the conda script? Or at least to modify `khiops-env` at install time so that everything is OK?
First off, conda does not set `LD_LIBRARY_PATH` by default upon activating the environment.

However, there are two possibilities:

1/ [not applicable in our case on its own] tweaking the environment itself, by adding `activate.d/env_vars.sh` (for variable set-up) and `deactivate.d/env_vars.sh` (for variable unset) to the environment (see https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux);

2/ [applicable in our case, IMHO, by also leveraging 1/] customizing the installation process, by adding `pre-link` / `post-link` scripts for variable setup upon package installation, and `post-unlink` scripts for variable unset after uninstall (see https://docs.conda.io/projects/conda-build/en/latest/resources/link-scripts.html).
I would go for using (a sketch follows the list):

- `post-link` for tweaking the current conda environment (given by `CONDA_PREFIX`) by adding the `{activate,deactivate}.d/env_vars.sh`, just after installing the khiops package;
- `post-unlink` for canceling the `env_vars.sh` tweaks just before uninstalling the package.

For a discussion, see also https://stackoverflow.com/q/46826497.
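For illustration, here is roughly what such a `post-link` step could write into the environment, expressed in Python for consistency with the other sketches (a real `post-link.sh` would do the equivalent in shell; the hook file name is a placeholder):

```python
# Sketch: create activate.d/deactivate.d hooks that export/restore LD_LIBRARY_PATH.
import os

# conda-build link scripts receive the target environment as $PREFIX;
# CONDA_PREFIX is a fallback when experimenting by hand in an active environment.
prefix = os.environ.get("PREFIX") or os.environ["CONDA_PREFIX"]
activate_dir = os.path.join(prefix, "etc", "conda", "activate.d")
deactivate_dir = os.path.join(prefix, "etc", "conda", "deactivate.d")
os.makedirs(activate_dir, exist_ok=True)
os.makedirs(deactivate_dir, exist_ok=True)

with open(os.path.join(activate_dir, "khiops_env_vars.sh"), "w") as f:
    f.write(
        'export _KHIOPS_OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"\n'
        'export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"\n'
    )

with open(os.path.join(deactivate_dir, "khiops_env_vars.sh"), "w") as f:
    f.write(
        'export LD_LIBRARY_PATH="${_KHIOPS_OLD_LD_LIBRARY_PATH:-}"\n'
        'unset _KHIOPS_OLD_LD_LIBRARY_PATH\n'
    )
```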
Regarding the `khiops-env`, `khiops` and `khiops_coclustering` scripts, IMHO there are two options:

1/ tweaking the three scripts from the `post-link` script (which is run on `conda install`, so that `LD_LIBRARY_PATH` is set); this is the most conservative approach with respect to the current status, but it requires:
- putting the `khiops*` script tweaking logic in a Python function that would be called from the Linux and Windows versions of the `post-link` script;
- adding, to the `PyKhiopsLocalRunner`, code for distinguishing between a system-wide Khiops installation (e.g. via the Debian / Ubuntu package or the Windows installer) and a vendored Khiops installation via conda.

2/ [as written in the previous comment] we only tweak the conda environment, make do without the three `khiops*` scripts and replicate their logic in Python as needed (this had already been started previously); this requires adding more logic to the local runner (a sketch of the detection part follows), but IMHO it is more portable between OSes and less involved from the conda scripts perspective.
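A minimal sketch of the detection logic option 2/ would add to the local runner; the vendored package name and directory layout are assumptions:

```python
# Sketch: prefer a vendored Khiops (conda/pip) over a system-wide installation.
import os
import shutil
import sysconfig


def find_modl_binary():
    # 1/ vendored install inside the current environment (assumed pykhiops-bin layout)
    site_packages = sysconfig.get_paths()["purelib"]
    vendored = os.path.join(site_packages, "pykhiops-bin", "usr", "bin", "MODL")
    if os.path.isfile(vendored):
        return vendored
    # 2/ system-wide install (Debian/Ubuntu package or Windows installer)
    system_wide = shutil.which("MODL")
    if system_wide:
        return system_wide
    raise RuntimeError("No Khiops installation found (neither vendored nor system-wide)")
```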
First trial:

- upon building the Conda package: the Pip package is packaged into the Conda package;
- upon installing the conda package, the `post-link.sh` script creates a wrapper of the form `LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH $STDLIB_DIR/site-packages/pykhiops-bin/MODL "$@"`, and similarly for `MODL_Coclustering` (see the sketch below).
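For illustration, the wrapper generation that such a `post-link.sh` amounts to, written here in Python for consistency with the other sketches (the `pykhiops-bin` location mirrors the line above; the resolved prefix is baked into the wrapper so it keeps working outside the link script):

```python
# Sketch: generate PATH-visible wrappers that run the vendored binaries with the conda lib dir.
import os
import sysconfig

prefix = os.environ.get("PREFIX") or os.environ["CONDA_PREFIX"]
stdlib_dir = sysconfig.get_paths()["stdlib"]

for name in ("MODL", "MODL_Coclustering"):
    real_binary = os.path.join(stdlib_dir, "site-packages", "pykhiops-bin", name)
    wrapper_path = os.path.join(prefix, "bin", name)
    with open(wrapper_path, "w") as f:
        f.write(
            "#!/bin/sh\n"
            f'LD_LIBRARY_PATH="{prefix}/lib:$LD_LIBRARY_PATH" exec "{real_binary}" "$@"\n'
        )
    os.chmod(wrapper_path, 0o755)
```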
As of now, Conda packaging has been implemented for PyKhiops + vendored Khiops, which supports Ubuntu Linux.
Two Conda packages have been implemented for this:
- `pykhiops-bin`, which:
  - builds against the `python`, `setuptools` and `wheel` Conda packages;
  - builds the `setuptools` / Pip wheel file of the package; to this end, it uses the `setup.py` file to:
    - download the `khiops-core` package whose version most closely matches the version of the specified `pykhiops-bin` Conda package (which is the same as the version of the latest PyKhiops release);
    - extract the `MODL` and `MODL_Coclustering` binaries;
  - packages the result as the `pykhiops-bin` Conda package;
  - depends on the `mpich` and `python` Conda packages;
  - installs the `pykhiops-bin` Conda package in the current Conda environment;
  - puts the `MODL` and `MODL_Coclustering` executables in the current `PATH` of the current Conda environment.

- `pykhiops`, which:
  - builds against the `python` and `setuptools` Conda packages;
  - packages the `pykhiops` Pip package into the eponymous `pykhiops` Conda package;
  - depends on the `pykhiops-bin` Conda package, as well as on the other Python Conda packages (`scikit-learn` and `pandas`);
  - installs the `pykhiops` Conda package in the current Conda environment.

Note 1: The current setup of having the Conda packages based on the Pip/setuptools mechanics has the advantage that it allows for three installation scenarios:
1/ The user already has Khiops and MPI support installed system-wide via the Ubuntu / Debian package; in this case `pip install pykhiops` just performs a source install of the PyKhiops Python library in the current Python virtualenv; this is the current officially supported scenario, which will continue to be supported.

2/ The user already has MPI support installed system-wide via the Ubuntu / Debian `mpich` package; in this case, `pip install pykhiops-bin` installs the vendored Khiops binaries in the current Python virtualenv, and `pip install pykhiops` installs the PyKhiops Python library as in scenario 1/. Please note that `pykhiops-bin` is not a dependency of the `pykhiops` Pip package, so that scenario 1/ is also supported with the same `pykhiops` Pip package; vendored Khiops support must be explicitly installed.

3/ The user has neither Khiops nor MPI installed system-wide; in this case, `conda install pykhiops` installs the `mpich` Conda package that provides MPI support within the Conda environment, `pykhiops-bin` which provides vendored Khiops support within the Conda environment, as well as the PyKhiops Conda dependencies `pandas` and `scikit-learn`.
Note 2: The vendored Khiops installed in a Python virtualenv or in a Conda environment always takes precedence over any system-wide Khiops installation. Likewise, the MPI support installed in a Conda environment always takes precedence over any system-wide MPICH installation.
Meeting with Bruno G.:

- the Khiops Conda package would provide the `MODL` and `MODL_Coclustering` binaries, plus the MPI support;
- `pykhiops-bin` (which currently relies on the OS packages, viz. Ubuntu / Debian Linux) would be superseded by the Khiops conda package (see previous point);
- the `pykhiops` Conda package would depend on the Khiops Conda package; the version of this dependency can be specified in a standard way in the Conda `pykhiops` package's metadata; this would be less brittle than the current heuristic used for finding the "closest" version of the OS package that matches the current `pykhiops` version;
- `boto3` and the like, for remote access: unlike Pip (or Debian, for that matter), Conda does not support these directly; however, Conda outputs could be used [this needs to be explored]; see https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#outputs-section;
- the binaries should carry an `_mpi*` string in their name, viz. `MODL` should be called `MODL_mpich_...`; question: where can the `khiops` RPMs be found?

With respect to optional dependencies, if no mechanism as convenient as pip's `extras` is available, I would just leave the `khiops` conda package as the only option and document which of those options would be available by default.
What does "by default" mean here? In the Conda world, normally, a dependency is either required or it is not at all; AFAIK, there is no default behavior vs. customizable behavior. What we can do however, AFAIU, is to specify a package, plus metapackages on top of it, which aggregate it with other Conda packages.
Experiment attempted today: installing the `pykhiops` (with the `pykhiops-bin` Conda dependency) Conda packages on CentOS Stream 8.
Conclusions:
- the Conda package installation itself works as on Ubuntu;
- however, as expected, the `MODL` and `MODL_Coclustering` executables fail, because they have been obtained from the Ubuntu / Debian package and are thus not suited for running on CentOS; more precisely, `MODL -v` yields:

```
/<path/to/conda/environment>/lib/python3.11/site-packages/pykhiops-bin/usr/bin/MODL: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /<path/to/conda/environment>/lib/python3.11/site-packages/pykhiops-bin/usr/bin/MODL)
```

- however, if Khiops is built within Conda, the proper libc will be selected by Conda according to the Linux distribution it is compiled on.
Can we cross-compile Khiops (compile it on Ubuntu, targeting CentOS) from within Conda (using the Conda-based workflow)? See https://medium.com/@Amet13/building-a-cross-platform-python-installer-using-conda-constructor-f91b70d393 and https://conda.github.io/constructor/howto/ for a starting point.

Isn't it easier to have a CentOS image to make the CentOS build?
It should be, AFAIU. And it should be doable in a Docker container IMHO (if all we need is cmake, libc, mpich, plus conda of course).
But the cross-compilation path could be potentially interesting for other, less accessible targets, like Windows or MacOS.
Cross-compilation seems easier for unix-likes. For Windows it seems very difficult.
Right, apparently even in the article referenced above the Windows installer needs to be built on Windows. The advantage of Conda `constructor`, however, seems to be that it abstracts away much of the nitty-gritty of building the Conda installer.
The PyKhiops Conda package seems to work with the native (Ubuntu) `khiops-bin` Conda package provided here by @bruno.guerraz: https://repos.tech.orange/ui/native/khiops-virt-conda-stable/

The executables are in place, `pk-status` works, and all pykhiops sklearn end-to-end tests are green, with fixture-based mocks disabled and native Khiops calls enabled.
Thus, with the native `khiops-bin` package we obtain the same behaviour on Ubuntu as with the `pykhiops-bin` package (which extracts the binaries from the Debian package). The only difference is the Khiops version (10.1.1 for `pykhiops-bin` because of the version-finding heuristic, 10.1.3 for `khiops-bin`). This is OK for the `pykhiops` package, as it only needs to specify `khiops-bin>=10.1.1` (that is, the version of `pykhiops` itself).
The `pykhiops` Conda package works, along with the `khiops-bin` Conda package, on both Ubuntu 22.04 and CentOS Stream 7 Docker containers: `python -m tests.test_samples` works as expected in both containers, with the two Conda packages installed in a Conda environment.

Hence, the `khiops-bin` package (as built by @bruno.guerraz on Ubuntu) transparently works in both the Ubuntu and CentOS Docker containers. AFAIU, this is because Conda does some `patchelf`-ing on the `MODL` and `MODL_Coclustering` binaries, so that they work with the Conda-provided dynamic library dependencies, like GLIBC, MPICH, etc.
Now we need to industrialize all this:
1/ Create the relevant Khiops repository on the artifactory; perhaps https://repos.tech.orange/ui/native/khiops-virt-conda-{stable,unstable}/ will do?
2/ Create a Conda channel in this repository; AFAIU, this is supported by JFrog.
3/ Push, via cURL, the *.bz2 packages to this channel; this should allow the Conda channel index to be regenerated there (a sketch follows).
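A hedged sketch of what such a push could look like, driving cURL from Python against Artifactory's upload endpoint (the repository URL, channel subdirectory, artifact name and credential variables below are placeholders, not the actual settings):

```python
# Sketch: upload a conda package to an Artifactory-hosted conda channel (placeholder URL).
import os
import subprocess

package = "pykhiops-10.1.1-py311_0.tar.bz2"  # example artifact name
repo_url = "https://repos.example.org/artifactory/khiops-virt-conda-unstable"  # placeholder
subdir = "linux-64"  # conda platform subdirectory

subprocess.run(
    [
        "curl", "--fail",
        "-u", f"{os.environ['ARTIFACTORY_USER']}:{os.environ['ARTIFACTORY_TOKEN']}",
        "-T", package,
        f"{repo_url}/{subdir}/{package}",
    ],
    check=True,
)
```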
Following this week's discussions, the following steps are to be done:
1/ Precisely list the supported target platforms:
3/ Push the current `khiops-bin` Conda package to `khiops-labs-virt-conda-unstable`, so that we can CI/CD-ize the manufacture of the `pykhiops`, `pykhiops.s3` and `pykhiops.gcs` Conda packages and their pushing to the `khiops-labs-virt-conda-unstable` repository (the `khiops-bin` package is needed even for manufacturing the `pykhiops*` packages).
4/ Add vendored Khiops support for Windows and Mac OS in the PyKhiops Python runner; to this end:
5/ Sync up with @bruno.guerraz on Khiops releases: should both the `khiops-bin` and `pykhiops` Conda packages be hosted on the same channel, `khiops-virt-conda-{stable,unstable}` (not `khiops-labs-virt-conda-*`)?
Regarding architecture, should it be compatible with both x86 and ARM on macOS? ARM is "obvious", but x86 Macs were still being manufactured last year...
Description
From Khiops 10.1.0 on, there will be no need for a license to install it. In theory, this would allow vendoring Khiops within the pyKhiops package.
Ideas / Questions
- `pip`?
- `pykhiops` (source and wheel) and `pykhiops-full` (wheel). The latter will contain the vendored Khiops. `pykhiops-full` will depend on `pykhiops`.
- `setup.py` with the platform and architecture. For each platform and architecture: a `setup.py` script, probably overriding the `install` or `build_ext` steps.