Operon fails to launch from "evaluate_model.py" on Docker image

fc59283 commented 1 year ago

Hi there,

Trying to use Operon with "evaluate_model.py" on the Docker image returns the error:

» python evaluate_model.py /data/eq1.tsv -ml OperonRegressor -seed 42 -skip_tuning
(...)
ImportError: libpython3.9.so.1.0: cannot open shared object file: No such file or directory

whereas, for example: python evaluate_model.py /data/eq1.tsv -ml sembackpropgp -seed 42 -skip_tuning

does launch the regressor.

Any ideas?

PS: Operon also fails to launch on the "192_vineyard" dataset - where other methods do not.

Many thanks!

fc59283 commented 1 year ago

I suppose we could upgrade the Python version inside the Docker container's Conda environment... but won't this possibly break things and/or mess up reproducibility?

foolnotion commented 1 year ago

This looks like a linking error or possibly a problem with the conda environment.

If operon was installed inside a Conda environment inside the docker image, then theoretically it will have detected the existing python and linked against it. Running ldd on the pyoperon library file (which you will need to locate yourself - it should look like pyoperon.cpython-39-x86_64-linux-gnu.so) will tell you exactly which dependency is missing.
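The ldd check described above can be scripted. This is only a rough sketch: it assumes the module sits under site-packages/operon with a name starting with pyoperon (the layout that appears later in this thread), and the helper names are made up for illustration.

```python
import glob
import os
import shutil
import subprocess
import sysconfig


def find_extension_modules(package="operon", pattern="pyoperon*.so"):
    """List compiled modules under <site-packages>/<package> (assumed layout)."""
    site_packages = sysconfig.get_paths()["platlib"]
    return sorted(glob.glob(os.path.join(site_packages, package, pattern)))


def missing_dependencies(ldd_output):
    """Return the names of libraries that ldd marks as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing


def main():
    if shutil.which("ldd") is None:
        print("ldd is not available on this system")
        return
    for path in find_extension_modules():
        # ldd prints one dependency per line: "name => resolved path" or "name => not found"
        result = subprocess.run(["ldd", path], capture_output=True, text=True)
        print(path, "->", missing_dependencies(result.stdout) or "nothing missing")


if __name__ == "__main__":
    main()
```

Run it with the Python interpreter of the environment you want to inspect, so that sysconfig reports that environment's site-packages directory.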

You can try one of two things:

1. Install pyoperon with pip: pip install pyoperon
2. Compile the latest pyoperon from git:
   - install this minimal conda env https://gist.github.com/foolnotion/5525410773e50940a9a7cd23fa5a2e39
   - git clone https://github.com/heal-research/pyoperon.git
   - cd pyoperon
   - git switch cpp20
   - bash script/dependencies.sh
   - pip install .

I recommend option 2, which installs the current release candidate and contains a significant number of improvements.

fc59283 commented 1 year ago

Thanks for that. However, shouldn't SRBench's docker image, installed from the provided Dockerfile, have all methods functioning correctly "out of the box"? Even to reproduce the paper's results: I want to start there before upgrading individual methods.

My goal is to perform a regression with all of SRBench's methods on my specific data.

foolnotion commented 1 year ago

shouldn't SRBench's docker image, installed from the provided Dockerfile, have all methods functioning correctly "out of the box"?

It should, but it looks like you ran into a bug.

If you want to use the exact same version of operon you could simply rerun the script here https://github.com/cavalab/srbench/blob/master/experiment/methods/src/operon_install.sh

lacava commented 1 year ago

hi @fc59283 , yes the Docker should work. Can you provide the output of ldd as @foolnotion suggested to help us debug?

If your goal is to reproduce the paper exactly though, I would try using the v2.0 release. Operon's install script has a pinned repo version https://github.com/cavalab/srbench/blob/v2.0/experiment/methods/src/operon_install.sh .

Note that the docker file and the master branch install script for operon are both newer than the paper.

fc59283 commented 1 year ago

Hi @lacava, I get:

ldd /opt/conda/envs/srbench/lib/python3.7/site-packages/operon/pyoperon.cpython-37m-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffd24787000)
    libpython3.9.so.1.0 => not found
    liboperon.so.0 => /opt/conda/envs/srbench/lib/liboperon.so.0 (0x00007fcbe100e000)
    libfmt.so.7 => /opt/conda/envs/srbench/lib/libfmt.so.7 (0x00007fcbe0fdb000)
    libstdc++.so.6 => /opt/conda/envs/srbench/lib/libstdc++.so.6 (0x00007fcbe0e27000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcbe0cd8000)
    libgcc_s.so.1 => /opt/conda/envs/srbench/lib/libgcc_s.so.1 (0x00007fcbe0cbf000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcbe0acb000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcbe0aa8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fcbe11c9000)

foolnotion commented 1 year ago

It looks like the Python version in the conda env is 3.7, but pyoperon was linked against Python 3.9. This is very unusual; somehow the install env was corrupted.

You may be able to use patchelf or chrpath to fix the include path (you would probably need to change libpython3.9.so.1.0 to libpython3.7.so.1.0).

But it's probably much easier to just reinstall operon.
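The mismatch described here (a cpython-37m module that needs libpython3.9) can be spotted mechanically by comparing the ABI tag in the file name with the libpython names ldd lists. A minimal sketch, with hypothetical helper names, assuming the usual CPython naming scheme and a single-digit major version:

```python
import re


def module_python_version(module_filename):
    """Extract 'X.Y' from an ABI tag such as 'cpython-37m' or 'cpython-311'."""
    match = re.search(r"cpython-(\d)(\d+)", module_filename)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def linked_python_versions(library_names):
    """Extract 'X.Y' from names such as 'libpython3.9.so.1.0' or 'libpython3.7m.so'."""
    versions = set()
    for name in library_names:
        match = re.match(r"libpython(\d+\.\d+)", name)
        if match:
            versions.add(match.group(1))
    return versions


def has_python_mismatch(module_filename, library_names):
    """True if the module needs a libpython whose version differs from its ABI tag."""
    expected = module_python_version(module_filename)
    linked = linked_python_versions(library_names)
    return expected is not None and bool(linked) and linked != {expected}


if __name__ == "__main__":
    # Values taken from the ldd output posted above.
    print(has_python_mismatch(
        "pyoperon.cpython-37m-x86_64-linux-gnu.so",
        ["libpython3.9.so.1.0", "liboperon.so.0"],
    ))
```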

lacava commented 1 year ago

@foolnotion could it be this line? https://github.com/cavalab/srbench/blob/47da695292938d5e696ddcd4252f4034330ef787/experiment/methods/src/operon_install.sh#L3

foolnotion commented 1 year ago

Actually, building the docker image fails for me, so I can't debug.

Step 12/16 : COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/environment.yml
unable to convert uid/gid chown string to host mapping: can't find uid for user : no such user:

But as far as I remember, that line was pretty reliable in detecting the correct python version in the Conda env. Inside the docker image, the command pkg-config --modversion python3 should return 3.7. Another way to get the current python prefix is to run python -c "import sysconfig; print(sysconfig.get_config_var('prefix'))", like here
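For a slightly fuller picture than the one-liner above, the same sysconfig queries can be gathered in one place and compared against CONDA_PREFIX. This is a sketch only; the function names are invented here, and it assumes the CONDA_PREFIX variable is set by the active Conda environment.

```python
import os
import sys
import sysconfig


def interpreter_summary():
    """Collect what the running interpreter reports about itself."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        "prefix": sysconfig.get_config_var("prefix"),
        "site_packages": sysconfig.get_paths()["purelib"],
    }


def prefix_matches_env(prefix, conda_prefix):
    """True if the interpreter prefix is the active Conda environment."""
    if not prefix or not conda_prefix:
        return False
    return os.path.realpath(prefix) == os.path.realpath(conda_prefix)


if __name__ == "__main__":
    summary = interpreter_summary()
    for key, value in summary.items():
        print(f"{key}: {value}")
    print("matches CONDA_PREFIX:",
          prefix_matches_env(summary["prefix"], os.environ.get("CONDA_PREFIX")))
```

Inside the srbench environment one would expect the reported version to be 3.7 and the prefix to equal CONDA_PREFIX; anything else hints that a different interpreter is being picked up.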

fc59283 commented 1 year ago

Actually, building the docker image fails for me, so I can't debug.

Step 12/16 : COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/environment.yml
unable to convert uid/gid chown string to host mapping: can't find uid for user : no such user:

If I remember correctly, using this prefix on the documentation's docker build command solved that for me:

DOCKER_BUILDKIT=1

fc59283 commented 1 year ago

Speaking of common installation & setup hurdles, a humble suggestion to the Project:

Given the universality of Google Colab, what about also including instructions that are tested to run correctly there?

Thanks.

lacava commented 1 year ago

if you are running srbench on google colab and have a set of working instructions, we'd be happy to add them to the docs if you submit a PR. I don't use colab that much but I could see it being useful for others.

foolnotion commented 1 year ago

Thanks @lacava I was now able to build the docker image. The problem is with the srbench conda env.

Here is an excerpt of pyoperon's CMake generation phase:

-- The CXX compiler identification is GNU 11.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/conda/envs/srbench/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Ceres (missing: Ceres_DIR)
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found PythonInterp: /opt/conda/envs/srbench/bin/python (found version "3.7.12") 
-- Found PythonLibs: /opt/conda/envs/srbench/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found pybind11: /opt/conda/envs/srbench/include (found version "2.6.1" )
-- Found Python3: /opt/conda/bin/python3.9 (found version "3.9.10") found components: Development Interpreter Development.Module Development.Embed 
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /experiment/pyoperon/build

The offending element is here:

-- Found Python3: /opt/conda/bin/python3.9 (found version "3.9.10") found components: Development Interpreter Development.Module Development.Embed

This should of course be /opt/conda/envs/srbench/bin/python. But why does CMake detect the wrong python exe? The problem seems to be that the system environment is leaked inside the srbench environment, for instance this variable (and others)

CONDA_PYTHON_EXE=/opt/conda/bin/python
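The disagreement between the two "Found Python" lines in the CMake output can also be caught automatically, for example in an install script that saves the configure log. A rough sketch under that assumption; the regular expression only covers the two message formats shown in the log above, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   -- Found PythonInterp: /path/to/python (found version "3.7.12")
#   -- Found Python3: /path/to/python3.9 (found version "3.9.10") found components: ...
FOUND_PYTHON = re.compile(
    r'^-- Found (PythonInterp|Python3): (\S+) \(found version "(\d+)\.(\d+)'
)


def detected_pythons(cmake_log):
    """Map each CMake Python finder to the (path, 'X.Y') it reported."""
    found = {}
    for line in cmake_log.splitlines():
        match = FOUND_PYTHON.match(line.strip())
        if match:
            component, path, major, minor = match.groups()
            found[component] = (path, f"{major}.{minor}")
    return found


def versions_disagree(found):
    """True if the finders reported more than one distinct Python version."""
    return len({version for _, version in found.values()}) > 1


if __name__ == "__main__":
    log = (
        '-- Found PythonInterp: /opt/conda/envs/srbench/bin/python (found version "3.7.12")\n'
        '-- Found Python3: /opt/conda/bin/python3.9 (found version "3.9.10") found components: Development Interpreter\n'
    )
    found = detected_pythons(log)
    print(found)
    print("mismatch:", versions_disagree(found))
```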

I haven't figured out how to fix the env itself, but a quick solution is to use patchelf:

conda install -c conda-forge patchelf
patchelf --replace-needed libpython3.9.so.1.0 libpython3.7m.so /opt/conda/envs/srbench/lib/python3.7/site-packages/operon/pyoperon.cpython-37m-x86_64-linux-gnu.so

fc59283 commented 1 year ago

Thanks for that @foolnotion. Using patchelf does launch Operon, however, on every dataset (I tried) I'm getting:

[...]
Fitting 5 folds for each of 6 candidates, totalling 30 fits
operon warning: array does not satisfy contiguity or storage-order requirements. data will be copied.
Illegal instruction (core dumped)

Any ideas?

foolnotion commented 1 year ago

Illegal instruction (core dumped)

I've seen this before, this happens when the hardware running Operon does not support AVX2. In this case, I think the only option is to reinstall with -march=native or other appropriate compile flags.

I tried to reproduce inside my docker image, but after patchelf, it runs fine on my machine:

python analyze.py --local ./test/192_vineyard_small.tsv.gz -ml OperonRegressor -seed 42

Full output here: https://gist.github.com/foolnotion/3548524435498cf3c8d3fdc999868e22
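Whether a given machine advertises AVX2 can be checked before rebuilding anything. The sketch below assumes an x86 Linux host, where /proc/cpuinfo contains a "flags" line; the function names are made up for illustration.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required=("avx2",)):
    """List the required CPU features that are absent from `flags`."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # assumption: Linux on x86; other platforms differ
    if os.path.exists(path):
        with open(path) as handle:
            flags = cpu_flags(handle.read())
        print("missing:", missing_features(flags) or "none")
    else:
        print(path, "not found; cannot check CPU features this way")
```

If "avx2" is reported as missing, a binary compiled with AVX2 instructions is a plausible cause of the crash.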

lacava commented 1 year ago

is there anything we need to do in srbench to fix this?

foolnotion commented 1 year ago

@lacava For the original problem it seems the following change is needed:

diff --git a/experiment/methods/src/operon_install.sh b/experiment/methods/src/operon_install.sh
index 0a4f461..1fb822a 100755
--- a/experiment/methods/src/operon_install.sh
+++ b/experiment/methods/src/operon_install.sh
@@ -117,6 +117,7 @@ pushd pyoperon
 git checkout 1c6eccd3e3fa212ebf611170ca2dfc45714c81de
 mkdir build
 cmake -S . -B build \
+    -DPython3_EXECUTABLE=$(which python) \
     -DCMAKE_BUILD_TYPE=Release \
     -DCMAKE_INSTALL_PREFIX=${PYTHON_SITE}
 cmake --build build -j -t pyoperon_pyoperon

It's a very minor change but if you prefer a PR I can do that too.

For the second problem (illegal instruction) I don't think we need to change anything. It can be addressed individually by passing a CMAKE_CXX_FLAGS definition to pyoperon within operon_install.sh. But hardware supporting AVX2 has existed since 2013, so I wouldn't lower the default compile flags.

fnpdaml commented 1 year ago

Hi again @foolnotion,

In that case, I think it's best to pass "-march=native" globally so that all methods are natively compiled. However I haven't yet made complete sense of SRBench installation's inner chain of scripts - so:

1. What exactly should be modified, and where, to enable this? (still on x64)
2. Also, I have access to a Power7+ server with plenty of memory and cores (they're cheap second-hand :). Where could something like "-mcpu=power7" / "-mtune=power7" (or equivalent) be passed so SRBench works natively on that target?

Many thanks and regards.

foolnotion commented 1 year ago

SRBench itself is just a benchmarking framework that hosts a number of methods which can be written in any programming language. Only the respective method authors can advise how to change compilation flags or whether or not their method is able to run on other architectures like PowerPC

1. It depends on the method itself. For Operon you can modify https://github.com/cavalab/srbench/blob/7377143f6545e3a906a25d9aa045c81b9d581ec6/experiment/methods/src/operon_install.sh#L102 and add something like -DCMAKE_CXX_FLAGS="-march=native".
2. Again, it depends on the specific method. In theory, building a conda env on that server would install all the platform-specific toolchains, which could then be used to compile the methods. Operon should work, but ultimately it depends on the underlying libraries like Eigen or EVE (https://github.com/jfalcou/eve#current-roster-of-supported-instructions-sets).

fnpdaml commented 1 year ago

Hi again @foolnotion,

That's exactly what I had already done! But it doesn't seem to propagate to the used executable.

In fact, I added it to every cmake call in "operon_install.sh":

#!/bin/bash

PYTHON_SITE=${CONDA_PREFIX}/lib/python`pkg-config --modversion python3`/site-packages

## aria-csv
git clone  https://github.com/AriaFallah/csv-parser csv-parser
mkdir -p ${CONDA_PREFIX}/include/aria-csv
pushd csv-parser
git checkout 544c764d0585c61d4c3bd3a023a825f3d7de1f31
cp parser.hpp ${CONDA_PREFIX}/include/aria-csv/parser.hpp
popd
rm -rf csv-parser

## vectorclass
git clone  https://github.com/vectorclass/version2.git vectorclass
mkdir -p ${CONDA_PREFIX}/include/vectorclass
pushd vectorclass
git checkout fee0601edd3c99845f4b7eeb697cff0385c686cb
cp *.h ${CONDA_PREFIX}/include/vectorclass/
popd
rm -rf vectorclass
cat > ${CONDA_PREFIX}/lib/pkgconfig/vectorclass.pc << EOF
prefix=${CONDA_PREFIX}/include/vectorclass
includedir=${CONDA_PREFIX}/include/vectorclass

Name: Vectorclass
Description: C++ class library for using the Single Instruction Multiple Data (SIMD) instructions to improve performance on modern microprocessors with the x86 or x86/64 instruction set.
Version: 2.01.04
Cflags: -I${CONDA_PREFIX}/include/vectorclass
EOF

## vstat
git clone  https://github.com/heal-research/vstat.git
pushd vstat
git checkout 9b48f0d021ec66df122be352ea928b6ceb4bca54
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
cmake --install build
popd
rm -rf vstat

## pratt-parser
git clone  https://github.com/foolnotion/pratt-parser-calculator.git
pushd pratt-parser-calculator
git checkout a15528b1a9acfe6adefeb41334bce43bdb8d578c
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
cmake --install build
popd
rm -rf pratt-parser-calculator

## fast-float
git clone  https://github.com/fastfloat/fast_float.git
pushd fast_float
git checkout 32d21dcecb404514f94fb58660b8029a4673c2c1
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DFASTFLOAT_TEST=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
cmake --install build
popd
rm -rf fast_float

## span-lite
git clone  https://github.com/martinmoene/span-lite.git
pushd span-lite
git checkout 8f7935ff4e502ee023990d356d6578b8293eda74
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DSPAN_LITE_OPT_BUILD_TESTS=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
cmake --install build
popd
rm -rf span-lite

## robin_hood
git clone  https://github.com/martinus/robin-hood-hashing.git
pushd robin-hood-hashing
git checkout 9145f963d80d6a02f0f96a47758050a89184a3ed
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DRH_STANDALONE_PROJECT=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
cmake --install build
popd
rm -rf robin-hood-hashing

## operon
git clone  https://github.com/heal-research/operon.git
pushd operon
git checkout d26dd0dcf16acb750da330b5112c63f2528af9a8
mkdir build
cmake -S . -B build \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=OFF \
    -DBUILD_SHARED_LIBS=ON \
    -DBUILD_CLI_PROGRAMS=OFF \
    -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} \
    -DCMAKE_PREFIX_PATH=${CONDA_PREFIX}/lib64/cmake
cmake --build build -j -t operon_operon
cmake --install build
popd
rm -rf operon

## pyoperon
git clone  https://github.com/heal-research/pyoperon.git
pushd pyoperon
git checkout 1c6eccd3e3fa212ebf611170ca2dfc45714c81de
mkdir build
cmake -S . -B build \
    -DPython3_EXECUTABLE=$(which python) \
    -DCMAKE_CXX_FLAGS="-march=native" \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=${PYTHON_SITE}
cmake --build build -j -t pyoperon_pyoperon
cmake --install build
popd
rm -rf pyoperon

I tried both running "operon_install.sh" directly and bash install.sh at the root of the repo. Both approaches seem to indeed compile something till the end, but when running, still the "Illegal instruction (core dumped)" error is returned.

Any ideas?
Also, after running "install.sh", where exactly are all methods executables stored?

Many thanks.

foolnotion commented 1 year ago

You are not compiling an executable but a shared library (python module). Dependencies are installed in $CONDA_PREFIX/lib (library files) and $CONDA_PREFIX/include (header files). The python module is installed in $CONDA_PREFIX/lib/python3.11/site-packages (replace 3.11 with the local python version).

I am not sure why you get the illegal instruction, there can be many reasons (even faulty hardware). You could start with a generic config (e.g. -march=x86-64) and check if it still crashes, then go up from there.
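When trying one set of compile flags after another, it helps to probe the import in a child process so that a crash does not take the calling script down with it. A minimal sketch, assuming a POSIX system where a child killed by SIGILL reports a negative return code; the probe function is hypothetical, not part of SRBench or Operon.

```python
import signal
import subprocess
import sys


def probe(code, timeout=60):
    """Run `code` in a fresh interpreter and classify how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode == 0:
        return "ok"
    # On POSIX, a child terminated by signal N has return code -N.
    if result.returncode == -signal.SIGILL:
        return "illegal instruction"
    return f"failed (exit status {result.returncode})"


if __name__ == "__main__":
    print(probe("import operon; print(operon.Version())"))
```

Running this after each rebuild gives a quick answer on whether the import alone still crashes, separately from any crash that only appears during fitting.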

You can check the revision and timestamp of the operon module by doing:

$ python -c "import operon; print(operon.Version())"
operon rev. d26dd0d Release Linux-6.2.8 x86_64, timestamp 2023-03-31T14:37:31Z
single-precision build using eigen 3.3.9, ceres n/a, taskflow 3.3.0

This will show you whether it's using the most recently compiled one.

fnpdaml commented 1 year ago

Passing -march=x86-64 seems to make no difference - when running python -c "import operon; print(operon.Version())" I also get Illegal instruction (core dumped)

However, on my AVX2 laptop, I do get a fresh timestamp when I run ./operon_install.sh. So I compiled on my laptop for both x86-64 and Ivy Bridge (my server's architecture) and copied the "pyoperon.cpython-37m-x86_64-linux-gnu.so" file to the Ivy Bridge server. Now, on the server, python -c "import operon; print(operon.Version())" runs correctly. But when I try to execute my test with python evaluate_model.py /pmlb/datasets/eq1/eq1.tsv -ml OperonRegressor -seed 42 -sym_data, I continue to get "Illegal instruction (core dumped)".

Any further ideas?

Many thanks.

foolnotion commented 1 year ago

Last-ditch effort: try erasing any mention of avx2 in the CMakeLists.txt files:

diff --git a/experiment/methods/src/operon_install.sh b/experiment/methods/src/operon_install.sh
index ff87c22..3dd3d05 100755
--- a/experiment/methods/src/operon_install.sh
+++ b/experiment/methods/src/operon_install.sh
@@ -98,6 +98,7 @@ rm -rf robin-hood-hashing
 git clone  https://github.com/heal-research/operon.git
 pushd operon
 git checkout d26dd0dcf16acb750da330b5112c63f2528af9a8
+sed -i 's/;-mavx2;-mfma//g' CMakeLists.txt
 mkdir build
 cmake -S . -B build \
     -DCMAKE_BUILD_TYPE=Release \
@@ -115,6 +116,7 @@ rm -rf operon
 git clone  https://github.com/heal-research/pyoperon.git
 pushd pyoperon
 git checkout 1c6eccd3e3fa212ebf611170ca2dfc45714c81de
+sed -i 's/;-mavx2;-mfma//g' CMakeLists.txt
 mkdir build
 cmake -S . -B build \
     -DPython3_EXECUTABLE=$(which python) \
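For readers less familiar with sed, the substitution in the diff above simply deletes every literal occurrence of ";-mavx2;-mfma" from CMakeLists.txt. A Python equivalent might look like the sketch below; the function names are invented, and the sample flag line in the demo is a hypothetical illustration, not the actual content of Operon's CMakeLists.txt.

```python
from pathlib import Path


def strip_simd_flags(text, flags=";-mavx2;-mfma"):
    """Delete every literal occurrence of `flags`, like sed 's/;-mavx2;-mfma//g'."""
    return text.replace(flags, "")


def patch_file(path, flags=";-mavx2;-mfma"):
    """Rewrite `path` in place and return how many occurrences were removed."""
    path = Path(path)
    original = path.read_text()
    patched = strip_simd_flags(original, flags)
    if patched != original:
        path.write_text(patched)
    return original.count(flags)


if __name__ == "__main__":
    # Hypothetical flag list, for illustration only.
    sample = 'target_compile_options(demo PRIVATE "-fno-math-errno;-mavx2;-mfma")'
    print(strip_simd_flags(sample))
```

A return value of 0 from patch_file means the file did not contain the flags, which is a quick way to see which checked-out repositories the substitution actually touches.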

fnpdaml commented 1 year ago

At first, adding just those two lines didn't work, but putting that sed command of yours after every occurrence of "git checkout" solves it.

Big thanks @foolnotion! Really appreciated it.

foolnotion commented 1 year ago

It's a horrible solution, and normally I'd just recommend updating to a newer pyoperon (easier to install too: pip install pyoperon), but I'm happy it works.

fnpdaml commented 1 year ago

Note that doing pip install pyoperon and then running python -c "import pyoperon; print(pyoperon.Version())" also returns Illegal instruction (core dumped).

(whereas this "seded" python -c "import operon; print(operon.Version())" appears to work fine)

cavalab / srbench

Operon fails to launch from "evaluate_model.py" on Docker image #140