CIROH-UA / NGIAB-HPCInfra

NextGen In A Box: NextGen Generation Water Modeling Framework for Community Release (Singularity version)

Getting a NumPy not found error during run #12

Open benlee0423 opened 5 months ago

benlee0423 commented 5 months ago

Command to run the image

singularity run --bind /home/ubuntu/workspace/AWI_09_004:/ngen/ngen/data ciroh-ngen-singularity.sif "/ngen/ngen/data auto"

Command to run inside running image

mpirun --allow-run-as-root -n 2 /dmod/bin/ngen-parallel ./config/datastream.gpkg all ./config/datastream.gpkg all ./config/realization.json ./partitions_2.json 

Error message

Running NextGen model framework in parallel mode
Found paritions file! ./partitions_2.json
NGen Framework 0.1.0
NGen Framework 0.1.0
terminate called after throwing an instance of 'pybind11::error_already_set'
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ModuleNotFoundError: No module named 'numpy'
  what():  ModuleNotFoundError: No module named 'numpy'

Some path variables inside running image

Singularity> module show mpi
-------------------------------------------------------------------------------------------------------------------------------------------------------------
   /usr/share/modulefiles/mpi/openmpi-x86_64:
-------------------------------------------------------------------------------------------------------------------------------------------------------------
conflict("mpi")
prepend_path("PATH","/usr/lib64/openmpi/bin")
prepend_path("LD_LIBRARY_PATH","/usr/lib64/openmpi/lib")
prepend_path("PKG_CONFIG_PATH","/usr/lib64/openmpi/lib/pkgconfig")
prepend_path("MANPATH",":/usr/share/man/openmpi-x86_64")
setenv("MPI_BIN","/usr/lib64/openmpi/bin")
setenv("MPI_SYSCONFIG","/etc/openmpi-x86_64")
setenv("MPI_FORTRAN_MOD_DIR","/usr/lib64/gfortran/modules/openmpi")
setenv("MPI_INCLUDE","/usr/include/openmpi-x86_64")
setenv("MPI_LIB","/usr/lib64/openmpi/lib")
setenv("MPI_MAN","/usr/share/man/openmpi-x86_64")
setenv("MPI_PYTHON3_SITEARCH","/usr/lib64/python3.9/site-packages/openmpi")
setenv("MPI_COMPILER","openmpi-x86_64")
setenv("MPI_SUFFIX","_openmpi")
setenv("MPI_HOME","/usr/lib64/openmpi")
Singularity> cat /usr/share/modulefiles/mpi/openmpi-x86_64
#%Module 1.0
#
#  OpenMPI module for use with 'environment-modules' package:
#
conflict        mpi
prepend-path        PATH        /usr/lib64/openmpi/bin
prepend-path        LD_LIBRARY_PATH /usr/lib64/openmpi/lib
prepend-path        PKG_CONFIG_PATH /usr/lib64/openmpi/lib/pkgconfig
prepend-path        MANPATH     :/usr/share/man/openmpi-x86_64
setenv          MPI_BIN     /usr/lib64/openmpi/bin
setenv          MPI_SYSCONFIG   /etc/openmpi-x86_64
setenv          MPI_FORTRAN_MOD_DIR /usr/lib64/gfortran/modules/openmpi
setenv          MPI_INCLUDE /usr/include/openmpi-x86_64
setenv          MPI_LIB     /usr/lib64/openmpi/lib
setenv          MPI_MAN     /usr/share/man/openmpi-x86_64
setenv          MPI_PYTHON3_SITEARCH    /usr/lib64/python3.9/site-packages/openmpi
setenv          MPI_COMPILER    openmpi-x86_64
setenv          MPI_SUFFIX  _openmpi
setenv          MPI_HOME    /usr/lib64/openmpi

Linux:

$ uname -a
Linux ip-172-31-65-149 6.5.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Feb 15 15:27:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

benlee0423 commented 5 months ago

This appears to be due to the following path set in the MPI module file, which does not exist in the image. I am not sure how to fix it.

Singularity> ls /usr/lib64/python3.9/site-packages/openmpi
ls: cannot access '/usr/lib64/python3.9/site-packages/openmpi': No such file or directory
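
The same check can be done from Python inside the container. This is a minimal sketch, assuming the MPI module has been loaded so that `MPI_PYTHON3_SITEARCH` is exported; `check_env_dir` is a hypothetical helper name used only for illustration, not part of ngen or OpenMPI. It reports whether the variable is set and whether the directory it names exists, nothing more.

```python
import os


def check_env_dir(var_name, environ=None):
    """Return (path, exists) for a directory named by an environment variable.

    `path` is None when the variable is unset.
    """
    env = os.environ if environ is None else environ
    path = env.get(var_name)
    if path is None:
        return None, False
    return path, os.path.isdir(path)


if __name__ == "__main__":
    # Variable name taken from the `module show mpi` output above.
    path, exists = check_env_dir("MPI_PYTHON3_SITEARCH")
    print(f"MPI_PYTHON3_SITEARCH={path!r} exists={exists}")
```

A result of `exists=False` only confirms what `ls` already showed; it does not by itself establish that this path is the cause of the numpy error.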

benlee0423 commented 5 months ago

-- NGen version: 0.1.0
-- Build configuration summary:
--   Generator: Unix Makefiles
--   Build type:
--   System: Linux
--   C Compiler: /usr/bin/cc
--   C Flags:
--   CXX Compiler: /usr/bin/c++
--   CXX Flags:
--   Flags:
--     NGEN_WITH_MPI: OFF
--     NGEN_WITH_NETCDF: ON
--     NGEN_WITH_SQLITE: ON
--     NGEN_WITH_UDUNITS: ON
--     NGEN_WITH_BMI_FORTRAN: ON
--     NGEN_WITH_BMI_C: ON
--     NGEN_WITH_PYTHON: ON
--     NGEN_WITH_ROUTING: ON
--     NGEN_WITH_TESTS: ON
--     NGEN_WITH_COVERAGE: OFF
--     NGEN_QUIET: ON
--   Extern Models:
--     NGEN_WITH_EXTERN_ALL: OFF
--     NGEN_WITH_EXTERN_SLOTH: ON
--     NGEN_WITH_EXTERN_TOPMODEL: ON
--     NGEN_WITH_EXTERN_CFE: ON
--     NGEN_WITH_EXTERN_PET: ON
--     NGEN_WITH_EXTERN_NOAH_OWP_MODULAR: ON
-- Environment summary:
--   Boost:
--     Version: 1.79.0
--     Include: /usr/include
--   NetCDF:
--     Version: 4.8.1
--     Library: /usr/lib64/libnetcdf.so
--     Library (CXX): /usr/local/lib64/libnetcdf-cxx4.so
--     Include: /usr/include
--     Include (CXX): /usr/local/include
--     Parallel: FALSE
--   SQLite:
--     Version: 3.34.1
--     Library: /usr/lib64/libsqlite3.so
--     Include: /usr/include
--   UDUNITS2:
--     Library: /usr/lib64/libudunits2.so
--     Include: /usr/include/udunits2
--   Fortran:
--     BMI_FORTRAN_ISO_C_LIB_PATH:
--     BMI_FORTRAN_ISO_C_LIB_NAME: OFF
--     BMI_FORTRAN_ISO_C_LIB_DIR: OFF
--   Python:
--     Version: 3.9.18
--     Virtual Env:
--     Executable: /usr/bin/python3.9
--     Interpreter Type: Python
--     Site Library: /usr/lib/python3.9/site-packages
--     Include: /usr/include/python3.9
--     Runtime Library: /usr/lib64
--     NumPy Version: 1.26.4
--     NumPy Include: /usr/local/lib64/python3.9/site-packages/numpy/core/include
--     pybind11 Version:
--     pybind11 Include: /ngen/extern/pybind11/include


-- Configuring done

benlee0423 commented 5 months ago

This looks similar to the issue raised by Trupesh: https://github.com/NOAA-OWP/ngen/issues/655

hellkite500 commented 5 months ago

Can you run

ldd /usr/bin/python3.9

in the container runtime?

benlee0423 commented 5 months ago

Singularity> ldd /usr/bin/python3.9
    linux-vdso.so.1 (0x00007ffc11394000)
    libpython3.9.so.1.0 => /lib64/libpython3.9.so.1.0 (0x00007f71c2c62000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f71c2a59000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f71c297e000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f71c2fd4000)

hellkite500 commented 5 months ago

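So this isn't the same issue as the one referenced above (noaa-owp/ngen#655), which is caused by the Python interpreter being statically linked. The ldd output shows libpython3.9.so.1.0 as a dynamic dependency, so the interpreter here is dynamically linked.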
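
Roughly the same information is available from Python's own build configuration. The sketch below is an illustration only, assuming a CPython build on Linux; `interpreter_link_info` is a made-up helper name. It reads standard `sysconfig` variables and does not replace the `ldd` check.

```python
import sysconfig


def interpreter_link_info():
    """Report whether this CPython was built with a shared libpython."""
    return {
        # Truthy when CPython was configured with --enable-shared.
        "shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # File name of the Python library produced by the build.
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        # Directory where that library is expected to be installed.
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    for key, value in interpreter_link_info().items():
        print(f"{key}: {value}")
```

If `shared` is true and `ldlibrary` names a `.so` file, that is consistent with the dynamically linked interpreter seen in the `ldd` output.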

Best guess is a path problem. This looks suspicious:

-- NumPy Include: /usr/local/lib64/python3.9/site-packages/numpy/core/include

It looks like numpy is installed/found in

/usr/local/

Whereas the python path is

-- Site Library: /usr/lib/python3.9/site-packages

Can you simply open a python interpreter in the container and import numpy?
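
Beyond a bare import, it helps to print where numpy is actually resolved from and which directories are searched. The sketch below uses only the standard library; `locate_module` is a hypothetical helper name, not part of ngen. It reports what the command-line `python` sees. The interpreter embedded in ngen through pybind11 may end up with a different `sys.path`, so a successful import here does not by itself rule out a path problem.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("executable:", sys.executable)
    print("numpy found at:", locate_module("numpy"))
    print("sys.path:")
    for entry in sys.path:
        print("   ", entry)
```

Comparing the reported numpy location against the `Site Library` and `NumPy Include` paths from the CMake summary shows whether the two installs line up.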

benlee0423 commented 5 months ago

I am able to import numpy in Python.

Singularity> python
Python 3.9.18 (main, Jan  4 2024, 00:00:00) 
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> array = numpy.array([1,2,3,4,5])
>>> print(array)
[1 2 3 4 5]

numpy is also present in the location reported by CMake.

Singularity> ls /usr/local/lib64/python3.9/site-packages/numpy/core/include
numpy

Python location:

Singularity> whereis python
python: /usr/bin/python

There is no numpy in /usr/lib/python3.9/site-packages:

Singularity> ls /usr/lib/python3.9/site-packages
__pycache__      distutils-precedence.pth  mockbuild                 pip-21.2.3.dist-info       python_dateutil-2.8.1-py3.9.egg-info  six.py
_distutils_hack  dnf                       packaging                 pkg_resources              setuptools
asciidocapi.py   dnf-plugins               packaging-20.9.dist-info  pyparsing-2.4.7.dist-info  setuptools-53.0.0.dist-info
dateutil         dnfpluginscore            pip                       pyparsing.py               six-1.15.0.dist-info
Singularity> ls -l /usr/bin/python
lrwxrwxrwx 1 root root 16 Mar 22 02:17 /usr/bin/python -> /usr/bin/python3

hellkite500 commented 5 months ago

Have you tried using a virtual environment for building and running ngen?

benlee0423 commented 5 months ago

No virtual environment is used.
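
For reference, whether a given interpreter is running inside a virtual environment can be confirmed from Python itself. This is a minimal sketch, assuming a standard `venv`-style environment; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("prefix:     ", sys.prefix)
    print("base_prefix:", sys.base_prefix)
    print("virtualenv: ", in_virtualenv())
```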

hellkite500 commented 5 months ago

What ngen commit are you building? A pybind update was merged a couple days ago.

noaa-owp/ngen#755

benlee0423 commented 5 months ago

I just built the image with the ngen master branch and am getting the same error.

benlee0423 commented 5 months ago

I am getting the same error in the Docker build as well.

#17 48.05 terminate called after throwing an instance of 'pybind11::error_already_set'
#17 48.05   what():  ModuleNotFoundError: No module named 'numpy'
#17 48.18 [ 64%] Built target test_geojson
#17 48.18 CMake Error at /usr/local/lib64/python3.9/site-packages/cmake/data/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:112 (message):
#17 48.18   Error running test executable.
#17 48.18 
#17 48.18     Path: '/ngen/ngen/cmake_build_serial/test/test_routing_pybind'
#17 48.18     Result: Subprocess aborted
#17 48.18     Output:

benlee0423 commented 5 months ago

Build and run are successful with commit f91e2ea of the ngen repo.