davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0
[Bug]: Compiling unit tests on aarch64 fails #2947

Closed. penguinpee closed this issue 2 months ago

penguinpee commented 3 months ago

What Operating System(s) are you seeing this problem on?

Linux (aarch64)

dlib version

19.24.4

Python version

3.12

Compiler

GCC 14

Expected Behavior

Unit tests should compile without failure on aarch64. The library compiles fine with exactly the same settings as used for compiling the unit tests. On x86_64 both library and unit tests successfully compile.

Current Behavior

Compilation fails with:

gmake[2]: Entering directory '/builddir/build/BUILD/dlib-19.24.4/dlib/test/redhat-linux-build'
[ 50%] Building CXX object examples/examples_build/CMakeFiles/logger_ex_2.dir/logger_ex_2.cpp.o
cd /builddir/build/BUILD/dlib-19.24.4/dlib/test/redhat-linux-build/examples/examples_build && /usr/bin/g++  -I/builddir/build/BUILD/dlib-19.24.4/dlib/.. -I/usr/include/ffmpeg -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -DNDEBUG   -Wno-unused-but-set-variable -Wno-comment -Wno-unused-parameter -W -Wall -Wextra -Wpedantic -Werror -fdiagnostics-color=always -Wno-unused-function -Wno-strict-overflow -Wno-maybe-uninitialized -I/usr/include -DHWY_SHARED_DEFINE -I/usr/include/ffmpeg -DDLIB_JPEG_SUPPORT -DDLIB_USE_BLAS -DDLIB_USE_LAPACK -DDLIB_PNG_SUPPORT -DDLIB_WEBP_SUPPORT -DDLIB_JXL_SUPPORT -DDLIB_USE_FFMPEG -Wreturn-type -MD -MT examples/examples_build/CMakeFiles/logger_ex_2.dir/logger_ex_2.cpp.o -MF CMakeFiles/logger_ex_2.dir/logger_ex_2.cpp.o.d -o CMakeFiles/logger_ex_2.dir/logger_ex_2.cpp.o -c /builddir/build/BUILD/dlib-19.24.4/examples/logger_ex_2.cpp
In file included from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/../algs.h:122,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/matrix_exp.h:6,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/matrix.h:6,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix.h:6,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/optimization_search_strategies.h:8,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/optimization.h:9,
                 from /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization.h:6,
                 from /builddir/build/BUILD/dlib-19.24.4/examples/least_squares_ex.cpp:13:
In member function ‘dlib::memory_manager_stateless_kernel_1<double>::deallocate_array(double*)’,
    inlined from ‘dlib::row_major_layout::layout<double, 0l, 0l, dlib::memory_manager_stateless_kernel_1<char>, 5>::~layout()’ at /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/matrix_data_layout.h:475:42,
    inlined from ‘dlib::matrix<double, 0l, 0l, dlib::memory_manager_stateless_kernel_1<char>, dlib::row_major_layout>::~matrix()’ at /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/matrix.h:1013:11,
    inlined from ‘main’ at /builddir/build/BUILD/dlib-19.24.4/examples/least_squares_ex.cpp:94:49:
/builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/optimization/../matrix/../memory_manager_stateless/memory_manager_stateless_kernel_1.h:61:17: error: ‘operator delete[](void*)’ called on unallocated object ‘params’ [-Werror=free-nonheap-object]
   61 |                 delete [] item;
      |                 ^~~~~~~~~~~~~~
/builddir/build/BUILD/dlib-19.24.4/examples/least_squares_ex.cpp: In function ‘main’:
/builddir/build/BUILD/dlib-19.24.4/examples/least_squares_ex.cpp:94:32: note: declared here
   94 |         const parameter_vector params = 10*randm(3,1);
      |                                ^~~~~~
cc1plus: all warnings being treated as errors
gmake[2]: *** [examples/examples_build/CMakeFiles/least_squares_ex.dir/build.make:79: examples/examples_build/CMakeFiles/least_squares_ex.dir/least_squares_ex.cpp.o] Error 1
gmake[2]: Leaving directory '/builddir/build/BUILD/dlib-19.24.4/dlib/test/redhat-linux-build'
gmake[1]: *** [CMakeFiles/Makefile2:2104: examples/examples_build/CMakeFiles/least_squares_ex.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
In member function ‘allocate_array’,
    inlined from ‘set_max_size’ at /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/svm/../matrix/../array/array_kernel.h:438:59,
    inlined from ‘push_back.constprop’ at /builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/svm/../matrix/../array/array_kernel.h:769:30:
/builddir/build/BUILD/dlib-19.24.4/dlib/../dlib/svm/../memory_manager_stateless/memory_manager_stateless_kernel_1.h:54:24: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
   54 |                 return new T[size];
      |                        ^
/usr/include/c++/14/new: In member function ‘push_back.constprop’:
/usr/include/c++/14/new:133:26: note: in a call to allocation function ‘operator new []’ declared here
  133 | _GLIBCXX_NODISCARD void* operator new[](std::size_t) _GLIBCXX_THROW (std::bad_alloc)
      |                          ^
gmake[2]: Leaving directory '/builddir/build/BUILD/dlib-19.24.4/dlib/test/redhat-linux-build'
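
The two numbers in the `-Walloc-size-larger-than=` warning above are recognisable limits on a 64-bit target: 18446744073709551615 is the largest `std::size_t` value, and 9223372036854775807 is the largest `std::ptrdiff_t` value, which GCC uses as the default limit for this warning. The sketch below is not dlib code; `exceeds_max_object_size` and `wrapped_size` are hypothetical helpers. It shows that an unsigned size that wraps below zero lands on exactly the first number. Whether such a wrap is what GCC models in its constant-propagated clone of `push_back` is an assumption, not something the log confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper, not part of dlib or GCC: true when a requested
// size is larger than PTRDIFF_MAX, the default limit used by
// -Walloc-size-larger-than=.
constexpr bool exceeds_max_object_size(std::size_t bytes)
{
    return bytes > static_cast<std::size_t>(std::numeric_limits<std::ptrdiff_t>::max());
}

// Unsigned arithmetic wraps: 0 - 1 yields the largest size_t value.
constexpr std::size_t wrapped_size()
{
    return std::size_t{0} - 1;
}

// These hold on 64-bit targets such as aarch64, s390x and x86_64.
static_assert(sizeof(std::size_t) == 8, "sketch assumes a 64-bit size_t");
static_assert(wrapped_size() == 18446744073709551615ull, "matches the value in the warning");
static_assert(std::numeric_limits<std::ptrdiff_t>::max() == 9223372036854775807ll,
              "matches the maximum object size in the warning");
static_assert(exceeds_max_object_size(wrapped_size()), "wrapped size is over the limit");
```

If the value really is a wrapped size, the warning describes a path the compiler could not rule out rather than an allocation the program is known to perform.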

Steps to Reproduce

pushd dlib/test

CFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer '
export CFLAGS

CXXFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer '
export CXXFLAGS

LDFLAGS='-Wl,-z,relro -Wl,--as-needed   -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld-errors -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -Wl,--build-id=sha1 -specs=/usr/lib/rpm/redhat/redhat-package-notes '
export LDFLAGS

/usr/bin/cmake -S . -B redhat-linux-build -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON -DDLIB_WEBP_SUPPORT:BOOL=ON

Anything else?

Output from `cmake` (configuration):

```make
-- The C compiler identification is GNU 14.0.1
-- The CXX compiler identification is GNU 14.0.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Using CMake version: 3.28.3
-- Compiling dlib version: 19.24.4
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found X11: /usr/include   
-- Looking for XOpenDisplay in /usr/lib64/libX11.so
-- Looking for XOpenDisplay in /usr/lib64/libX11.so - found
-- Looking for gethostbyname
-- Looking for gethostbyname - found
-- Looking for connect
-- Looking for connect - found
-- Looking for remove
-- Looking for remove - found
-- Looking for shmat
-- Looking for shmat - found
-- Found system copy of libpng: /usr/lib64/libpng.so;/usr/lib64/libz.so
-- Found system copy of libjpeg: /usr/lib64/libjpeg.so
-- Found WebP: /usr/lib64/libwebp.so  
-- Searching for JPEG XL
-- Found PkgConfig: /usr/bin/pkg-config (found version "2.1.0") 
-- Checking for modules 'libjxl;libjxl_cms;libjxl_threads'
--   Found libjxl, version 0.10.2
--   Found libjxl_cms, version 0.10.2
--   Found libjxl_threads, version 0.10.2
-- Found libjxl via pkg-config in `/usr/lib64`
-- Searching for BLAS and LAPACK
-- Searching for BLAS and LAPACK
-- Checking for module 'cblas'
--   Found cblas, version 3.12.0
-- Checking for module 'lapack'
--   Found lapack, version 3.12.0
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Found BLAS and LAPACK via pkg-config
CMake Warning (dev) at /builddir/build/BUILD/dlib-19.24.4/dlib/CMakeLists.txt:652 (find_package):
  Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
  --help-policy CMP0146" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.
This warning is for project developers.  Use -Wno-dev to suppress it.
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "7.5")
-- Found CUDA, but CMake was unable to find the cuBLAS libraries that should be part of every basic CUDA install. Your CUDA install is somehow broken or incomplete. Since cuBLAS is required for dlib to use CUDA we won't use CUDA.
-- DID NOT FIND CUDA
-- Disabling CUDA support for dlib.  DLIB WILL NOT USE CUDA
-- Searching for FFMPEG/LIBAV
-- Checking for modules 'libavdevice;libavfilter;libavformat;libavcodec;libswresample;libswscale;libavutil'
--   Found libavdevice, version 60.3.100
--   Found libavfilter, version 9.12.100
--   Found libavformat, version 60.16.100
--   Found libavcodec, version 60.31.102
--   Found libswresample, version 4.12.100
--   Found libswscale, version 7.5.100
--   Found libavutil, version 58.29.100
-- Found FFMPEG/LIBAV via pkg-config in `/usr/lib64`
OpenCV not found, so we won't build the webcam_face_pose_ex example.
-- Configuring done (13.6s)
-- Generating done (1.2s)
CMake Warning:
  Manually-specified variables were not used by the project:
    CMAKE_C_FLAGS_RELEASE
    CMAKE_Fortran_FLAGS_RELEASE
    CMAKE_INSTALL_DO_STRIP
    INCLUDE_INSTALL_DIR
    LIB_INSTALL_DIR
    LIB_SUFFIX
    SHARE_INSTALL_PREFIX
    SYSCONF_INSTALL_DIR
-- Build files have been written to: /builddir/build/BUILD/dlib-19.24.4/dlib/test/redhat-linux-build
```

If the above snippets are insufficient, I can provide the full log or a link to it.

penguinpee commented 3 months ago

At pretty much the same point the compilation also fails on s390x.

davisking commented 2 months ago

Thanks. That's definitely a compiler bug: that code is not going to try to allocate some huge block of memory. I just pushed a CMake change to suppress the warning, so it should be good for you now. Let me know if it's not.
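
The change mentioned here was made in dlib's CMake files, and its contents are not shown in this thread. Purely as an illustration of the general idea, a false positive of this kind can also be silenced locally in source with GCC diagnostic pragmas. The sketch below is not the dlib fix: `toy_allocator` and `sum_of_squares` are hypothetical names, and the allocator only mirrors the `new T[size]` / `delete [] item` pair visible in the log.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a stateless allocator; not dlib's class.
template <typename T>
struct toy_allocator
{
    T* allocate_array(std::size_t size) { return new T[size]; }

    void deallocate_array(T* item)
    {
        // Silence -Wfree-nonheap-object for this one statement on GCC only.
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
#endif
        delete [] item;
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic pop
#endif
    }
};

// Allocate, fill, sum and release an array through the toy allocator.
inline double sum_of_squares(std::size_t n)
{
    toy_allocator<double> pool;
    double* data = pool.allocate_array(n);
    double total = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        data[i] = static_cast<double>(i);
        total += data[i] * data[i];
    }
    pool.deallocate_array(data);
    return total;
}
```

Whether a pragma like this is honoured once the call has been inlined under `-flto` can vary between compiler versions, which may be one reason to prefer a build-system switch instead.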

penguinpee commented 2 months ago

That would be b9355f0 and 87da2959?

davisking commented 2 months ago

Yep

penguinpee commented 2 months ago

I applied both commits on top of 19.24.4. That fixes the test compilation for aarch64 and s390x. However, on s390x I see three failing tests: test_active_learning, test_cca and test_correlation_tracker, and lots of "Parameter N to routine FOO was incorrect" messages in the log.

Is s390x supported / tested at all? I know we had it built before. But that was without running tests. Not sure about the reasons, though.

davisking commented 2 months ago

It ought to work anywhere. But I don't know if anyone is using it on s390x. Maybe the tests are just overly tight numerically? I can't say given the information I've got.
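
To make the "overly tight numerically" hypothesis concrete: floating-point results can differ in the last few bits between architectures, for example when fused multiply-add instructions are used on one target and not on another, so a check against a very small absolute tolerance may pass on x86_64 and fail elsewhere. The sketch below is a generic illustration with a hypothetical helper, `nearly_equal`. It is not taken from dlib's test suite and says nothing about why the three tests actually fail on s390x.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: accept a difference that is small either in
// absolute terms or relative to the magnitude of the operands.
inline bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

With values near 1000, a difference of about 1e-7 passes a relative tolerance of 1e-9 but fails an absolute tolerance of 1e-12, which is the kind of gap a tight test can trip over.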

penguinpee commented 2 months ago

You can find the full log of the first scratch build after applying the two commits fixing the build at https://kojipkgs.fedoraproject.org//work/tasks/1493/116961493/build.log

Since this is a scratch build it will be cleaned up after some time. But I can always trigger another build if needed.

After that I tried disabling the failing tests, but that resulted in more failing tests and segfaults. I can provide you with the logs of those as well. If you prefer tracking this issue separately, let me know and I will open another one.

davisking commented 2 months ago

Yeah IDK. I'm not going to be able to debug this. You should look into it and send us a PR if there really is a bug in dlib here :)

penguinpee commented 2 months ago

I understand. Should I find the time and the means for debugging and fixing the build issue, I will certainly provide a PR. For now we keep the s390x arch disabled on our end.