accosmin-org / libnano

C++ numerical optimization and machine learning utilities using Eigen3
MIT License

double free or corruption using lbfgs solver #77

Closed foolnotion closed 1 year ago

foolnotion commented 1 year ago

Hi,

I'm probably doing something wrong, but I tried to adapt the example from https://github.com/accosmin-org/libnano/blob/master/example/src/minimize.cpp with a different cost function.

I've tried to debug this and it seems the double free occurs somewhere inside this call: https://github.com/accosmin-org/libnano/blob/4bd7ad94958399a6fd4412716238206b7fc250c2/src/lsearchk/morethuente.cpp#L276

gdb backtrace

#0  0x00007ffff5aa1a8c in __pthread_kill_implementation () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#1  0x00007ffff5a52c86 in raise () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#2  0x00007ffff5a3c8ba in abort () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#3  0x00007ffff5a3d5f5 in __libc_message.cold () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#4  0x00007ffff5aab735 in malloc_printerr () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#5  0x00007ffff5aad7b0 in _int_free () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#6  0x00007ffff5aafe53 in free () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#7  0x0000555555748a08 in Eigen::internal::aligned_free (ptr=0x1239e) at /nix/store/w23kfp00x2lgwyqyjk23v6cn69zvsfsa-eigen-3.4.0/include/eigen3/Eigen/src/Core/util/Memory.h:203
#8  Eigen::internal::conditional_aligned_free<true> (ptr=0x1239e) at /nix/store/w23kfp00x2lgwyqyjk23v6cn69zvsfsa-eigen-3.4.0/include/eigen3/Eigen/src/Core/util/Memory.h:259
#9  Eigen::internal::conditional_aligned_delete_auto<double, true> (ptr=0x1239e, size=<optimized out>)
    at /nix/store/w23kfp00x2lgwyqyjk23v6cn69zvsfsa-eigen-3.4.0/include/eigen3/Eigen/src/Core/util/Memory.h:446
#10 Eigen::DenseStorage<double, -1, -1, 1, 0>::~DenseStorage (this=<optimized out>)
    at /nix/store/w23kfp00x2lgwyqyjk23v6cn69zvsfsa-eigen-3.4.0/include/eigen3/Eigen/src/Core/DenseStorage.h:621
#11 Eigen::PlainObjectBase<Eigen::Matrix<double, -1, 1, 0, -1, 1> >::~PlainObjectBase (this=<optimized out>)
    at /nix/store/w23kfp00x2lgwyqyjk23v6cn69zvsfsa-eigen-3.4.0/include/eigen3/Eigen/src/Core/PlainObjectBase.h:98
#12 nano::solver_state_t::~solver_state_t (this=0x7ffff55fe0a8) at /build/source/include/nano/solver/state.h:23
#13 nano::lsearchk_t::get (this=0x7ffff007ccb0, state=..., descent=..., step_size=1) at /build/source/src/lsearchk.cpp:66
#14 0x00005555556de52b in nano::lsearch_t::get (this=0x7ffff55fe398, state=..., descent=...) at /build/source/include/nano/solver/lsearch.h:34
#15 nano::solver_lbfgs_t::do_minimize (this=0x7ffff55fe710, function=..., x0=...) at /build/source/src/solver/lbfgs.cpp:91
#16 0x00005555556c6469 in nano::solver_t::minimize (this=0x7ffff55fe710, function=..., x0=...) at /build/source/src/solver.cpp:128
#17 0x0000555555691fe7 in Operon::LBFGSOptimizer<Operon::DispatchTable<float> >::Optimize (this=this@entry=0x7ffff55fe880, target=std::span of length 500 = {...}, range=..., iterations=5, 
    summary=...) at /home/bogdb/src/operon-mdl-fix/include/operon/optimizer/optimizer.hpp:213
#18 0x000055555568e7fc in Operon::Evaluator<Operon::DispatchTable<float> >::operator() (this=0x7ffffffeeee0, ind=..., buf=std::span of length 500 = {...})
    at /home/bogdb/src/operon-mdl-fix/source/operators/evaluator.cpp:221
#19 0x00005555556024cd in std::__invoke_impl<std::vector<float, std::allocator<float> >, Operon::EvaluatorBase const&, Operon::Random::RomuTrio&, Operon::Individual&, std::span<float, 18446744073709551615ul>&> (__f=..., __args=..., __args=..., __args=...) at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/invoke.h:61
#20 std::__invoke<Operon::EvaluatorBase const&, Operon::Random::RomuTrio&, Operon::Individual&, std::span<float, 18446744073709551615ul>&> (__fn=..., __args=..., __args=..., __args=...)
    at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/invoke.h:96
#21 std::reference_wrapper<Operon::EvaluatorBase const>::operator()<Operon::Random::RomuTrio&, Operon::Individual&, std::span<float, 18446744073709551615ul>&> (this=0x555555945eb0, 
    __args=..., __args=..., __args=...) at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/refwrap.h:359
#22 Operon::MultiEvaluator::operator() (this=0x7ffffffee208, rng=..., ind=..., buf=std::span of length 500 = {...})
    at /home/bogdb/src/operon-mdl-fix/include/operon/operators/evaluator.hpp:228
#23 0x000055555566ad26 in Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}::operator()(unsigned long) const (this=this@entry=0x7ffff0010d48, i=i@entry=0) at /home/bogdb/src/operon-mdl-fix/source/algorithms/nsga2.cpp:151
#24 0x000055555566a85b in tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}::operator()(tf::Runtime&) (
    this=<optimized out>, rt=...) at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/algorithm/for_each.hpp:98
#25 std::__invoke_impl<void, tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}&, tf::Runtime&>(std::__invoke_other, tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}&, tf::Runtime&) (__f=..., __args=...)
    at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/invoke.h:61
#26 std::__invoke_r<void, tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}&, tf::Runtime&>(tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}&, tf::Runtime&) (__fn=..., __args=...)
    at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/invoke.h:111
#27 std::_Function_handler<void (tf::Runtime&), tf::detail::make_for_each_index_task<unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner>(unsigned long, unsigned long, unsigned long, Operon::NSGA2::Run(tf::Executor&, Operon::Random::RomuTrio&, std::function<void ()>)::$_0::operator()(tf::Subflow&) const::{lambda(unsigned long)#2}, tf::GuidedPartitioner&&)::{lambda(tf::Runtime&)#1}>::_M_invoke(std::_Any_data const&, tf::Runtime&) (__functor=..., __args=...) at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/std_function.h:290
#28 0x0000555555604b8e in std::function<void (tf::Runtime&)>::operator()(tf::Runtime&) const (this=0x7ffff0000ec0, __args=...)
    at /nix/store/6vwsydq4nzr1l8j7fyg5r61nknwq6w60-gcc-12.3.0/include/c++/12.3.0/bits/std_function.h:591
#29 tf::Executor::_invoke_static_task (this=0x7ffffffef180, worker=..., node=0x7ffff0000e10)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1730
#30 tf::Executor::_invoke (this=this@entry=0x7ffffffef180, worker=..., node=0x7ffff0000e10)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1549
#31 0x0000555555608c9e in tf::Executor::_corun_until<tf::Executor::_consume_graph(tf::Worker&, tf::Node*, tf::Graph&)::{lambda()#1}>(tf::Worker&, tf::Executor::_consume_graph(tf::Worker&, tf::Node*, tf::Graph&)::{lambda()#1}&&) (this=this@entry=0x7ffffffef180, w=..., stop_predicate=...)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1255
#32 0x0000555555608ba5 in tf::Executor::_consume_graph (this=this@entry=0x7ffffffef180, w=..., p=p@entry=0x555555992a60, g=...)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1811
#33 0x0000555555604971 in tf::Executor::_invoke_dynamic_task (this=0x7ffffffef180, w=..., node=0x555555992a60)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1750
#34 tf::Executor::_invoke (this=this@entry=0x7ffffffef180, worker=..., node=0x555555992a60)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1555
#35 0x0000555555603e1e in tf::Executor::_exploit_task (this=0x7ffffffef180, w=..., t=@0x7ffff55fee38: 0x555555992a60)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1322
#36 tf::Executor::_spawn(unsigned long)::{lambda(tf::Worker&, std::mutex&, std::condition_variable&, unsigned long&)#1}::operator()(tf::Worker&, std::mutex&, std::condition_variable&, unsigned long&) const (this=<optimized out>, w=..., mutex=..., cond=..., n=@0x7ffffffedf48: 255)
    at /nix/store/9h7pynl03xd5id75v2qf10gd1kyn8jbi-taskflow-3.6.0/include/taskflow/core/executor.hpp:1215
#37 0x00005555557d2143 in execute_native_thread_routine ()
#38 0x00007ffff5a9fdd4 in start_thread () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#39 0x00007ffff5b219b0 in clone3 () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6

I suspect some of the solver state might become invalid, so more robust state checking might be beneficial. Thanks in advance for any clues!

accosmin commented 1 year ago

Hello,

The code looks reasonable. I cannot figure it out without debugging.

I tried to build your branch with nix following the instructions from the README, but it failed. I will try again later on. Does the crash happen when running a unit test? How exactly can I reproduce it?

foolnotion commented 1 year ago

Thanks a lot for having a look. Sorry, the README is a bit out of date. If you have nix, it should be pretty straightforward.

  1. In order to consume libnano with nix, I added the following flake to the libnano root folder (needs to be added to the repo with git add)

    {
    description = "libnano dev";
    
    inputs.flake-utils.url = "github:numtide/flake-utils";
    inputs.nixpkgs.url = "github:nixos/nixpkgs";
    
    outputs = { self, flake-utils, nixpkgs }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = import nixpkgs {
          inherit system;
        };
        stdenv_ = pkgs.llvmPackages_16.stdenv;
      in rec {
        packages.default = stdenv_.mkDerivation {
          name = "nano";
          src = self;
          dontStrip = true;
    
          cmakeFlags = [ "-DCMAKE_BUILD_TYPE=Debug" ];
    
          nativeBuildInputs = with pkgs; [ cmake git ];
          buildInputs = with pkgs; [ eigen ];
        };
    
        devShells.default = stdenv_.mkDerivation {
          name = "nano-dev";
          hardeningDisable = [ "all" ];
          dontStrip = true;
          impureUseNativeOptimizations = true;
          nativeBuildInputs = with pkgs; [
            cmake
            clang_16
            clang-tools_16
            cppcheck
            gdb
            git
          ];
          buildInputs = packages.default.buildInputs;
        };
      });
    }
  2. I added the location of libnano from above (with that flake inside) to operon's flake (you have to change that to your local path) at line 10: https://github.com/heal-research/operon/blob/operon-mdl/flake.nix#L10

So if you clone the operon-mdl branch, make the change in the flake at line 10, and then run nix develop, you should get a dev env with everything included.

In that dev env, I am using the following commands:

cmake -S . -B build -Doperon_DEVELOPER_MODE=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-march=x86-64-v3 -fno-math-errno -g" -DUSE_SINGLE_PRECISION=ON -DUSE_CERES=OFF -DUSE_OPENLIBM=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_CLI_PROGRAMS=ON

Then I build one of the cli tools: cmake --build build -t operon_nsgp -j

For testing I am running the following command:

./build/cli/operon_nsgp --dataset ./data/Poly-10.csv --target Y --train 0:250 --iterations 5 --generations 5 --threads 1

If you update the sources of libnano, then the nano input of the Operon flake needs to be updated (in Operon's folder):

nix flake lock --update-input nano

Then you need to exit the dev env and run nix develop again (which will build the updated libnano), then delete the build folder and rebuild Operon.

Best, Bogdan

foolnotion commented 1 year ago

Note that I also tried to consume libnano directly with CMake add_subdirectory but had problems figuring out the exposed targets - that's why I resorted to the approach above.

accosmin commented 1 year ago

Thanks for the details!

It compiles now and I've managed to reproduce the issue with gdb. I will let you know of the outcome...

accosmin commented 1 year ago

The address sanitizer reported a free of an address that was not returned by malloc, in the destructor of solver_state_t, while the thread sanitizer and gdb show nonsensical outputs. That destructor has been solid across 1000s of builds on 3 major platforms, btw.

So I suspect linking against two different versions of some dependency, or Eigen being set up differently in libnano and operon...

FYI, compiling without -march=x86-64-v3 works just fine even with sanitizers turned on. Not sure why, still investigating...
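One way to compare how two builds see the instruction set is to print the SIMD macros the compiler predefines for a translation unit. The sketch below is only an illustration: `simd_level` and `simd_register_bytes` are hypothetical helpers, not part of libnano, Operon or Eigen, and the macros are the ones GCC and Clang define (for example, `-march=x86-64-v3` turns on `__AVX__` and `__AVX2__`). Eigen picks its vectorization paths from macros of this kind, so a mismatch between the library build and the application build is worth ruling out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the widest SIMD instruction set enabled for
// this translation unit, based on GCC/Clang predefined macros.
inline std::string simd_level()
{
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "scalar";
#endif
}

// Hypothetical helper: width in bytes of the widest vector register implied
// by the macros above (0 if none of them is defined). Whether a given
// library aligns its buffers to exactly this width is an assumption that
// should be checked against that library's own configuration macros.
inline std::size_t simd_register_bytes()
{
#if defined(__AVX512F__)
    return 64;
#elif defined(__AVX__)
    return 32;
#elif defined(__SSE2__)
    return 16;
#else
    return 0;
#endif
}
```

Calling both helpers from a file compiled with the library's flags and from a file compiled with the application's flags, then comparing the results, would show whether the two builds disagree.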

foolnotion commented 1 year ago

Hi, good catch! Indeed, taking a closer look at the gdb output, the crash occurs in Eigen's aligned_free, which makes sense since different compiler flags can lead to different alignment (for example, 16-byte without AVX and 32-byte with AVX). The solution seems to be using the same flags everywhere. Not really ideal for a library, but I can work around that. Thanks again for taking a look at this; from my side this issue is resolved.
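To illustrate why mismatched alignment settings can end in a free of a non-malloc address, here is a minimal sketch of a common aligned-allocation technique: over-allocate, round the address up, and stash the original pointer just before the aligned block. The names `sketch_aligned_malloc` and `sketch_aligned_free` are hypothetical, and this is not Eigen's actual code; whether Eigen takes this path for a given build depends on its configuration and should be checked in its Memory.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: returns a block of `size` bytes aligned to
// `alignment`, remembering the pointer that malloc really returned.
inline void* sketch_aligned_malloc(std::size_t size, std::size_t alignment)
{
    // The alignment must be a power of two that can hold one pointer.
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);

    void* original = std::malloc(size + alignment);
    if (original == nullptr)
    {
        return nullptr;
    }

    // Round down to a multiple of `alignment`, then step one multiple up.
    // The result is always strictly past `original`, leaving room for the
    // stashed pointer.
    const auto raw     = reinterpret_cast<std::uintptr_t>(original);
    const auto aligned = (raw & ~static_cast<std::uintptr_t>(alignment - 1)) + alignment;

    // Store the original pointer in the slot right before the aligned block.
    *(reinterpret_cast<void**>(aligned) - 1) = original;
    return reinterpret_cast<void*>(aligned);
}

// Hypothetical sketch: frees a block obtained from sketch_aligned_malloc by
// reading back the stashed pointer. Passing a pointer that came from plain
// malloc would read unrelated bytes and hand them to free().
inline void sketch_aligned_free(void* ptr)
{
    if (ptr != nullptr)
    {
        std::free(*(reinterpret_cast<void**>(ptr) - 1));
    }
}
```

If one translation unit allocates with plain malloc (because it assumes 16-byte alignment is already guaranteed) and another frees with a routine like `sketch_aligned_free` (because it assumes 32-byte alignment), the free reads whatever bytes precede the buffer. That would be consistent with the bogus `ptr=0x1239e` in the backtrace, although the thread does not confirm this exact mechanism.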