Closed foolnotion closed 1 year ago
Hello,
The code looks reasonable. I cannot figure it out without debugging.
I tried to build your branch with nix following the instructions from readme, but it failed. I will try again later on. So the crash happens when running some unit test? How can I reproduce it more specifically?
Thanks a lot for having a look. Sorry the README is a bit out of date. If you have nix it should be pretty straightforward.
In order to consume libnano with nix, I added the following flake to the libnano root folder (it needs to be added to the repo with git add):
{
  description = "libnano dev";
  inputs.flake-utils.url = "github:numtide/flake-utils";
  inputs.nixpkgs.url = "github:nixos/nixpkgs";
  outputs = { self, flake-utils, nixpkgs }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = import nixpkgs {
          inherit system;
        };
        stdenv_ = pkgs.llvmPackages_16.stdenv;
      in rec {
        packages.default = stdenv_.mkDerivation {
          name = "nano";
          src = self;
          dontStrip = true;
          cmakeFlags = [ "-DCMAKE_BUILD_TYPE=Debug" ];
          nativeBuildInputs = with pkgs; [ cmake git ];
          buildInputs = with pkgs; [ eigen ];
        };
        devShells.default = stdenv_.mkDerivation {
          name = "nano-dev";
          hardeningDisable = [ "all" ];
          dontStrip = true;
          impureUseNativeOptimizations = true;
          nativeBuildInputs = with pkgs; [
            cmake
            clang_16
            clang-tools_16
            cppcheck
            gdb
            git
          ];
          buildInputs = packages.default.buildInputs;
        };
      });
}
I added the location of libnano from above (with that flake inside) to operon's flake (you have to change that to your local path) at line 10: https://github.com/heal-research/operon/blob/operon-mdl/flake.nix#L10
So if you clone the operon-mdl branch, make the change in the flake at line 10, and then run nix develop, you should get a dev env with everything included.
In that dev env, I am using the following commands:
cmake -S . -B build -Doperon_DEVELOPER_MODE=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-march=x86-64-v3 -fno-math-errno -g" -DUSE_SINGLE_PRECISION=ON -DUSE_CERES=OFF -DUSE_OPENLIBM=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_CLI_PROGRAMS=ON
Then I build one of the cli tools: cmake --build build -t operon_nsgp -j
For testing I am running the following command:
./build/cli/operon_nsgp --dataset ./data/Poly-10.csv --target Y --train 0:250 --iterations 5 --generations 5 --threads 1
If you update the sources of libnano, then the nano input of the Operon flake needs to be updated (in Operon's folder):
nix flake lock --update-input nano
Then you need to exit the dev env and run nix develop again (which will build the updated libnano), then delete the build folder and rebuild Operon.
Best, Bogdan
Note that I also tried to consume libnano directly with CMake add_subdirectory but had problems figuring out the exposed targets - that's why I resorted to the approach above.
Thanks for the details!
It compiles now and I've managed to reproduce the issue with gdb. I will let you know of the outcome...
The address sanitizer pointed out a free of a memory address that was not returned by malloc in the destructor of solver_state_t, while the thread sanitizer and gdb show nonsensical outputs. That destructor has been solid across 1000s of builds on 3 major platforms, btw.
So I suspect maybe linking with two different versions of some dependency, or Eigen being set up differently in libnano and operon...
FYI, compiling without -march=x86-64-v3 works just fine even with sanitizers turned on. Not sure why, still investigating...
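For anyone wanting to check what that flag changes: with GCC and Clang, -march=x86-64-v3 enables AVX and AVX2, so the compiler predefines __AVX__. A minimal sketch for comparing two builds follows. The function name simd_alignment_bytes is hypothetical, and the 64/32/16 mapping is an assumption that mirrors the convention Eigen is generally understood to use for its maximum alignment, not code taken from either project.

```cpp
#include <cstddef>

// Hypothetical helper (not part of libnano, Operon or Eigen).
// Assumption: 64-byte alignment when AVX-512 is enabled, 32-byte when AVX
// is enabled, and 16-byte otherwise.
constexpr std::size_t simd_alignment_bytes() {
#if defined(__AVX512F__)
    return 64;
#elif defined(__AVX__)
    return 32;
#else
    return 16;
#endif
}
```

With GCC or Clang this would be expected to evaluate to 32 under -march=x86-64-v3 and to 16 with default x86-64 flags. Printing it from one translation unit in each project is a cheap way to spot a mismatch.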
Hi, good catch! Indeed, taking a closer look at the gdb output, it seems that the crash occurs in Eigen's aligned_free, which kind of makes sense since different cflags might lead to different alignment (for example, without AVX it should be 16-byte and with AVX it should be 32-byte). The solution seems to be using the same cflags everywhere. Not really the ideal case for a library, but I can work around that. Thanks again for taking a look at this; from my side this issue is resolved.
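To illustrate why an alignment mismatch can show up as a free of a non-malloc address, here is a minimal sketch of the over-allocate-and-stash technique that aligned allocators commonly use. The names sketch_aligned_malloc and sketch_aligned_free are hypothetical, and this is not Eigen's actual code. It is only an assumption that something similar is at play here: one side allocates with plain malloc (16-byte alignment being enough), while the other side frees through a routine like sketch_aligned_free because it was compiled for 32-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: over-allocate, round up to `alignment`, and store the
// pointer returned by malloc in the slot just before the aligned address.
inline void* sketch_aligned_malloc(std::size_t size, std::size_t alignment) {
    // The alignment must be a power of two and leave room for one pointer.
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
    void* original = std::malloc(size + alignment);
    if (original == nullptr) {
        return nullptr;
    }
    const auto addr = reinterpret_cast<std::uintptr_t>(original);
    const auto mask = static_cast<std::uintptr_t>(alignment) - 1;
    // Next multiple of `alignment` strictly above `addr`.
    void* aligned = reinterpret_cast<void*>((addr & ~mask) + alignment);
    *(reinterpret_cast<void**>(aligned) - 1) = original;
    return aligned;
}

// Frees the pointer stashed before the aligned address. Calling this on
// memory that came straight from malloc reads an arbitrary value from that
// slot and passes it to free, which is undefined behavior.
inline void sketch_aligned_free(void* aligned) {
    if (aligned != nullptr) {
        std::free(*(reinterpret_cast<void**>(aligned) - 1));
    }
}
```

Used as a matched pair, the two functions are safe. The failure mode only appears when allocation and deallocation are compiled with different assumptions about which scheme is in use, which is why using the same flags everywhere avoids it.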
Hi,
I'm probably doing something wrong but I tried to adapt the example from https://github.com/accosmin-org/libnano/blob/master/example/src/minimize.cpp with a different cost function
definition: https://github.com/heal-research/operon/blob/operon-mdl/include/operon/optimizer/bfgs_cost_function.hpp
usage: https://github.com/heal-research/operon/blob/operon-mdl/include/operon/optimizer/optimizer.hpp#L194
I've tried to debug this and it seems the double free occurs somewhere inside this call: https://github.com/accosmin-org/libnano/blob/4bd7ad94958399a6fd4412716238206b7fc250c2/src/lsearchk/morethuente.cpp#L276
gdb backtrace
I suspect some of the solver state might become invalid, so I think some more robust state checking might be beneficial. Thanks in advance for any clues!