QueensGambit / CrazyAra

A Deep Learning UCI-Chess Variant Engine written in C++ & Python :parrot:
https://lichess.org/@/CrazyAra
GNU General Public License v3.0

"Bus error (core dumped)" on 'go' command #34

Closed: thomas-daniels closed this issue 3 years ago

thomas-daniels commented 5 years ago

I finally managed to run the executable (the Linux GPU version; Ubuntu 19.04) without startup errors, and I believe I have put the models in the right place, but the UCI go command makes this happen:

Bus error (core dumped)

Many other UCI commands result in an error. For example, ucinewgame results in a segfault:

Segmentation fault: 11

Stack trace:
  [bt] (0) ./libmxnet.so(+0xe7d769) [0x7fca155d6769]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x43f60) [0x7fca14229f60]
  [bt] (2) ./CrazyAra(MCTSAgent::clear_game_history()+0xd) [0x560b4de2fa1d]
  [bt] (3) ./CrazyAra(CrazyAra::uci_loop(int, char**)+0x84b) [0x560b4de0d5ab]
  [bt] (4) ./CrazyAra(main+0x4b) [0x560b4dd1c7fb]
  [bt] (5) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fca1420cb6b]
  [bt] (6) ./CrazyAra(_start+0x2a) [0x560b4ddb00ea]

And isready returns a big error:

info string json file: model/model-1.19246-0.603-symbol.json
info string Loading the model from model/model-1.19246-0.603-symbol.json
[23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
info string Loading the model parameters from model/model-1.19246-0.603-0223.params
terminate called after throwing an instance of 'dmlc::Error'
  what():  [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/cpp-package/include/mxnet-cpp/operator.hpp:141: [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
Stack trace:
  [bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x56357cc34483]
  [bt] (1) ./libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x527) [0x7f4792341d87]
  [bt] (2) ./libmxnet.so(mxnet::CopyFromTo(mxnet::NDArray const&, mxnet::NDArray const&, int, bool)+0x8b0) [0x7f47924b7100]
  [bt] (3) ./libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x9e) [0x7f47923e275e]
  [bt] (4) ./libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x3cc) [0x7f47923ea07c]
  [bt] (5) ./libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0xa9f) [0x7f47923d611f]
  [bt] (6) ./libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x36a) [0x7f47923d6b6a]
  [bt] (7) ./libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x9f8) [0x7f47922940b8]
  [bt] (8) ./libmxnet.so(MXImperativeInvoke+0x4c) [0x7f4792294ebc]

Stack trace:
  [bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x56357cc34483]
  [bt] (1) ./CrazyAra(mxnet::cpp::Operator::Invoke(std::vector<mxnet::cpp::NDArray, std::allocator<mxnet::cpp::NDArray> >&)+0x583) [0x56357cc38fb3]
  [bt] (2) ./CrazyAra(mxnet::cpp::Operator::Invoke(mxnet::cpp::NDArray&)+0x9f) [0x56357cc3939f]
  [bt] (3) ./CrazyAra(mxnet::cpp::NDArray::Copy(mxnet::cpp::Context const&) const+0xbd) [0x56357cc6247d]
  [bt] (4) ./CrazyAra(NeuralNetAPI::load_parameters(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x592) [0x56357cc5cad2]
  [bt] (5) ./CrazyAra(NeuralNetAPI::NeuralNetAPI(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x433) [0x56357cc5e463]
  [bt] (6) ./CrazyAra(CrazyAra::is_ready()+0x809) [0x56357cc31449]
  [bt] (7) ./CrazyAra(CrazyAra::uci_loop(int, char**)+0x8ed) [0x56357cc3364d]
  [bt] (8) ./CrazyAra(main+0x4b) [0x56357cb427fb]

Aborted (core dumped)
QueensGambit commented 5 years ago

Hello @ProgramFOX, thank you for trying out the CrazyAra binary on Ubuntu 19.04.

It seems that MXNet was unable to detect the GPU on your system.

  what():  [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/cpp-package/include/mxnet-cpp/operator.hpp:141: [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU

You can check whether you get the same error if you enable CPU usage instead:

./CrazyAra
uci
setoption name Context value cpu
isready
go movetime 3000
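
Independent of the CPU fallback, you can also query how many GPUs MXNet itself detects. Below is a minimal stand-alone sketch using MXNet's C API; the file name gpu_check.cpp and the build line are assumptions, so adjust the include and library paths to your MXNet installation:

// gpu_check.cpp -- ask MXNet how many GPUs it can see.
// Build sketch: g++ gpu_check.cpp -I<mxnet>/include -L. -lmxnet -o gpu_check
#include <cstdio>
#include <mxnet/c_api.h>

int main() {
    int count = 0;
    // MXGetGPUCount returns 0 on success and writes the device count.
    if (MXGetGPUCount(&count) != 0) {
        std::printf("MXNet error: %s\n", MXGetLastError());
        return 1;
    }
    std::printf("MXNet sees %d GPU(s)\n", count);
    return count > 0 ? 0 : 1;
}

If this prints 0 GPUs, the problem lies in the MXNet/driver setup rather than in CrazyAra itself.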

I just added to CrazyAra_0.6.0_Linux_CUDA.zip all shared object files that libmxnet.so directly links to:

$ ldd libmxnet.so
    linux-vdso.so.1 (0x00007ffec27f7000)
    libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007feb78664000)
    libopenblas.so.0 => /usr/lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007feb763be000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007feb761b6000)
    libomp.so => /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/build/3rdparty/openmp/runtime/src/libomp.so (0x00007feb75eee000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007feb75ccf000)
    libcublas.so.10.0 => /usr/local/cuda-10.0/lib64/libcublas.so.10.0 (0x00007feb71739000)
    libcufft.so.10.0 => /usr/local/cuda-10.0/lib64/libcufft.so.10.0 (0x00007feb6b285000)
    libcusolver.so.10.0 => /usr/local/cuda-10.0/lib64/libcusolver.so.10.0 (0x00007feb62b9e000)
    libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007feb5ea37000)
    libnvrtc.so.10.0 => /usr/local/cuda-10.0/lib64/libnvrtc.so.10.0 (0x00007feb5d41b000)
    libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007feb5c325000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007feb5c121000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007feb5bd98000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007feb5b9fa000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007feb5b7e2000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007feb5b3f1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007feb87145000)
    libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007feb5b012000)
    libnvidia-fatbinaryloader.so.410.48 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.48 (0x00007feb5adc5000)

However, it is likely that further files are required.

For Windows (IntelMKL & CUDA), all necessary DLL files have been added so that CrazyAra runs stand-alone; this was successfully tested on an external system.

I will likely be able to test whether the GPU binary works stand-alone on Linux in the upcoming weeks.

Building CrazyAra with its dependencies from source on Linux requires less time and effort than on Windows. I can add an install.sh script in the future to ease the compilation process.

thomas-daniels commented 5 years ago

Enabling CPU usage followed by isready and go does indeed seem to work!

For the GPU with CUDA, I tried with your downloadable shared object files and isready still fails, though with a different error now:

isready
info string json file: model/model-1.19246-0.603-symbol.json
info string Loading the model from model/model-1.19246-0.603-symbol.json
[13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
info string Loading the model parameters from model/model-1.19246-0.603-0223.params
info string Bind successfull!
terminate called after throwing an instance of 'dmlc::Error'
  what():  [13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/cpp-package/include/mxnet-cpp/ndarray.hpp:237: Check failed: MXNDArrayWaitToRead(blob_ptr_->handle_) == 0 (-1 vs. 0) : [13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/operator/nn/./././im2col.cuh:321: Check failed: err == cudaSuccess (48 vs. 0) : Name: im2col_nd_gpu_kernel ErrStr:no kernel image is available for execution on the device
Stack trace:
  [bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x564326298483]
  [bt] (1) ./libmxnet.so(void mxnet::op::im2col<float>(mshadow::Stream<mshadow::gpu>*, float const*, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, float*)+0x2cb) [0x7f8de90e05db]
  [bt] (2) ./libmxnet.so(mxnet::op::ConvolutionOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xa67) [0x7f8de90e6b27]
  [bt] (3) ./libmxnet.so(void mxnet::op::ConvolutionCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x17c) [0x7f8de90d76fc]
  [bt] (4) ./libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x6e) [0x7f8de6e4f06e]
  [bt] (5) ./libmxnet.so(+0xdf832a) [0x7f8de6e5532a]
  [bt] (6) ./libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x405) [0x7f8de6e36a25]
  [bt] (7) ./libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x11d) [0x7f8de6e39d8d]
  [bt] (8) ./libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f8de6e3a04e]

Stack trace:
  [bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x564326298483]
  [bt] (1) ./CrazyAra(mxnet::cpp::NDArray::WaitToRead() const+0xd9) [0x5643262c29c9]
  [bt] (2) ./CrazyAra(NeuralNetAPI::predict(float*, float&)+0x7a8) [0x5643262bfe38]
  [bt] (3) ./CrazyAra(NeuralNetAPI::infer_select_policy_from_planes()+0x69) [0x5643262c0439]
  [bt] (4) ./CrazyAra(NeuralNetAPI::NeuralNetAPI(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x443) [0x5643262c2473]
  [bt] (5) ./CrazyAra(CrazyAra::is_ready()+0x809) [0x564326295449]
  [bt] (6) ./CrazyAra(CrazyAra::uci_loop(int, char**)+0x8ed) [0x56432629764d]
  [bt] (7) ./CrazyAra(main+0x4b) [0x5643261a67fb]
  [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f8de5912b6b]

Aborted (core dumped)
QueensGambit commented 5 years ago

Which GPU are you using on your system? Is it compatible with CUDA 10.0? According to these posts, this might be an issue.

If you have an Intel CPU, you might try the MKL version. I will add the missing .so files there as well.
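
The "no kernel image is available for execution on the device" error typically means the shipped libmxnet.so was compiled without kernels for your GPU's compute capability. As a quick check you can print the device's compute capability with the CUDA runtime API; a minimal sketch (the file name cc_check.cpp is hypothetical; build with nvcc or link against libcudart):

// cc_check.cpp -- print the compute capability of each visible CUDA device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::printf("No CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // A GTX 960, for instance, should report 5.2 (sm_52).
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}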

thomas-daniels commented 5 years ago

GeForce GTX 960. It should be compatible with CUDA as per NVIDIA's "Supported GPUs" page (although NVIDIA has some dead links on their website, and I can't reach the page that's supposed to give more details about my GPU). The driver version is 418.56 according to nvidia-smi.

I also have CUDA 10.1 installed, alongside CUDA 10.0. Could this be a problem?

I do have an Intel CPU, but the MKL version crashes on startup:

symbol lookup error: ./CrazyAra: undefined symbol: MXSymbolInferShapeEx

But I'll try it again when you put all the .so files in the download.

thomas-daniels commented 5 years ago

I also have CUDA 10.1 installed, alongside CUDA 10.0. Could this be a problem?

To be more specific on this, according to nvidia-smi, I have 10.1 as CUDA driver version. Apparently that shouldn't cause any problems when using CUDA 10.0, though.
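
For reference, the CUDA runtime can report both versions directly; a small sketch (assuming the CUDA toolkit is installed) to confirm that the driver is at least as new as the runtime:

// cuda_versions.cpp -- print the CUDA driver and runtime versions.
// A driver version >= the runtime version (e.g. 10010 vs. 10000) is fine.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // e.g. 10010 for CUDA 10.1
    cudaRuntimeGetVersion(&runtime);  // e.g. 10000 for CUDA 10.0
    std::printf("driver %d, runtime %d\n", driver, runtime);
    return 0;
}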

QueensGambit commented 5 years ago

I just added the remaining .so files to CrazyAra_0.6.0_Linux_MKL.zip.

Having different CUDA versions on the system might cause issues. I will upgrade the Linux version to CUDA 10.1 for newer releases.

You could also try building the lc0 binary to check whether your GPU supports CUDA 10.1.

thomas-daniels commented 5 years ago

For the MKL binary, I'm still getting the same undefined-symbol error.

As I cannot seem to find a way to get rid of CUDA 10.1 entirely in favor of CUDA 10.0, and my attempts to downgrade the driver have failed too, I think I'll just wait until 10.1 is supported and see if I have better luck then.

QueensGambit commented 5 years ago

Alright, I will notify you when a binary for Linux with CUDA 10.1 support is available.

Regarding the MKL binary:

Did you ensure that the shared object files are within your LD_LIBRARY_PATH? e.g. in ~/.bashrc:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path-to-crazyara>
source ~/.bashrc
thomas-daniels commented 5 years ago

Yep, that's what I did. (If I hadn't, it would crash anyway because it couldn't find libmxnet.so.)

QueensGambit commented 5 years ago

Hmm, you're right, that makes sense. It might be because I built both MXNet 1.4.1 and MXNet 1.5.0 on my system. I just recompiled the CrazyAra binary and updated CrazyAra_0.6.0_Linux_MKL.zip.
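
For what it's worth, an undefined-symbol crash like this usually indicates that the binary was linked against newer MXNet headers than the libmxnet.so it finds at runtime; judging from this thread, the 1.4.1-era libmxnet.so does not export MXSymbolInferShapeEx. A quick way to probe which symbols a given libmxnet.so exports is a small dlopen/dlsym check (the file name symcheck.cpp is hypothetical):

// symcheck.cpp -- probe whether a libmxnet.so build exports a given symbol.
// Build sketch: g++ symcheck.cpp -ldl -o symcheck
#include <cstdio>
#include <dlfcn.h>

int main(int argc, char** argv) {
    const char* lib = argc > 1 ? argv[1] : "./libmxnet.so";
    void* handle = dlopen(lib, RTLD_LAZY);
    if (!handle) {
        std::printf("dlopen failed: %s\n", dlerror());
        return 1;
    }
    void* sym = dlsym(handle, "MXSymbolInferShapeEx");
    std::printf("MXSymbolInferShapeEx is %s in %s\n",
                sym ? "exported" : "missing", lib);
    dlclose(handle);
    return sym ? 0 : 1;
}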

thomas-daniels commented 5 years ago

No more symbol error, but now I'm getting this:

*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
QueensGambit commented 5 years ago

Good to know that the symbol error is fixed.

Unfortunately, I'm unable to reproduce this crash on my system, and I compiled from the same commit hash db43fbf79cc4806ace9ca5a074d85e5fe93eb9f0 as for the other release packages. Strangely, this didn't occur for the CPU-OpenBlas version that you tried. I'm assuming that this crash happens after calling the go command.

I added a Debug-mode executable, CrazyAra_debug, to CrazyAra_0.6.0_Linux_MKL.zip. Maybe this can give a more informative stack trace.

thomas-daniels commented 5 years ago

No, it happens on startup. The debug executable gives the exact same output without any stack trace.

QueensGambit commented 5 years ago

Thank you for the clarification. The only option I see left is to do a clean re-install of MXNet-MKL and upload the whole CrazyAra_0.6.0_Linux_MKL release package again.

If you want to run CrazyAra with IntelMKL support on Linux right now, you can try building it from source. Building MXNet with IntelMKL support takes less time than the CUDA version; on my laptop the build took about three quarters of an hour.

QueensGambit commented 4 years ago

I just published release 0.7.0 with CUDA 10.1 support and added ./ to the library run path of the Linux binaries. This way, LD_LIBRARY_PATH no longer needs to be changed.

thomas-daniels commented 4 years ago

Cool, I'll try it out when I have some time (which may take a while, especially because I have exams now :P)