thomas-daniels closed this issue 3 years ago
Hello @ProgramFOX, thank you for trying out the CrazyAra binary on Ubuntu 19.04.
It seems that MXNet was unable to detect the GPU on your system.
what(): [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/cpp-package/include/mxnet-cpp/operator.hpp:141: [23:18:56] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
You can check if you get the same error if you enable CPU usage instead:
./CrazyAra
uci
setoption name Context value cpu
isready
go movetime 3000
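The same commands can also be fed to the engine non-interactively on stdin. A minimal sketch (the command list is taken from the suggestion above; `quit` is added so the engine exits afterwards, and the helper name `uci_cpu_session` is hypothetical):

```shell
# Emit the scripted UCI session, one command per line.
uci_cpu_session() {
  printf '%s\n' \
    uci \
    'setoption name Context value cpu' \
    isready \
    'go movetime 3000' \
    quit
}

# usage (assuming CrazyAra sits in the current directory):
# uci_cpu_session | ./CrazyAra
```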
I just added to CrazyAra_0.6.0_Linux_CUDA.zip all shared object files that libmxnet.so directly links to:
$ ldd libmxnet.so
linux-vdso.so.1 (0x00007ffec27f7000)
libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007feb78664000)
libopenblas.so.0 => /usr/lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007feb763be000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007feb761b6000)
libomp.so => /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/build/3rdparty/openmp/runtime/src/libomp.so (0x00007feb75eee000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007feb75ccf000)
libcublas.so.10.0 => /usr/local/cuda-10.0/lib64/libcublas.so.10.0 (0x00007feb71739000)
libcufft.so.10.0 => /usr/local/cuda-10.0/lib64/libcufft.so.10.0 (0x00007feb6b285000)
libcusolver.so.10.0 => /usr/local/cuda-10.0/lib64/libcusolver.so.10.0 (0x00007feb62b9e000)
libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007feb5ea37000)
libnvrtc.so.10.0 => /usr/local/cuda-10.0/lib64/libnvrtc.so.10.0 (0x00007feb5d41b000)
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007feb5c325000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007feb5c121000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007feb5bd98000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007feb5b9fa000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007feb5b7e2000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007feb5b3f1000)
/lib64/ld-linux-x86-64.so.2 (0x00007feb87145000)
libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007feb5b012000)
libnvidia-fatbinaryloader.so.410.48 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.48 (0x00007feb5adc5000)
However, it is likely that further files are required.
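Whether any transitive dependency is still unresolved after unzipping can be checked with ldd; a small sketch (`missing_deps` is a hypothetical helper, and the libmxnet.so path is whatever you extracted):

```shell
# Print every shared-object dependency that the dynamic loader
# cannot resolve ("not found" entries in the ldd output).
missing_deps() {
  ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# usage: missing_deps ./libmxnet.so
```

If this prints nothing, the loader can resolve every direct and indirect dependency.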
For Windows (IntelMKL & CUDA), all necessary DLL files have been added so that CrazyAra runs stand-alone; this was successfully tested on an external system.
I will likely be able to test whether the GPU binary works stand-alone on Linux in the upcoming weeks.
Building CrazyAra and its dependencies from source on Linux requires less time and effort than on Windows. I can add an install.sh script in the future to ease the compilation process.
Enabling CPU usage followed by isready and go seems to work indeed!
For the GPU with CUDA, I tried with your downloadable shared object files, and isready still errors, albeit with a different one now:
isready
info string json file: model/model-1.19246-0.603-symbol.json
info string Loading the model from model/model-1.19246-0.603-symbol.json
[13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.4.1. Attempting to upgrade...
[13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
info string Loading the model parameters from model/model-1.19246-0.603-0223.params
info string Bind successfull!
terminate called after throwing an instance of 'dmlc::Error'
what(): [13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/cpp-package/include/mxnet-cpp/ndarray.hpp:237: Check failed: MXNDArrayWaitToRead(blob_ptr_->handle_) == 0 (-1 vs. 0) : [13:48:46] /media/queensgambit/Volume/Deep_Learning/libraries/mxnet/src/operator/nn/./././im2col.cuh:321: Check failed: err == cudaSuccess (48 vs. 0) : Name: im2col_nd_gpu_kernel ErrStr:no kernel image is available for execution on the device
Stack trace:
[bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x564326298483]
[bt] (1) ./libmxnet.so(void mxnet::op::im2col<float>(mshadow::Stream<mshadow::gpu>*, float const*, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, mxnet::TShape const&, float*)+0x2cb) [0x7f8de90e05db]
[bt] (2) ./libmxnet.so(mxnet::op::ConvolutionOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xa67) [0x7f8de90e6b27]
[bt] (3) ./libmxnet.so(void mxnet::op::ConvolutionCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x17c) [0x7f8de90d76fc]
[bt] (4) ./libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x6e) [0x7f8de6e4f06e]
[bt] (5) ./libmxnet.so(+0xdf832a) [0x7f8de6e5532a]
[bt] (6) ./libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x405) [0x7f8de6e36a25]
[bt] (7) ./libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x11d) [0x7f8de6e39d8d]
[bt] (8) ./libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f8de6e3a04e]
Stack trace:
[bt] (0) ./CrazyAra(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x564326298483]
[bt] (1) ./CrazyAra(mxnet::cpp::NDArray::WaitToRead() const+0xd9) [0x5643262c29c9]
[bt] (2) ./CrazyAra(NeuralNetAPI::predict(float*, float&)+0x7a8) [0x5643262bfe38]
[bt] (3) ./CrazyAra(NeuralNetAPI::infer_select_policy_from_planes()+0x69) [0x5643262c0439]
[bt] (4) ./CrazyAra(NeuralNetAPI::NeuralNetAPI(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x443) [0x5643262c2473]
[bt] (5) ./CrazyAra(CrazyAra::is_ready()+0x809) [0x564326295449]
[bt] (6) ./CrazyAra(CrazyAra::uci_loop(int, char**)+0x8ed) [0x56432629764d]
[bt] (7) ./CrazyAra(main+0x4b) [0x5643261a67fb]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f8de5912b6b]
Aborted (core dumped)
What GPU are you using on your system? Is it compatible with CUDA 10.0? According to these posts, this might be an issue:
If you have an Intel CPU, you might try the MKL version. I will add the missing .so files there as well.
GeForce GTX 960. Should be compatible with CUDA as per their "Supported GPUs" page (although NVIDIA has some dead links on their website and I can't get to the page that's supposed to give more details about my GPU). Driver version is 418.56 as per nvidia-smi.
I also have CUDA 10.1 installed, alongside CUDA 10.0. Could this be a problem?
I have an Intel CPU; the MKL version crashes on startup:
symbol lookup error: ./CrazyAra: undefined symbol: MXSymbolInferShapeEx
But I'll try it again when you put all the .so files in the download.
I also have CUDA 10.1 installed, alongside CUDA 10.0. Could this be a problem?
To be more specific on this, according to nvidia-smi, I have 10.1 as CUDA driver version. Apparently that shouldn't cause any problems when using CUDA 10.0, though.
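A 418.56 driver should indeed cover both toolkits: per NVIDIA's release notes, CUDA 10.0 on Linux needs driver >= 410.48 and CUDA 10.1 needs >= 418.39. A rough sketch of that check (`driver_supports` is a hypothetical helper that compares only the major version number, so it is deliberately coarse):

```shell
# Return success if the installed driver's major version meets the
# required driver's major version (coarse check, majors only).
driver_supports() {
  driver_major=${1%%.*}
  required_major=${2%%.*}
  [ "$driver_major" -ge "$required_major" ]
}

# usage: driver_supports "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" 410.48
```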
I just added the remaining .so files to CrazyAra_0.6.0_Linux_MKL.zip.
Having different CUDA versions on the system might cause issues. I will upgrade the Linux version to CUDA 10.1 for newer releases.
You can also try to build the lc0 binary to see whether your GPU supports CUDA 10.1.
For the MKL binary, I'm still getting the same undefined symbol.
As I cannot seem to find a way to get rid of CUDA 10.1 entirely in favor of CUDA 10.0, and attempts to downgrade my driver have failed too, I think I'll just wait until 10.1 is supported and see if I have better luck.
Alright, I will notify you when a binary for Linux with CUDA 10.1 support is available.
Regarding the MKL binary:
Did you ensure that the shared object files are within your LD_LIBRARY_PATH?
E.g. in ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path-to-crazyara>
source ~/.bashrc
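Alternatively, instead of editing ~/.bashrc, the variable can be set for a single invocation only; a sketch (`run_crazyara` is a hypothetical wrapper name, to be run from the directory the zip was extracted into):

```shell
# One-shot alternative: prepend the current directory to the library
# search path only for this invocation, without touching ~/.bashrc.
run_crazyara() {
  LD_LIBRARY_PATH="$PWD:$LD_LIBRARY_PATH" "$PWD/CrazyAra" "$@"
}

# usage: cd <path-to-crazyara> && run_crazyara
```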
Yep, that's what I did. (If I did not, it would crash because it couldn't find libmxnet.so anyway.)
Hmm, you're right, that makes sense.
It might be because I built both MXNet 1.4.1 and MXNet 1.5.0 on my system.
I just recompiled the CrazyAra binary and updated CrazyAra_0.6.0_Linux_MKL.zip.
No more symbol error, but now I'm getting this:
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
Good to know that the symbol error is fixed.
Unfortunately, I'm unable to reproduce this crash on my system, and I used the same commit hash db43fbf79cc4806ace9ca5a074d85e5fe93eb9f0 before compiling as for the other release packages.
Strangely, this didn't occur for the CPU-OpenBlas version that you tried.
I'm assuming that this crash happens after calling the go command.
I added the executable in Debug mode, CrazyAra_debug, to CrazyAra_0.6.0_Linux_MKL.zip.
Maybe this can give a more informative stack trace.
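If the debug executable still prints nothing useful, running it under gdb in batch mode can force a backtrace at the crash site. A sketch, assuming gdb is installed (`gdb_backtrace_cmd` is a hypothetical helper that only builds the command line):

```shell
# Build a gdb invocation that runs the given binary non-interactively
# and prints a backtrace when it terminates abnormally.
gdb_backtrace_cmd() {
  printf 'gdb -batch -ex run -ex bt --args %s\n' "$1"
}

# usage: eval "$(gdb_backtrace_cmd ./CrazyAra_debug)"
```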
No, it happens on startup. The debug executable gives the exact same output without any stack trace.
Thank you for the clarification. The only option I see left is to do a clean re-install of MXNet-MKL and upload the whole CrazyAra_0.6.0_Linux_MKL release package again.
If you want to run CrazyAra with IntelMKL support on Linux now you can try building it from source. Building MXNet with IntelMKL support requires less time compared to the CUDA version. On my laptop the building process took about three quarters of an hour.
I just published release 0.7.0 with CUDA 10.1 support and added ./ to the library run path of the Linux binaries. This way, the LD_LIBRARY_PATH no longer needs to be changed.
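The run-path change can be verified after extraction by inspecting the binary's dynamic section; a sketch assuming binutils' readelf is available (`show_runpath` is a hypothetical helper):

```shell
# Show the RPATH/RUNPATH entries baked into an ELF binary, if any.
show_runpath() {
  readelf -d "$1" 2>/dev/null | grep -E 'RPATH|RUNPATH'
}

# usage: show_runpath ./CrazyAra
```

With ./ in the run path, the output should contain an entry listing ./, and the dynamic loader will search the current directory for libmxnet.so.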
Cool, I'll try it out when I have some time (which can take a while especially because I have exams now :P)
I finally managed to run the executable (the Linux GPU version; Ubuntu 19.04) without startup errors, and I should have put the models in the right place, but the UCI go command makes this happen:
Many other UCI commands result in an error. For example, ucinewgame results in a segfault:
And isready returns a big error: