RSE-Sheffield / GPUComputing

GitHub repository for tracking issues and reporting bugs on all aspects of GPU computing, including Deep Learning

ShARC Singularity Caffe images experiencing CUDA error #7

Closed by twinkarma 6 years ago

twinkarma commented 6 years ago

This is the output from the MNIST test model, which failed on the Conv layer; other users have reported failures on the ReLU layer as well with different models. A sketch of the sort of command that produced it follows the log below.

I1212 15:30:50.001453  7514 layer_factory.hpp:77] Creating layer mnist
I1212 15:30:50.025658  7514 db_lmdb.cpp:35] Opened lmdb data/mnist_train_lmdb
I1212 15:30:50.026167  7514 net.cpp:84] Creating Layer mnist
I1212 15:30:50.026208  7514 net.cpp:380] mnist -> data
I1212 15:30:50.026262  7514 net.cpp:380] mnist -> label
I1212 15:30:50.028520  7514 data_layer.cpp:45] output data size: 64,1,28,28
I1212 15:30:50.032743  7514 net.cpp:122] Setting up mnist
I1212 15:30:50.032812  7514 net.cpp:129] Top shape: 64 1 28 28 (50176)
I1212 15:30:50.032842  7514 net.cpp:129] Top shape: 64 (64)
I1212 15:30:50.032858  7514 net.cpp:137] Memory required for data: 200960
I1212 15:30:50.032953  7514 layer_factory.hpp:77] Creating layer conv1
I1212 15:30:50.033026  7514 net.cpp:84] Creating Layer conv1
I1212 15:30:50.033052  7514 net.cpp:406] conv1 <- data
I1212 15:30:50.033084  7514 net.cpp:380] conv1 -> conv1
E1212 15:30:50.196197  7541 common.cpp:114] Cannot create Cublas handle. Cublas won't be available.
F1212 15:30:50.544200  7514 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
    @     0x2b822f3e95cd  google::LogMessage::Fail()
    @     0x2b822f3eb433  google::LogMessage::SendToLog()
    @     0x2b822f3e915b  google::LogMessage::Flush()
    @     0x2b822f3ebe1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x2b822e706fab  caffe::CuDNNConvolutionLayer<>::LayerSetUp()
    @     0x2b822e66866c  caffe::Net<>::Init()
    @     0x2b822e66ad5e  caffe::Net<>::Net()
    @     0x2b822e649c45  caffe::Solver<>::InitTrainNet()
    @     0x2b822e64b0b5  caffe::Solver<>::Init()
    @     0x2b822e64b3cf  caffe::Solver<>::Solver()
    @     0x2b822e62ed51  caffe::Creator_SGDSolver<>()
    @           0x416dac  caffe::SolverRegistry<>::CreateSolver()
    @           0x40e67d  train()
    @           0x40b8a3  main
    @     0x2b823058c830  __libc_start_main
    @           0x40c249  _start
    @              (nil)  (unknown)
Aborted
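
For context, the run above came from something like the command below; the image path is a placeholder rather than the exact ShARC location, and it assumes the caffe binary is on the container's PATH:

# Placeholder image path; the real ShARC Caffe image lives elsewhere
singularity exec /path/to/caffe-gpu.img caffe train --solver=examples/mnist/lenet_solver.prototxt
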
twinkarma commented 6 years ago

@willfurnass @mondus It looks like it just stopped working suddenly; maybe it's due to driver updates? Drivers should be backwards compatible with CUDA code, though.
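
A quick way to compare the host driver with the CUDA toolkit in the image is something like the following (the image path is just a placeholder, and it assumes nvcc is installed in the image):

# Driver version on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version inside the container
singularity exec /path/to/caffe-gpu.img nvcc --version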

willfurnass commented 6 years ago

@twinkarma How can one reproduce this issue?

twinkarma commented 6 years ago

@willfurnass It looks like it'll fail on any model. I've just updated my local machine to driver version 384.90 and am now getting the same errors.

willfurnass commented 6 years ago

@twinkarma Will enough people encounter the issue to warrant notifying the HPC and/or RSE community mailing lists?

twinkarma commented 6 years ago

@willfurnass I've installed Singularity 2.4.2 on my machine and the --nv flag is now working properly; it also seems to fix this problem. I'm recommending updating the Singularity version, removing the old config entries, and having everyone use the --nv flag instead.
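
For reference, the new usage is just something like this (the image path is a placeholder):

# Let Singularity bind the host's NVIDIA driver libraries into the container,
# then check the GPU is visible
singularity exec --nv /path/to/caffe-gpu.img nvidia-smi

# Run Caffe as before, but with --nv
singularity exec --nv /path/to/caffe-gpu.img caffe train --solver=examples/mnist/lenet_solver.prototxt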

twinkarma commented 6 years ago

... After testing that it works

twinkarma commented 6 years ago

@willfurnass I just mean updating the documentation; there's definitely no need for a group-wide email about it.

willfurnass commented 6 years ago

@AnthonyBrookfield: FYI

willfurnass commented 6 years ago

Great. It'll be good to be using the same mechanism as other Singularity users to make GPUs usable in containers.

For reference, here's Greg Kurtzer's explanation of the --nv flag:

Singularity (now in development branch) supports the --nv flag which will find the relevant Nvidia/Cuda libraries on your host via the ld.so.cache, and will bind those into a library location within the container automatically. It will also make sure those libraries are linked, as necessary, by any Cuda applications that require it. Additionally, with the device tree bound into the container, all of the components for appropriate application runtime support are present, and as has been tested, it just works. :)
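
As a rough illustration of what --nv consults, the host's dynamic-linker cache can be listed with standard tools (nothing Singularity-specific here):

# NVIDIA/CUDA libraries registered in the host's ld.so.cache --
# these are the candidates that --nv binds into the container
ldconfig -p | grep -iE 'nvidia|cuda'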

anthonybrookfield commented 6 years ago

I've got RPMs for 2.4.2 built and ready to go... Do you want to test on a GPU node before I roll them out? (I'll be removing the bind path options for nvlib and nvbin from the Singularity config file.)

twinkarma commented 6 years ago

@anthonybrookfield Yes, I'd like to test it first before we roll it out.

twinkarma commented 6 years ago

@anthonybrookfield Any updates on this?

anthonybrookfield commented 6 years ago

I've put the new version on the flybrain nodes (still using the old config file though)

twinkarma commented 6 years ago

I'm getting the following error:


WARNING: Skipping user bind, non existent bind point (file) in container: '/bin/nvidia-smi'

Looks like we have to enable overlay (singularityware/singularity#624) in the config file:

enable overlay = yes
# or, to try and silently ignore it if the bind can't be made:
enable overlay = try
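
A quick way to check which setting a node is actually running with is something like the following (the path is the common default for packaged 2.x installs; it may differ on ShARC):

grep 'enable overlay' /etc/singularity/singularity.conf
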
twinkarma commented 6 years ago

@anthonybrookfield Hi Anthony, any updates on this?

twinkarma commented 6 years ago

@anthonybrookfield Hi Anthony, any way we can move forwards with this?

anthonybrookfield commented 6 years ago

I needed to track down the reason for the scary warning in the config file about enabling overlay:

note: currently disabled because RHEL7 kernel crashes with it... :(

It looks like it should be fine for RHEL >7.2 (https://github.com/singularityware/singularity/issues/228), so I've updated the config file to enable it. The flybrain nodes should now have "enable overlay = try"; the rest of the cluster will pick up the change in the next couple of hours.
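
For reference, a couple of standard checks on a node confirm it meets that requirement:

# RHEL release and kernel version (overlay is reported problematic before RHEL 7.2)
cat /etc/redhat-release
uname -r

# Dry-run check that the overlay kernel module is available without loading it
modprobe -n -v overlay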

twinkarma commented 6 years ago

Sorry for the delay; I've just had a chance to test this. It looks like it works! We just need to remove the /nvbin and /nvlib binding points from the images, as they're no longer necessary and they interfere with the libraries that Singularity imports automatically.

I will get to work on updating the docs.

twinkarma commented 6 years ago

I mean remove the /nvbin and /nvlib binding points from the configs, rather than from the images.
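
To be explicit, the entries to drop from singularity.conf would look something like the lines below; the exact paths on ShARC aren't quoted in this thread, so these are illustrative only:

# Illustrative only: old-style host-to-container bind entries mapping CUDA
# libraries/binaries to /nvlib and /nvbin (real ShARC entries may differ)
bind path = /usr/local/cuda/lib64:/nvlib
bind path = /usr/bin:/nvbin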

anthonybrookfield commented 6 years ago

OK - let me know when the docs are ready and then I'll remove the configs for those bind points. It should be a quick change to make. Do you know if there are many users of the old method who will need informing of the change?

twinkarma commented 6 years ago

I've updated the singularity instructions now. See PR rcgsheffield/sheffield_hpc#776

anthonybrookfield commented 6 years ago

The singularity.conf file is now updated. I'd suggest posting to the HPC discussion group just to warn people that the old method won't work any more, and pointing them at the docs...

twinkarma commented 6 years ago

OK, I'll do that today.