ROCm / hipCaffe

(Deprecated) hipCaffe: the HIP port of Caffe

Getting core dumps on "real" workloads #41

Closed davclark closed 6 years ago

davclark commented 6 years ago

Issue summary

I initially reported in #19 that I am seeing test failures as well as core dumps; this issue covers only the core dumps for now.

In short, NetTest/0.TestReshape, as well as my attempts at running the MNIST and CaffeNet examples, all end with a core dump. The data for CIFAR-10 has an integrity problem...

(I've been out of the game long enough that I'm not sure how to get the stack trace with gdb... I'm happy to look into this further.)
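
For reference, one common way to get a stack trace from a crashing binary is to run it under gdb in batch mode. The following is only a sketch, assuming gdb is installed. The helper name `gdb_backtrace_cmd` is hypothetical, the binary path is an assumption about where the test binary lives in a local build, and the test name is the one reported above. The helper prints the gdb command line rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Print a gdb command line that runs a program and shows a backtrace
# if it crashes. Nothing is executed here; the command is only printed.
gdb_backtrace_cmd() {
    # $1 = path to the binary; any remaining arguments go to the program
    bin="$1"
    shift
    printf 'gdb -batch -ex run -ex bt --args %s' "$bin"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Example: the failing test named in this issue.
# The binary path is an assumption; adjust it to your build directory.
gdb_backtrace_cmd ./build/test/test_all.testbin --gtest_filter='NetTest/0.TestReshape'
```

Running the printed command should show the `bt` output once the program receives SIGSEGV, provided the binary can be started at all.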

Steps to reproduce

If you are having difficulty building Caffe or training a model, please ask the caffe-users mailing list. If you are reporting a build error that seems to be due to a bug in Caffe, please attach your build configuration (either Makefile.config or CMakeCache.txt) and the output of the make (or cmake) command.

Your system configuration

Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0, ROCm & DKMS from the ROCm PPA.
GPU: AMD RX 580, drivers from rocm PPA
CPU: Threadripper 1900X on X399 chipset
Compiler: (I think this is the one you want?) hcc version=1.2.18063-7e18c64-ac8732c-710f135, workweek (YYWWD) = 18063
BLAS: ROCBLAS (I assume - that's the default in Makefile.config)
Python or MATLAB version (for pycaffe and matcaffe respectively): standard Ubuntu Python 2.7, no matlab

davclark commented 6 years ago

I also tried running as root to see if it was a permissions problem... I think it's not. I still get a core dump with sudo.

davclark commented 6 years ago

Just upgraded to most recent kernel modules, etc. (e.g., hip_hcc 1.5.18081, rocblas 0.13.2.1, compute-firmware 1.7.18, rock-dkms 1.7.148, rocm-opencl 1.2.0.2018041722, among many others). Tried a clean recompile and test - still same errors (including those from #19), but now I get a print-out of the stack-trace:

PC: @     0x7f5e04135512 cfree
*** SIGSEGV (@0x7f5cfffffff8) received by PID 81158 (TID 0x7f5e0bb5bbc0) from PID 18446744073709551608; stack trace: ***
    @     0x7f5e0af32390 (unknown)
    @     0x7f5e04135512 cfree
    @     0x7f5e0634ea8f miopen::Db::Db()
    @     0x7f5e06444506 mlo_construct_direct2D::GetDb()
    @     0x7f5e064448dc mlo_construct_BwdWrW2D::FindSolution()
    @     0x7f5e0633f097 miopen::ConvolutionDescriptor::BackwardWeightsGetWorkSpaceSizeDirect()
    @     0x7f5e0633f48a miopen::ConvolutionDescriptor::ConvolutionBackwardWeightsGetWorkSpaceSize()
    @     0x7f5e06349747 miopenConvolutionBackwardWeightsGetWorkSpaceSize
    @          0x13fbd42 caffe::CuDNNConvolutionLayer<>::Reshape()
    @          0x1ab9d89 caffe::Net<>::Init()
    @          0x1ab8d95 caffe::Net<>::Net()
    @           0xa3e80a caffe::NetTest<>::InitNetFromProtoString()
    @           0xa3f9dc caffe::NetTest<>::InitReshapableNet()
    @           0xaa37f5 caffe::NetTest_TestReshape_Test<>::TestBody()
    @          0x1081c24 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1081ae6 testing::Test::Run()
    @          0x1082c11 testing::TestInfo::Run()
    @          0x1083477 testing::TestCase::Run()
    @          0x1089b17 testing::internal::UnitTestImpl::RunAllTests()
    @          0x1089554 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1089509 testing::UnitTest::Run()
    @          0x1f660ea main
    @     0x7f5e040d1830 __libc_start_main
    @          0x1f62e09 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

sunway513 commented 6 years ago

Hi @davclark, this could be a user bits configuration issue. Could you try our official docker image and see if the issue remains? Please follow these instructions to configure the docker environment: https://github.com/RadeonOpenCompute/ROCm-docker/blob/master/quick-start.md

And use the following command to launch our official hipcaffe docker image:
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfine rocm/hipcaffe:rocm1.7.1

davclark commented 6 years ago

The basic ROCm image works with the test program vector_copy.

I believe I successfully ran the above image... (corrected unconfine->unconfined). However, I still get a core dump on the tests, in addition to a number of warnings like this:

MIOpen(HIP): Warning [ReadFile] File is unreadable.

The stacktrace for core dump:

MIOpen Error: /data/repo/MIOpen/src/ocl/softmaxocl.cpp:59: Only alpha=1 and beta=0 is supported
F0501 03:19:47.299984    24 cudnn_softmax_layer_hip.cpp:27] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
*** Check failure stack trace: ***
    @     0x7fa3099ae5cd  google::LogMessage::Fail()
    @     0x7fa3099b0433  google::LogMessage::SendToLog()
    @     0x7fa3099ae15b  google::LogMessage::Flush()
    @     0x7fa3099b0e1e  google::LogMessageFatal::~LogMessageFatal()
    @          0x150c2ae  caffe::CuDNNSoftmaxLayer<>::Forward_gpu()
    @           0x4f4377  caffe::Layer<>::Forward()
    @          0x1e1ca73  caffe::SoftmaxWithLossLayer<>::Forward_gpu()
    @           0x4f4377  caffe::Layer<>::Forward()
    @          0x1ac8747  caffe::Net<>::ForwardFromTo()
    @          0x1ac8660  caffe::Net<>::Forward()
    @           0x55a336  caffe::NetTest_TestBackwardWithAccuracyLayer_Test<>::TestBody()
    @          0x1081584  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1081446  testing::Test::Run()
    @          0x1082571  testing::TestInfo::Run()
    @          0x1082dd7  testing::TestCase::Run()
    @          0x1089477  testing::internal::UnitTestImpl::RunAllTests()
    @          0x1088eb4  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1088e69  testing::UnitTest::Run()
    @          0x1f6659a  main
    @     0x7fa304e15830  __libc_start_main
    @          0x1f62fb9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

This seems more like a test configuration issue, though... Note that I am not building hipCaffe, just trying to run the test program that's already there in the image.

sunway513 commented 6 years ago

Hi @davclark , thanks for the further information. Could you provide me the exact steps to repro your issue?

davclark commented 6 years ago

Looking at the info above, I should clarify I am actually on 16.04.4 - I can't find any info on how to "downgrade" to 16.04.3, it seems you're either on the HWE branch, or not. I understand that there is some issue with ROCm and the 16.04.4 kernel (currently 4.13.0-39-generic - strangely, uname -a reports 16.04.1, even though lsb_release reports 16.04.4)?

I've got rocm (including rock / rockt, etc.) installed via AMD's rocm PPA.

In any case, I first verified that the ROCm image seems to work for me on Docker, including compiling and running the vector_copy program. To reproduce the above failure, I simply run your docker command above (fixing a typo):
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/hipcaffe:rocm1.7.1

Then, I run /root/hipCaffe/build/test/test_all.testbin.

Is that what you meant? Happy to provide more info. I can certainly try switching to the HWE-edge kernel also (which is ported from 18.04).

davclark commented 6 years ago

Another idea is that perhaps I should not be using the HWE kernel? I'll see what happens if I use the base LTS kernel... but this makes me think to ask whether there are any other expected settings I may be missing.

sunway513 commented 6 years ago

Hi @davclark, we know some hipCaffe direct tests can fail. That's normal; even upstream Caffe cannot pass all of its direct tests. However, that typically won't affect its functionality on real DL workloads. If you find any issues using hipCaffe with your own model, please provide it to us or recommend some public models similar to your own.

davclark commented 6 years ago

All example models fail. For example, setting up and trying to run the MNIST example results in a core dump. I'm more concerned about core dumps than tolerance violations! E.g. from the hipCaffe dir:

./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh
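
To see which of the three MNIST steps fails, and to keep its full output, the steps can be run in order with everything appended to a log file. This is a minimal sketch, not part of hipCaffe: `run_steps` and the log file name `mnist_run.log` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Run each command in order, append its output to a log file,
# and stop at the first command that fails.
run_steps() {
    log="$1"
    shift
    : > "$log"                          # start with an empty log
    for step in "$@"; do
        echo ">>> $step" >> "$log"
        sh -c "$step" >> "$log" 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            # A process killed by a signal is usually reported as
            # 128 + the signal number (for example 139 for SIGSEGV).
            echo "step failed with exit status $status: $step" >> "$log"
            return "$status"
        fi
    done
    return 0
}

# Usage, from the hipCaffe directory (log file name is hypothetical):
# run_steps mnist_run.log \
#     ./data/mnist/get_mnist.sh \
#     ./examples/mnist/create_mnist.sh \
#     ./examples/mnist/train_lenet.sh
```

The resulting log records each step before it runs, so the last `>>>` line shows where the run stopped.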

sunway513 commented 6 years ago

Hi @davclark , those samples should execute fine. Let's focus on the mnist sample for now.

Could you try to upgrade to ROCm1.7.2? It's publicly available now.
sudo apt update && sudo apt upgrade
Then reboot and verify if the KFD module is properly loaded:
lsmod | grep kfd

Then, change to use the ROCm1.7.2 docker image:
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfine rocm/hipcaffe:rocm1.7.2

If the issue remains, please provide the complete failure log and the output:
uname -a
apt --installed list | grep rock
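
The requested system information could be gathered in one pass with a small script along these lines. It is only a sketch: `run_section` and `collect_rocm_info` are hypothetical helper names, and the package query is written as `apt list --installed`, which is the ordering shown in the apt documentation. Missing tools are reported rather than treated as errors, so the script also runs on machines without apt or lsmod.

```shell
#!/bin/sh
# Print a labelled section with the output of a command.
# If the executable is missing, say so instead of failing.
run_section() {
    title="$1"
    shift
    echo "=== $title ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(no matching output, or the command failed)"
    else
        echo "(not available: $1)"
    fi
    echo
}

# Collect the details asked for in this thread: kernel version,
# whether the KFD module is loaded, and installed rock* packages.
collect_rocm_info() {
    run_section "kernel (uname -a)" uname -a
    run_section "loaded modules matching kfd" sh -c 'lsmod | grep kfd'
    run_section "installed packages matching rock" \
        sh -c 'apt list --installed 2>/dev/null | grep rock'
}

collect_rocm_info
```

Redirecting the output to a file (for example `sh collect.sh > rocm_info.txt`, both names hypothetical) gives a single attachment for the issue.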

davclark commented 6 years ago

You all are moving fast, I see! By the time I was able to try this, the rocm repo had updated to 1.8. So, that's what I got after an update just now.

The 1.7.2 docker image works fine (again, there was a typo in the command: seccomp=unconfine was missing the "d" at the end).

In case it's useful, my kernel was at the following version (I'm on the HWE kernel): 4.13.0-41-generic

Thank you!