Open mpekalski opened 8 years ago
I just found out that clinfo shows that I have two devices (GPUs?) although physically I have one. Maybe that is the reason for the FAILED tests above.
But how to force the tests to use only one device?
$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (1800.8)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
The AMD platform includes both CPU and GPU devices. It's up to the application to choose the appropriate device, but I imagine it should default to the GPU device.
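One way to see what the two devices actually are is to tally the "Device Type" lines that clinfo prints. This is only a sketch: it assumes the "Device Type: CL_DEVICE_TYPE_..." line format shown in the output above, and count_device_types is a made-up helper name, not part of clinfo.

```shell
#!/bin/sh
# Tally the device types reported by clinfo.
# Assumes lines of the form "Device Type: CL_DEVICE_TYPE_GPU", as in the output above.
# count_device_types is a hypothetical helper name, not part of clinfo.
count_device_types() {
    # Read clinfo output on stdin; print "<count> <type>" for each distinct type.
    grep 'Device Type:' \
        | sed 's/.*Device Type:[[:space:]]*//' \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'
}

# Only call clinfo when it is actually installed.
if command -v clinfo >/dev/null 2>&1; then
    clinfo | count_device_types
fi
```

If the tally shows one CL_DEVICE_TYPE_CPU and one CL_DEVICE_TYPE_GPU entry, the second "device" is the CPU rather than a phantom GPU.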
Is it possible to limit the number of visible devices by setting an environment variable? With NVIDIA, for example, one can do that with export CUDA_VISIBLE_DEVICES=0.
For more details see https://github.com/BVLC/caffe/issues/2926
Bystander observation: I would think that it would be better to choose specifically which GPU to use, rather than to choose how many GPUs to use.
I suspect that Marcin only has a single GPU, but there are two devices: CPU + GPU.
Marcin,
If you want to disable the CPU device, you can set the environment variable CPU_MAX_COMPUTE_UNITS to 0, but I don’t think it will fix your problem. Can you make sure you are running the latest drivers from AMD? Version 1800 seems to be about 6 months old.
Jeff
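As a sketch of the suggestion above, the variable can be set for a single command instead of being exported for the whole session. CPU_MAX_COMPUTE_UNITS is taken from the message above; the wrapper name without_cpu_device is made up for illustration, and whether hiding the CPU device changes the test results is untested here.

```shell
#!/bin/sh
# Run one command with CPU_MAX_COMPUTE_UNITS=0 in its environment.
# The variable name comes from the message above; without_cpu_device is a
# hypothetical helper name. The calling shell's environment is left unchanged.
without_cpu_device() {
    env CPU_MAX_COMPUTE_UNITS=0 "$@"
}
```

For example, without_cpu_device clinfo can be used to check how many devices are reported, and without_cpu_device make runtest runs the tests under the same setting.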
I also have this issue.
Ok guys. Hold on, we will look into this soon. Junli
Could this be related to some failing tests in clBLAS?
Well, this time I made sure clBLAS had passed test-functional and test-short before I tried installing caffe, but I just tried running some of those tests again and they don't work anymore. I'm honestly not really sure if this means caffe is breaking clBLAS or if there is some user error on my part here.
Initialize OpenCL and clblas...
---- Advanced Micro Devices, Inc.
SetUp: about to create command queues
[==========] Running 715 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 203 tests from ERROR
[ RUN ] ERROR.InvalidCommandQueue
OpenCL error -36 on line 350 of /jenkins/workspace/workspace/Build_Linux_Master_clBLAS/Bitness/64/Configuration/Release/label/acml-build-lin2/src/library/blas/xgemm.cc
Segmentation fault (core dumped)
nathan@amdRig14://home/nathan/clBLAS/build/staging$
I have another system with a more minimal installation of clBLAS and caffe. I'm going to switch to that and see if clBLAS is still working there.
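For reference, OpenCL error -36 is CL_INVALID_COMMAND_QUEUE in the standard CL/cl.h header. That matches the name of the test being run (ERROR.InvalidCommandQueue), which suggests the error code itself is expected and the segmentation fault is the real failure. A small lookup for a few common codes, with a made-up helper name cl_error_name:

```shell
#!/bin/sh
# Map a few OpenCL status codes to their names from CL/cl.h.
# cl_error_name is a hypothetical helper; the table is deliberately partial.
cl_error_name() {
    case "$1" in
        0)   echo CL_SUCCESS ;;
        -1)  echo CL_DEVICE_NOT_FOUND ;;
        -5)  echo CL_OUT_OF_RESOURCES ;;
        -6)  echo CL_OUT_OF_HOST_MEMORY ;;
        -30) echo CL_INVALID_VALUE ;;
        -33) echo CL_INVALID_DEVICE ;;
        -34) echo CL_INVALID_CONTEXT ;;
        -36) echo CL_INVALID_COMMAND_QUEUE ;;
        *)   echo "unknown ($1)" ;;
    esac
}
```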
The clBLAS error you got from test-functional has been fixed in the latest develop branch (PR #214) by Timmy Liu. It still breaks further on, but the issue is much the same: the method returns instead of throwing an error, or something like that.
Right, I think I was trying to test the wrong version of clBLAS on that system. Where I am sitting now, I only have the current develop version, and I get this output:
./test-functional
[----------] 136 tests from QUEUE (67714 ms total)
[----------] Global test environment tear-down
[==========] 715 tests from 5 test cases ran. (330057 ms total)
[ PASSED ] 714 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] THREAD.sgemm
I don't recall any of these failing the first time I ran this test. I'm trying to reinstall clBLAS and now I'm getting this issue:
Linking Fortran executable ../staging/test-correctness
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:658: undefined reference to `cdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:673: undefined reference to `zdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:688: undefined reference to `cdotcsub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:703: undefined reference to `zdotcsub_'
collect2: error: ld returned 1 exit status
make[2]: *** [staging/test-correctness] Error 1
make[1]: *** [tests/CMakeFiles/test-correctness.dir/all] Error 2
So, I think what I have now is similar to these issues: https://github.com/clMathLibraries/clBLAS/issues/184 https://github.com/clMathLibraries/clBLAS/issues/142
I remember the clBLAS test-functional and test-short worked before I installed caffe, but I installed more blas libraries between when the clBLAS test worked, and when I did the runtest for caffe. Atlas for example.
This clBLAS issue seems to be caused by a conflict between different blas libraries, so I think getting those dependencies together for caffe after installing clBLAS might have introduced some sort of conflict.
Right now my plan is to go into the clBLAS cmake files and try to make sure it's picking up the same libblas.so it used when it was first installed.
The fact that introducing new BLAS libraries after installing clBLAS seems to have broken it retroactively spells trouble, though. I think what really needs to be done is to minimize the number of different BLAS libraries needed to install OpenCL-caffe and its dependencies.
Given that OpenCL-caffe needs a BLAS external to clBLAS, I suppose I should have tried to use the same one I used to install clBLAS, and then maybe this could have been avoided.
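A possible way to check which BLAS a clBLAS build picked up is to read the path out of CMakeCache.txt and resolve any symlinks. This is a sketch under assumptions: Netlib_BLAS_LIBRARY is the cache entry quoted later in this thread, the helper names are made up, and readlink -f assumes GNU coreutils (as on Ubuntu).

```shell
#!/bin/sh
# Print the BLAS library path recorded in a CMake cache file.
# Netlib_BLAS_LIBRARY is the cache entry quoted later in this thread;
# cached_blas_path and resolved_blas_path are hypothetical helper names.
cached_blas_path() {
    # $1: path to CMakeCache.txt
    sed -n 's/^Netlib_BLAS_LIBRARY:FILEPATH=//p' "$1"
}

# Resolve symlinks so the file actually being linked can be identified.
# readlink -f is provided by GNU coreutils (assumed here).
resolved_blas_path() {
    readlink -f "$(cached_blas_path "$1")"
}
```

On Debian and Ubuntu, /usr/lib/libblas.so is typically a symlink managed by the alternatives system, so installing another BLAS package afterwards can change what it points to. Comparing the output of resolved_blas_path CMakeCache.txt before and after installing the extra libraries would show whether that happened.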
Same thing here: R9 270X with a Xeon 1241 on Ubuntu 14.04. The clBLAS tests complete successfully.
Just installed Ubuntu 15.10. The same thing happens as before (on 14.04).
There is also a user, @doonny, in the original caffe issue who has the same problem running on a W9100.
I'm having a similar issue running on a W9100. I'm able to run the built-in lenet training script but am unable to run anything like 'caffe train -solver etc'.
Uhh? Maybe fix? No?
Is your setup still not working?
Haven't tested since last time. Don't think anything has changed.
Thanks for letting us know about this issue. I was on holiday break for the past two weeks. We will look into this soon.
Junli
Junli Gu--谷俊丽 Coordinated Science Lab University of Illinois at Urbana-Champaign
I managed to get past the following errors:
Linking Fortran executable ../staging/test-correctness
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:658: undefined reference to `cdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:673: undefined reference to `zdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:688: undefined reference to `cdotcsub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:703: undefined reference to `zdotcsub_'
collect2: error: ld returned 1 exit status
My solution might be way too hacky, but it works.
Context:
I installed blas using:
sudo apt-get install libopenblas-base libopenblas-dev
I observed that a new directory, /usr/lib/openblas-base, was created and that it contained a file named libblas.so. There was also a file of the same name in /usr/lib; diff confirmed that both files are identical. CMakeCache.txt confirmed that this is the linked library: Netlib_BLAS_LIBRARY:FILEPATH=/usr/lib/libblas.so
I opened clBLAS/src/tests/correctness/blas-lapack.c. The zdotu function is conditionally compiled depending on the OS. I ran elfread libblas.so | grep zdotusub_ and the symbol was not found. There was, however, a function cblas_zdotu_sub, which is the one the code calls when the OS is Apple's. Anyway, I replaced the respective lines with the calling convention used on the Apple platform, and it worked.
Please confirm whether this is reproducible; I would like to make my first ever PR :)
PS: I do not know the code inside out, and I have no idea why there are different signatures for different platforms. That's why I described the solution as a hack.
Regards, Sagar
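To reproduce the symbol check described above, one option is to list the library's dynamic symbols and report which expected names are absent. This is a sketch, not the original poster's exact command: missing_symbols is a made-up helper, and nm -D (GNU binutils) is assumed as the way to produce the listing.

```shell
#!/bin/sh
# Report which of the given symbol names do not appear in a symbol listing.
# Reads the listing on stdin (for example, the output of: nm -D libblas.so).
# missing_symbols is a hypothetical helper name.
missing_symbols() {
    listing=$(cat)
    for sym in "$@"; do
        # -w matches whole words only, so zdotusub_ does not match longer names.
        if ! printf '%s\n' "$listing" | grep -qw -- "$sym"; then
            echo "$sym"
        fi
    done
}
```

For example, nm -D /usr/lib/libblas.so | missing_symbols cdotusub_ zdotusub_ cdotcsub_ zdotcsub_ prints the names the linker would fail to resolve; running it again with cblas_zdotu_sub would show whether the CBLAS-style entry point is present instead.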
Bump?
I have a problem with make runtest failing on SGDSolver and NesterovSolver. I looked at the BVLC/caffe repository (https://github.com/BVLC/caffe/issues/3109), where somebody reported a problem coming from the same file, test_gradient_based_solver.cpp. According to the comments there, it was caused either by multiple GPUs being present in the system or by the fact that Intel MKL's floating-point operations (such as matrix multiplication) are non-deterministic by default.
Regarding my system, I am running Caffe cloned from github on 22nd of December 2015 on Ubuntu 15.10 with Radeon R9 290 (4GB) and i7-4770K CPU @ 3.50GHz, AMDAPPSDK-3.0. Four tests failed.
If anybody knows how to make them pass, or what causes the problem, that would be great.