dicecco1 / fpga_caffe


make runtest failing #3

Open sammhho opened 7 years ago

sammhho commented 7 years ago

Hi @dicecco1 ,

Came across this work on the SDAccel forum and read your paper, thanks for open-sourcing it! After tweaking the Makefile I was able to finish building the code. With SDAccel in sw_emu mode and the sw xclbins, make runtest passed these tests:
oclConvolutionLayerTest/1
oclConvolutionLayerTest/3
OCLPoolingLayerTest/1
OCLPoolingLayerTest/3

but it also failed some tests, like:
[----------] 6 tests from BlobSimpleTest/1, where TypeParam = double
[ RUN ] BlobSimpleTest/1.TestInitialization
[ OK ] BlobSimpleTest/1.TestInitialization (0 ms)
[ RUN ] BlobSimpleTest/1.TestPointersCPUOCL
src/caffe/test/test_blob.cpp:47: Failure
Value of: this->blob_preshaped_->ocl_data()
Actual: false
Expected: true
src/caffe/test/test_blob.cpp:49: Failure
Value of: this->blob_preshaped_->mutable_ocl_data()
Actual: false
Expected: true
[ FAILED ] BlobSimpleTest/1.TestPointersCPUOCL, where TypeParam = double (0 ms)

Another one, with the data mismatch lines repeating many times:
[----------] 3 tests from OCLLRNLayerTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN ] OCLLRNLayerTest/2.TestForwardAcrossChannelsLRN2
src/caffe/test/test_lrn_layer.cpp:614: Failure
The difference between this->blob_top_->cpu_data()[i] and top_reference.cpu_data()[i] is 0.10926561057567596, which exceeds this->epsilon_, where
this->blob_top_->cpu_data()[i] evaluates to 0,
top_reference.cpu_data()[i] evaluates to -0.10926561057567596, and
this->epsilon_ evaluates to 9.9999997473787516e-06.
...
[ FAILED ] OCLLRNLayerTest/2.TestForwardAcrossChannelsLRN2, where TypeParam = caffe::GPUDevice<float> (11173 ms)
[ RUN ] OCLLRNLayerTest/2.TestForwardAcrossChannelsLRN1
*** Aborted at 1501926253 (unix time) try "date -d @1501926253" if you are using GNU date ***
PC: @ 0x7f762f241f54 clWaitForEvents
*** SIGSEGV (@0x0) received by PID 31881 (TID 0x7f7621f11720) from PID 0; stack trace: ***
@ 0x3d0dc0f7e0 (unknown)
@ 0x7f762f241f54 clWaitForEvents
@ 0x7f762e1e23d1 caffe::OCLLRNLayer<>::Call_ocl()
...
and then the test ended due to the segmentation fault.

Any ideas? SDAccel version was 2016.1.

dicecco1 commented 7 years ago

Hi @sammhho, thanks for the interest in the project! Sorry for the delay; I've been wrapping up my master's lately (which includes making some very large updates to this project that should be pushed here soon). What changes did you make to the Makefile? Did you rebuild the lrn layers for the version of SDAccel that you're using? Also, what OS are you using?

This work was mostly tested using SDAccel 2015.3 running on CentOS (either of the SDAccel recommended CentOS distributions should be fine). I've used it with Ubuntu in the past but it takes quite a bit of effort to get it working correctly. I've tried running the specific tests on my end using SDAccel 2016.3 (software emulation only) and it seems to be working correctly.

Steps to reproduce:
In Makefile.config set USE_OCL := 1 and CPU_ONLY := 1 (I know this is a confusing setting; it's been updated in my private repository)

make all
make test

Build the kernels. I used the commands below (in the source directory):
xocc -t sw_emu --platform xilinx:adm-pcie-8k5:2ddr:3.2 --report estimate --nk lrn1_ac_layer:1 --kernel lrn1_ac_layer -s -o lrn1_ac_layer.xclbin lrn1_ac_layer.cl
xocc -t sw_emu --platform xilinx:adm-pcie-8k5:2ddr:3.2 --report estimate --nk lrn2_ac_layer:1 --kernel lrn2_ac_layer -s -o lrn2_ac_layer.xclbin lrn2_ac_layer.cl

Copy the .xclbin files into .build_release/opencl/src/caffe/layers/ :
cp src/caffe/ocl_caffe/lrn_ac/lrn1/lrn1_ac_layer.xclbin .build_release/opencl/src/caffe/layers/.
cp src/caffe/ocl_caffe/lrn_ac/lrn2/lrn2_ac_layer.xclbin .build_release/opencl/src/caffe/layers/.

Run the tests:
XCL_EMULATION_MODE=true ./build/test/test_lrn_layer.testbin

sammhho commented 7 years ago

Hi @dicecco1, thanks for replying. I wrapped the OpenCL calls with OCL_CHECK(), like those in the Xilinx SDAccel repo, and found the following on some setups (tried both CentOS and Ubuntu, SDAccel 2016.1 and 2017.1):

1) clGetDeviceIDs returned an error code with the flag CL_DEVICE_TYPE_ACCELERATOR; using CL_DEVICE_TYPE_ALL instead eliminated it,

2) clCreateProgramWithBinary returned error -42 (CL_INVALID_BINARY); re-compiling the XCLBINs with the SDAccel version on the machine solved it,

3) clBuildProgram with num_devices set to 0 gives an error; commenting out the call (since it doesn't seem to be needed in the Xilinx flow anyway) solved it. I didn't try other argument values.

4) during runtest, the clReleaseKernel and clReleaseProgram calls within XCLProgramLayer don't seem to do anything (which seems natural, because a new XCLProgramLayer instance can't free a previous XCLProgramLayer's kernel and program, since "this" is a different pointer). Eventually the Xilinx runtime complains that the device is already programmed and errors out. My solution was to release the kernel and program at the end of every ocl layer test, after which runtest finished without error.

Regards, good luck with your master's ;-)
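
For readers following along, the kind of OCL_CHECK() wrapping described in this comment can be sketched roughly as below. This is a minimal illustration, not the Xilinx macro and not the fpga_caffe code: the names ocl_error_name, ocl_check and OCL_CHECK_SKETCH are hypothetical, and the status codes are copied by value from the OpenCL header (CL/cl.h) so the snippet compiles without an OpenCL SDK installed.

```cpp
#include <stdexcept>
#include <string>

// Status codes copied by value from CL/cl.h (assumption: real host code
// would include the header and use the CL_* macros directly).
constexpr int kClSuccess = 0;          // CL_SUCCESS
constexpr int kClDeviceNotFound = -1;  // CL_DEVICE_NOT_FOUND
constexpr int kClInvalidValue = -30;   // CL_INVALID_VALUE
constexpr int kClInvalidDevice = -33;  // CL_INVALID_DEVICE
constexpr int kClInvalidBinary = -42;  // CL_INVALID_BINARY

// Hypothetical helper: map a few status codes to their header names.
inline std::string ocl_error_name(int err) {
  switch (err) {
    case kClSuccess:        return "CL_SUCCESS";
    case kClDeviceNotFound: return "CL_DEVICE_NOT_FOUND";
    case kClInvalidValue:   return "CL_INVALID_VALUE";
    case kClInvalidDevice:  return "CL_INVALID_DEVICE";
    case kClInvalidBinary:  return "CL_INVALID_BINARY";
    default:                return "unknown OpenCL error " + std::to_string(err);
  }
}

// Hypothetical helper: throw with the failing call text and its location
// whenever the status is anything other than success.
inline void ocl_check(int err, const char* call, const char* file, int line) {
  if (err != kClSuccess) {
    throw std::runtime_error(std::string(file) + ":" + std::to_string(line) +
                             ": " + call + " failed with " +
                             ocl_error_name(err));
  }
}

// Wraps an expression that evaluates to an OpenCL status code.
#define OCL_CHECK_SKETCH(call) ocl_check((call), #call, __FILE__, __LINE__)
```

In real host code the macro argument would be a call that returns cl_int, such as clGetDeviceIDs or clBuildProgram. Calls that report status through an errcode_ret out-parameter, such as clCreateProgramWithBinary, would pass that variable to the check after the call; with this sketch the -42 from point 2 would be reported as CL_INVALID_BINARY together with the file and line.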

dicecco1 commented 7 years ago

Thanks for the feedback, I was meaning to wrap the calls with OCL_CHECK too (they added those after 2016.1 I think).

  1. I'll have to look into this a bit more, I don't see this on my end for any versions.

  2. This one makes sense: the platform changes with each version of SDAccel. I think it might be better to not package the xclbins with the repo and just have people rebuild them.

  3. From your comment, I guess this call gives an error but isn't actually doing anything. I'll have to look into this more (it might be an artifact from using the prerelease versions of SDAccel).

  4. Yeah that makes sense, not sure why I did that...

Thanks a lot!
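
Point 4 in this exchange (kernels and programs not being released by the instance that created them) is the kind of problem that scoped ownership tends to avoid. The following is a minimal sketch under that assumption, not the fix used in fpga_caffe: ScopedHandle and make_scoped are hypothetical names, and the handle type and release function are generic so the snippet compiles without OpenCL.

```cpp
#include <utility>

// Hypothetical RAII wrapper: owns one handle and calls the supplied release
// function once when the owning wrapper is destroyed.
template <typename Handle, typename ReleaseFn>
class ScopedHandle {
 public:
  ScopedHandle(Handle handle, ReleaseFn release)
      : handle_(handle), release_(std::move(release)), owns_(true) {}

  ~ScopedHandle() {
    if (owns_) release_(handle_);
  }

  // Copying is disabled so two wrappers can never release the same handle.
  ScopedHandle(const ScopedHandle&) = delete;
  ScopedHandle& operator=(const ScopedHandle&) = delete;

  // Moving transfers ownership; the moved-from wrapper releases nothing.
  ScopedHandle(ScopedHandle&& other) noexcept
      : handle_(other.handle_),
        release_(std::move(other.release_)),
        owns_(other.owns_) {
    other.owns_ = false;
  }

  Handle get() const { return handle_; }

 private:
  Handle handle_;
  ReleaseFn release_;
  bool owns_;
};

// Convenience factory so the template arguments are deduced.
template <typename Handle, typename ReleaseFn>
ScopedHandle<Handle, ReleaseFn> make_scoped(Handle handle, ReleaseFn release) {
  return ScopedHandle<Handle, ReleaseFn>(handle, std::move(release));
}
```

With the OpenCL headers available, something like make_scoped(program, clReleaseProgram) followed by make_scoped(kernel, clReleaseKernel) inside each test would release both handles when the test body exits. Declaring the program guard first means the kernel is released before the program, since locals are destroyed in reverse order of declaration.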

OverDriveMC commented 7 years ago

Hello, I am interested in your project and am trying to run your program. I have installed SDAccel, but it needs a license, and I can't find the SDAccel license on xilinx.com/getProduct. Can you share how you got it? Thanks

dicecco1 commented 7 years ago

Do you have a development board? To get the license you need to pay for it; I think you would need to contact Xilinx about licensing directly. Alternatively, AWS F1 instances support SDAccel now, so you could give it a try there. I've used it for a smaller kernel, but I was having difficulty getting larger kernels to go through P&R last time I tried.

OverDriveMC commented 7 years ago

Thanks