[Closed] TimoSaemann closed this issue 6 years ago
I have similar issues on a Titan X card with the following build:
-- ******************* Caffe Configuration Summary *******************
-- General:
-- Version : 0.17.0
-- Git : v0.17.0-11-g1044c5e
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Build type : Release
--
-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- TEST_FP16 : ON
--
-- Dependencies:
-- BLAS : Yes (Open)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 2.6.1)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 2.4.9.1)
-- JPEGTurbo : Yes
-- CUDA : Yes (ver. 9.1)
--
-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_61
-- cuDNN : Yes (ver. 7.1)
-- NCCL : Yes (ver. ..)
-- NVML : /usr/lib/nvidia-390/libnvidia-ml.so
@TimoSaemann could you please try a real inference run with solver_data_type: FLOAT16? Also, how does it look with more compute-intensive nets like ResNet-50? Also, this might be an issue:
I0420 20:14:03.117233 14380 gpu_memory.cpp:107] Total memory: 8233730048, Free: 2097709056, dev_info[0]: total=8233730048 free=2097709056
About 6 GB is missing.
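For reference, the gap can be computed directly from the numbers in the gpu_memory.cpp line above:

```python
total = 8233730048  # bytes reported as total device memory
free = 2097709056   # bytes reported as free
used = total - free
print(round(used / 2**30, 2))  # prints 5.71 (GiB already allocated)
```

So roughly 5.7 GiB is in use before Caffe's run even starts, which matches the observation.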
@andredubbel This table actually explains why (see column 6.1): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions
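For anyone skimming the thread: the relevant row of that table lists, for compute capability 6.1 (Pascal GTX cards), a native fp16 add/multiply/FMA throughput of 2 results per clock per SM versus 128 for fp32. A quick sanity check of the resulting peak slowdown:

```python
fp32_per_clock = 128  # fp32 FMA results/clock/SM on cc 6.1 (per the linked table)
fp16_per_clock = 2    # native fp16 FMA results/clock/SM on cc 6.1
print(fp32_per_clock // fp16_per_clock)  # prints 64
```

In other words, naive native-fp16 math on these cards has roughly 64x lower peak arithmetic throughput than fp32, which is why a framework that does not fall back to fp32 math looks so slow.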
@drnikolaev Aha! Thank you. I've been looking for a table like that!
@andredubbel @drnikolaev This surprised us a lot when we first benchmarked NVIDIA Caffe v0.15 on the GTX 1080. I believe NVIDIA Caffe v0.17 now falls back to fp32 arithmetic, so this is no longer easy to spot.
@psyhtest it might fall back when it finds out that fp32 is faster. On Volta it chooses Tensor Cores. Please file a bug if you see the opposite. Thank you!
Hi @drnikolaev
Comparing fp16 and fp32 runtimes on the TX2, I see no difference in the forward pass; only the backward pass is faster with fp16. Why is the forward pass not accelerated?
As an example, I use bvlc_alexnet. Please have a look at the logs I have attached.
time_fp16.log time_fp32.log cmake.log
Thanks a lot, Timo
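(For anyone trying to reproduce this comparison: a minimal sketch of the NVCaffe per-net fp16 settings being toggled between the two runs. The field names below are NVCaffe-specific net-prototxt parameters; the model file is whatever deploy prototxt you are timing, e.g. bvlc_alexnet's.)

```
# Near the top of the net prototxt for the fp16 run;
# omit these lines (defaults are FLOAT) for the fp32 run.
default_forward_type:  FLOAT16
default_backward_type: FLOAT16
default_forward_math:  FLOAT16
default_backward_math: FLOAT16
```

Timing itself can then be done with the stock tool, e.g. `caffe time -model deploy.prototxt -gpu 0`, plus solver_data_type: FLOAT16 in the solver as suggested earlier in the thread if a training run is measured.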