[Closed] TimoSaemann closed this issue 6 years ago
I have similar issues on a Titan X card with the following build:
-- ******************* Caffe Configuration Summary *******************
-- General:
-- Version : 0.17.0
-- Git : v0.17.0-11-g1044c5e
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Build type : Release
--
-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- TEST_FP16 : ON
--
-- Dependencies:
-- BLAS : Yes (Open)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 2.6.1)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 2.4.9.1)
-- JPEGTurbo : Yes
-- CUDA : Yes (ver. 9.1)
--
-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_61
-- cuDNN : Yes (ver. 7.1)
-- NCCL : Yes (ver. ..)
-- NVML : /usr/lib/nvidia-390/libnvidia-ml.so
@TimoSaemann could you please try a real inference run with solver_data_type: FLOAT16? Also, how does it look with more compute-intensive nets like ResNet-50? Also, this might be an issue:
I0420 20:14:03.117233 14380 gpu_memory.cpp:107] Total memory: 8233730048, Free: 2097709056, dev_info[0]: total=8233730048 free=2097709056
About 6 GB is missing.
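For reference, the gap can be computed directly from the numbers in the gpu_memory.cpp line above:

```python
total = 8233730048  # bytes reported as total device memory
free = 2097709056   # bytes reported as free
used = total - free
print(round(used / 2**30, 2))  # prints 5.71 (GiB already allocated)
```

So roughly 5.7 GiB is in use before Caffe's run even starts, which matches the observation.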
@andredubbel This table actually explains why (see column 6.1): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions
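For anyone skimming the thread: the relevant row of that table lists, for compute capability 6.1 (Pascal GTX cards), a native fp16 add/multiply/FMA throughput of 2 results per clock per SM versus 128 for fp32. A quick sanity check of the resulting peak slowdown:

```python
fp32_per_clock = 128  # fp32 FMA results/clock/SM on cc 6.1 (per the linked table)
fp16_per_clock = 2    # native fp16 FMA results/clock/SM on cc 6.1
print(fp32_per_clock // fp16_per_clock)  # prints 64
```

In other words, naive native-fp16 math on these cards has roughly 64x lower peak arithmetic throughput than fp32, which is why a framework that does not fall back to fp32 math looks so slow.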
@drnikolaev Aha! Thank you. I've been looking for a table like that!
@andredubbel @drnikolaev This surprised us a lot when we first benchmarked NVIDIA Caffe v0.15 on the GTX 1080. I believe NVIDIA Caffe v0.17 now falls back to fp32 arithmetic, so this is no longer easy to spot.
@psyhtest it might fall back when it finds out that fp32 is faster. On Volta it chooses Tensor Cores. Please file a bug if you see the opposite. Thank you!
Hi @drnikolaev
Comparing fp16 and fp32 runtimes on the TX2, I see no difference in the forward pass; only the backward pass is faster with fp16. Why is the forward pass not accelerated?
As an example, I use bvlc_alexnet. Please have a look at the logs I have attached.
time_fp16.log time_fp32.log cmake.log
Thanks a lot, Timo
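(For anyone trying to reproduce this comparison: a minimal sketch of the NVCaffe per-net fp16 settings being toggled between the two runs. The field names below are NVCaffe-specific net-prototxt parameters; the model file is whatever deploy prototxt you are timing, e.g. bvlc_alexnet's.)

```
# Near the top of the net prototxt for the fp16 run;
# omit these lines (defaults are FLOAT) for the fp32 run.
default_forward_type:  FLOAT16
default_backward_type: FLOAT16
default_forward_math:  FLOAT16
default_backward_math: FLOAT16
```

Timing itself can then be done with the stock tool, e.g. `caffe time -model deploy.prototxt -gpu 0`, plus solver_data_type: FLOAT16 in the solver as suggested earlier in the thread if a training run is measured.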