Closed · Ultron-11 · closed 6 years ago
I tested the speed of FP16 on ResNet-50. With 8 cards, FP16 is only about twice as fast as FP32; is this normal? The only change I made was replacing the DataLayer with an InputLayer.
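For reference, the DataLayer-to-InputLayer swap mentioned here usually looks like the following prototxt fragment (a sketch only; the batch size and 3x224x224 input shape are assumptions for ResNet-50, not taken from the poster's files):

```
layer {
  name: "data"
  type: "Input"
  top: "data"
  # Assumed shape: batch 64, 3-channel 224x224 images (typical ResNet-50 input)
  input_param { shape: { dim: 64 dim: 3 dim: 224 dim: 224 } }
}
```

With an InputLayer, data is fed from the host at each iteration, so input-pipeline overhead can also affect the FP16/FP32 comparison.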
@Ultron-11 I got similar numbers. I guess it's normal; it matches many other benchmarks you can find online, such as this one: http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/09/27/deep-learning-on-v100.
@Ultron-11 @1duo - it actually depends on how the GPUs talk to each other: PCIe, NVLink, or NVSwitch?
@drnikolaev Thanks for your reply. My configuration is very similar to http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/09/27/deep-learning-on-v100: 8 V100 GPUs connected via PCIe. I got numbers similar to those in the article above: ~2300 img/sec for FP16 and ~1200 img/sec for FP32 with ResNet-50 on ImageNet. Are these numbers expected? Can we improve them further?
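Rough arithmetic on the throughput figures quoted above (a sketch that just restates the reported numbers; it is not a new measurement):

```python
# Throughput figures quoted above for 8x V100 (PCIe), ResNet-50 on ImageNet.
fp16_imgs_per_sec = 2300
fp32_imgs_per_sec = 1200

# Observed FP16-over-FP32 speedup, close to the ~2x being asked about.
speedup = fp16_imgs_per_sec / fp32_imgs_per_sec
print(f"FP16/FP32 speedup: {speedup:.2f}x")  # roughly 1.92x
```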
@1duo could you upload your prototxt file(s) here please? Also, here is a sample run on 8xV100 on NVLink (see the last line): https://github.com/NVIDIA/caffe/blob/models/RN50-FP16-20180201/resnet50-0.16.6-idl-fp16-88ep_10526.log
@1duo @Ultron-11 could you verify https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?
@drnikolaev I no longer have access to V100 machines. Can't help here, sorry for the inconvenience. Thanks.
I tested performance using the FP16 type; it seems that FP16 is not faster than FP32 but actually slower. Environment:
model: examples/cifar10/train_full.sh
Please have a look at the logs I have attached. fp16.log fp32.log
Notice that the number of iterations per second for FP16 is lower than for FP32.