NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

can't use fp32 type and fp16 math in training alexnet #514

Closed Hannah-xxl closed 6 years ago

Hannah-xxl commented 6 years ago

I changed solver_fp16.prototxt and train_val_fp16.prototxt as shown below, wanting to compare with TensorFlow's mixed precision (fp32 storage type but fp16 math).

solver_fp16.prototxt:

```
solver_data_type: FLOAT
```

train_val_fp16.prototxt:

```
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT16
default_backward_math: FLOAT16

layer {
  forward_math: FLOAT16
  backward_math: FLOAT16
  forward_type: FLOAT
  backward_type: FLOAT
  name: "data"
  type: "Data"
  ...
```

Then when I run models/alexnet_owt/train_alexnet_fp16.sh, I get the following error from the conv1 layer during network initialization:

```
I0608 18:41:57.315078 37519 net.cpp:199] Created Layer conv1 (2)
I0608 18:41:57.315086 37519 net.cpp:571] conv1 <- data
I0608 18:41:57.315093 37519 net.cpp:541] conv1 -> conv1
F0608 18:41:57.910822 37519 cudnn_conv_layer.cpp:272] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED, device 0
*** Check failure stack trace: ***
    @ 0x7f0137a8f5cd google::LogMessage::Fail()
    @ 0x7f0137a91433 google::LogMessage::SendToLog()
    @ 0x7f0137a8f15b google::LogMessage::Flush()
    @ 0x7f0137a91e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f013886832e caffe::CuDNNConvolutionLayer<>::AllocateWorkspace()
    @ 0x7f013886d4e7 caffe::CuDNNConvolutionLayer<>::Reshape()
    @ 0x7f0138624422 caffe::Net::Init()
    @ 0x7f013862632e caffe::Net::Net()
    @ 0x7f013860e047 caffe::Solver::InitTrainNet()
    @ 0x7f013860e5e4 caffe::Solver::Init()
    @ 0x7f013860eab2 caffe::Solver::Solver()
    @ 0x7f01385ffac6 caffe::Creator_SGDSolver()
    @ 0x418ea6 caffe::SolverRegistry::CreateSolver()
    @ 0x411b95 train()
    @ 0x40c778 main
    @ 0x7f0136c03830 __libc_start_main
    @ 0x40d1f9 _start
    @ (nil) (unknown)
Aborted (core dumped)
```

If I set the type to fp16 and the math to fp32, it works well. So does NVCaffe not support this precision mode (fp32 type with fp16 math)? I am not familiar with Caffe's code, but I hope somebody here can explain this. Thank you.
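For reference, a minimal sketch of that working combination, assuming the stock NVCaffe fp16 settings (fp16 storage with fp32 math; the exact shipped files may differ):

```
solver_data_type: FLOAT
...
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
```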

drnikolaev commented 6 years ago

Hi @brave-hannah, when you set

```
solver_data_type: FLOAT
...
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT16
default_backward_math: FLOAT16
```

You actually set NVCaffe to complete fp32 mode, except for the cuDNN math in the convolution layers. The error above is probably the result of an OOM, because fp32 data requires twice as much space as fp16, plus extra storage for convolution blobs (plus conversion cost).
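As a rough illustration of the scale involved (the blob shape is taken from the training log later in this thread), the conv1 input blob alone doubles when stored as fp32:

```
conv1 input: 512 x 3 x 224 x 224 = 77,070,336 values
as fp16: 77,070,336 * 2 bytes ≈ 147 MiB
as fp32: 77,070,336 * 4 bytes ≈ 294 MiB
```

Every activation and gradient blob in the net scales the same way, before counting the cuDNN convolution workspace and any fp32/fp16 conversion buffers.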

Here are the two best modes:
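A hedged sketch of what those two modes presumably look like, based on the defaults NVCaffe ships for AlexNet-OWT rather than drnikolaev's exact list:

```
# Mode 1 (assumed): fp16 storage, fp32 math - the stock train_val_fp16.prototxt setup
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT

# Mode 2 (assumed): pure fp16 storage and math
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
```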

Hannah-xxl commented 6 years ago

Thank you @drnikolaev. But I'm still confused about the OOM error you mentioned. I'm using a P4 accelerator, which has 8 GB of memory, so how could an OOM happen? Could you show me the piece of code in NVCaffe that probably causes the OOM?
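For orientation, the fatal check in the stack trace comes from `CuDNNConvolutionLayer<>::AllocateWorkspace()` in `cudnn_conv_layer.cpp`, where NVCaffe sizes and allocates the cuDNN convolution workspace. Below is a hypothetical standalone sketch (not NVCaffe's actual source) of the cuDNN calls involved; the shapes mirror the failing conv1, and on cuDNN 6/7 the workspace query itself tends to return `CUDNN_STATUS_NOT_SUPPORTED` (9) when asked for fp32 tensors with fp16 compute, which would match the status code in the logs rather than an OOM:

```cpp
// Hypothetical repro sketch, NOT NVCaffe's actual source.
// Shapes match the failing run: input 512x3x224x224, 96 filters of 11x11, stride 4.
#include <cudnn.h>
#include <cstdio>

#define CHECK_CUDNN(x) do { cudnnStatus_t s = (x); \
  if (s != CUDNN_STATUS_SUCCESS) { \
    std::fprintf(stderr, "cuDNN error %d at line %d\n", (int)s, __LINE__); return 1; } } while (0)

int main() {
  cudnnHandle_t h;
  CHECK_CUDNN(cudnnCreate(&h));
  cudnnTensorDescriptor_t x, y;
  cudnnFilterDescriptor_t w;
  cudnnConvolutionDescriptor_t c;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&y));
  CHECK_CUDNN(cudnnCreateFilterDescriptor(&w));
  CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&c));
  // fp32 tensors...
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 512, 3, 224, 224));
  CHECK_CUDNN(cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 96, 3, 11, 11));
  // ...but fp16 compute, requested via the convolution descriptor's math type:
  CHECK_CUDNN(cudnnSetConvolution2dDescriptor(c, 0, 0, 4, 4, 1, 1,
                                              CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF));
  int n, ch, ht, wd;
  CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(c, x, w, &n, &ch, &ht, &wd));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, ch, ht, wd));
  size_t ws = 0;
  // The workspace-size query is where a not-supported type/math combination surfaces:
  CHECK_CUDNN(cudnnGetConvolutionForwardWorkspaceSize(
      h, x, w, c, y, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, &ws));
  std::printf("workspace bytes: %zu\n", ws);
  return 0;
}
```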

drnikolaev commented 6 years ago

@brave-hannah to answer that I need to reproduce the issue. Please upload the complete log here.

Hannah-xxl commented 6 years ago

I ran this test on a bare-metal server with two P4s, running Ubuntu 16.04, CUDA 8.0, cuDNN 6.0.21, CUDA driver 375.26, and NVCaffe 0.17. The testing script is from NVCaffe,

./models/alexnet_owt/train_alexnet_fp16.sh

and I changed the contents of solver_fp16.prototxt and train_val_fp16.prototxt as described above. Here is the full output of the testing script (a rather long log). @drnikolaev

```
root@ubuntu150:/home/nvcaffe-0.17# sh models/alexnet_owt/train_alexnet_fp16.sh
I0611 21:00:20.998628 15974 common.cpp:475] GPU 0 'Tesla P4' has compute capability 6.1
I0611 21:00:20.999625 15974 common.cpp:475] GPU 1 'Tesla P4' has compute capability 6.1
I0611 21:00:21.500021 15974 caffe.cpp:680] This is NVCaffe 0.17.0 started at Mon Jun 11 21:00:20 2018
I0611 21:00:21.500053 15974 caffe.cpp:682] CuDNN version: 6021
I0611 21:00:21.500057 15974 caffe.cpp:683] CuBLAS version: 8000
I0611 21:00:21.500061 15974 caffe.cpp:684] CUDA version: 8000
I0611 21:00:21.500063 15974 caffe.cpp:685] CUDA driver version: 8000
I0611 21:00:21.500068 15974 caffe.cpp:686] Arguments:

I0611 21:00:21.949667 15974 gpu_memory.cpp:105] GPUMemory::Manager initialized
I0611 21:00:21.950287 15974 gpu_memory.cpp:107] Total memory: 7975272448, Free: 7692156928, dev_info0: total=7975272448 free=7692156928
I0611 21:00:21.950770 15974 gpu_memory.cpp:107] Total memory: 7975272448, Free: 7692156928, dev_info1: total=7975272448 free=7846297600
I0611 21:00:21.950779 15974 caffe.cpp:222] Using GPUs 0, 1
I0611 21:00:21.951020 15974 caffe.cpp:226] GPU 0: Tesla P4
I0611 21:00:21.951249 15974 caffe.cpp:226] GPU 1: Tesla P4
I0611 21:00:21.951544 15974 solver.cpp:40] Solver data type: FLOAT
I0611 21:00:21.951671 15974 common.cpp:192] New stream 0x11b4c3a0, device 0, thread 15974
I0611 21:00:21.961113 15974 common.cpp:60] 0 New Caffe instance 0xe7aa760, count 1, thread 15974, tid 15974
I0611 21:00:21.961202 15974 solver.cpp:43] Initializing solver from parameters: test_iter: 15 test_interval: 5000 base_lr: 0.02 display: 100 max_iter: 1500 lr_policy: "poly" power: 2 momentum: 0.9 weight_decay: 0.0005 snapshot: 500000 snapshot_prefix: "models/alexnet_owt/snapshots/alexnet_fp16" solver_mode: GPU device_id: 0 random_seed: 1 net: "models/alexnet_owt/train_val_fp16.prototxt" train_state { level: 0 stage: "" } snapshot_after_train: true test_initialization: false min_lr: 5e-06 solver_data_type: FLOAT
I0611 21:00:21.961366 15974 solver.cpp:84] Creating training net from net file: models/alexnet_owt/train_val_fp16.prototxt
I0611 21:00:21.962152 15974 net.cpp:456] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0611 21:00:21.962203 15974 net.cpp:456] The NetState phase (0) differed from the phase (1) specified by a rule in layer top-5
I0611 21:00:21.962406 15974 net.cpp:79] Initializing net from parameters:
name: "AlexNet-OWT-fp16"
state { phase: TRAIN level: 0 stage: "" }
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT16
default_backward_math: FLOAT16
layer { name: "data" type: "Data" top: "data" top: "label" include { phase: TRAIN } transform_param { mirror: true crop_size: 224 mean_file: "data/ilsvrc12/imagenet_mean.binaryproto" } data_param { source: "examples/imagenet/ilsvrc12_train_lmdb" batch_size: 512 backend: LMDB } forward_type: FLOAT backward_type: FLOAT forward_math: FLOAT16 backward_math: FLOAT16 }
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" convolution_param { num_output: 96 kernel_size: 11 stride: 4 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "relu1" type: "ReLU" bottom: "conv1" top: "conv1" }
layer { name: "pool1" type: "Pooling" bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "conv2" type: "Convolution" bottom: "pool1" top: "conv2" convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu2" type: "ReLU" bottom: "conv2" top: "conv2" }
layer { name: "pool2" type: "Pooling" bottom: "conv2" top: "pool2" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "conv3" type: "Convolution" bottom: "pool2" top: "conv3" convolution_param { num_output: 384 pad: 1 kernel_size: 3 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "relu3" type: "ReLU" bottom: "conv3" top: "conv3" }
layer { name: "conv4" type: "Convolution" bottom: "conv3" top: "conv4" convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu4" type: "ReLU" bottom: "conv4" top: "conv4" }
layer { name: "conv5" type: "Convolution" bottom: "conv4" top: "conv5" convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu5" type: "ReLU" bottom: "conv5" top: "conv5" }
layer { name: "pool5" type: "Pooling" bottom: "conv5" top: "pool5" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "fc6" type: "InnerProduct" bottom: "pool5" top: "fc6" inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu6" type: "ReLU" bottom: "fc6" top: "fc6" }
layer { name: "drop6" type: "Dropout" bottom: "fc6" top: "fc6" dropout_param { dropout_ratio: 0.5 } }
layer { name: "fc7" type: "InnerProduct" bottom: "fc6" top: "fc7" inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu7" type: "ReLU" bottom: "fc7" top: "fc7" }
layer { name: "drop7" type: "Dropout" bottom: "fc7" top: "fc7" dropout_param { dropout_ratio: 0.5 } }
layer { name: "fc8" type: "InnerProduct" bottom: "fc7" top: "fc8" inner_product_param { num_output: 1000 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "loss" type: "SoftmaxWithLoss" bottom: "fc8" bottom: "label" top: "loss" }
layer { name: "top-1" type: "Accuracy" bottom: "fc8" bottom: "label" top: "accuracy/top-1" accuracy_param { top_k: 1 } }
I0611 21:00:21.963246 15974 net.cpp:132] Setting types for Layer data
I0611 21:00:21.963266 15974 layer_factory.hpp:172] Creating layer 'data' of type 'Data'
I0611 21:00:21.963280 15974 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0611 21:00:21.963601 15974 internal_thread.cpp:19] Starting 1 internal thread(s) on device 0
I0611 21:00:21.963784 15974 data_transformer.cpp:40] Loading mean file from: data/ilsvrc12/imagenet_mean.binaryproto
I0611 21:00:21.963932 16027 common.cpp:192] New stream 0x7f9b0c000ca0, device 0, thread 16027
I0611 21:00:21.964792 16027 common.cpp:60] 0 New Caffe instance 0x7f9b0c000c00, count 2, thread 16027, tid 16027
I0611 21:00:21.964862 16027 internal_thread.cpp:78] Started internal thread 16027 on device 0, rank 0
I0611 21:00:21.964895 16027 blocking_queue.cpp:40] Data layer prefetch queue empty
I0611 21:00:21.984944 15974 data_transformer.cpp:40] Loading mean file from: data/ilsvrc12/imagenet_mean.binaryproto
I0611 21:00:22.003049 15974 net.cpp:199] Created Layer data (0)
I0611 21:00:22.003073 15974 net.cpp:541] data -> data
I0611 21:00:22.003228 15974 net.cpp:541] data -> label
I0611 21:00:22.003396 15974 data_reader.cpp:58] Sample Data Reader threads: 1, out queues: 1, depth: 512
I0611 21:00:22.003655 15974 internal_thread.cpp:19] Starting 1 internal thread(s) on device 0
I0611 21:00:22.003823 16028 common.cpp:192] New stream 0x7f9b00000ca0, device 0, thread 16028
I0611 21:00:22.004560 16028 common.cpp:60] 0 New Caffe instance 0x7f9b00000c00, count 3, thread 16028, tid 16028
I0611 21:00:22.004616 16028 internal_thread.cpp:78] Started internal thread 16028 on device 0, rank 0
I0611 21:00:22.004721 16028 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0611 21:00:22.005476 15974 data_layer.cpp:199] 0 Output data size: 512, 3, 224, 224
I0611 21:00:22.005506 15974 data_transformer.cpp:40] Loading mean file from: data/ilsvrc12/imagenet_mean.binaryproto
I0611 21:00:22.025029 15974 data_transformer.cpp:40] Loading mean file from: data/ilsvrc12/imagenet_mean.binaryproto
I0611 21:00:22.043567 15974 internal_thread.cpp:19] Starting 1 internal thread(s) on device 0
I0611 21:00:22.043701 15974 net.cpp:259] Setting up data
I0611 21:00:22.043759 15974 net.cpp:266] TRAIN Top shape for layer 0 'data' 512 3 224 224 (77070336)
I0611 21:00:22.043818 16029 common.cpp:192] New stream 0x7f99f8000ca0, device 0, thread 16029
I0611 21:00:22.044440 16029 common.cpp:60] 0 New Caffe instance 0x7f99f8000c00, count 4, thread 16029, tid 16029
I0611 21:00:22.044474 16029 internal_thread.cpp:78] Started internal thread 16029 on device 0, rank 0
I0611 21:00:22.044483 15974 net.cpp:266] TRAIN Top shape for layer 0 'data' 512 (512)
I0611 21:00:22.044539 15974 net.cpp:132] Setting types for Layer label_data_1_split
I0611 21:00:22.044554 15974 layer_factory.hpp:172] Creating layer 'label_data_1_split' of type 'Split'
I0611 21:00:22.044574 15974 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0611 21:00:22.044642 15974 net.cpp:199] Created Layer label_data_1_split (1)
I0611 21:00:22.044693 15974 net.cpp:571] label_data_1_split <- label
I0611 21:00:22.044764 15974 net.cpp:541] label_data_1_split -> label_data_1_split_0
I0611 21:00:22.044826 15974 net.cpp:541] label_data_1_split -> label_data_1_split_1
I0611 21:00:22.045105 15974 net.cpp:259] Setting up label_data_1_split
I0611 21:00:22.045148 15974 net.cpp:266] TRAIN Top shape for layer 1 'label_data_1_split' 512 (512)
I0611 21:00:22.045178 15974 net.cpp:266] TRAIN Top shape for layer 1 'label_data_1_split' 512 (512)
I0611 21:00:22.045202 15974 net.cpp:132] Setting types for Layer conv1
I0611 21:00:22.045219 15974 layer_factory.hpp:172] Creating layer 'conv1' of type 'Convolution'
I0611 21:00:22.045241 15974 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0611 21:00:22.045390 15974 net.cpp:199] Created Layer conv1 (2)
I0611 21:00:22.045421 15974 net.cpp:571] conv1 <- data
I0611 21:00:22.045469 15974 net.cpp:541] conv1 -> conv1
I0611 21:00:22.045662 15974 common.cpp:192] New stream 0x12873800, device 0, thread 15974
F0611 21:00:22.694185 15974 cudnn_conv_layer.cpp:272] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED, device 0
*** Check failure stack trace: ***
    @ 0x7f9bc382b5cd google::LogMessage::Fail()
    @ 0x7f9bc382d433 google::LogMessage::SendToLog()
    @ 0x7f9bc382b15b google::LogMessage::Flush()
    @ 0x7f9bc382de1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f9bc4cede77 caffe::CuDNNConvolutionLayer<>::AllocateWorkspace()
    @ 0x7f9bc4ce9626 caffe::CuDNNConvolutionLayer<>::Reshape()
    @ 0x7f9bc49f5199 caffe::LayerBase::SetUp()
    @ 0x7f9bc49e7c39 caffe::Net::Init()
    @ 0x7f9bc49e55c9 caffe::Net::Net()
    @ 0x7f9bc49cabf6 caffe::Solver::InitTrainNet()
    @ 0x7f9bc49ca37b caffe::Solver::Init()
    @ 0x7f9bc49c9cec caffe::Solver::Solver()
    @ 0x7f9bc49ab115 caffe::SGDSolver<>::SGDSolver()
    @ 0x7f9bc49bbed5 caffe::Creator_SGDSolver()
    @ 0x44a252 caffe::SolverRegistry::CreateSolver()
    @ 0x442412 train()
    @ 0x447304 main
    @ 0x7f9bc299f830 __libc_start_main
    @ 0x440df9 _start
    @ (nil) (unknown)
Aborted (core dumped)
```

drnikolaev commented 6 years ago

@brave-hannah `I0611 21:00:21.500053 15974 caffe.cpp:682] CuDNN version: 6021` - before we go any further, please try to upgrade it. Also, upgrading to CUDA 9 might be a good idea too.

Hannah-xxl commented 6 years ago

We tried this test on another of our servers, with CUDA 9 and cuDNN 7.1.4 installed, and still got the same error. Here is the output. This server runs NVCaffe 0.16.3 rather than 0.17. @drnikolaev

```
I0612 14:09:30.037569 209399 caffe.cpp:470] This is NVCaffe 0.16.3 started at Tue Jun 12 14:09:29 2018
I0612 14:09:30.037778 209399 caffe.cpp:473] CuDNN version: 7104
I0612 14:09:30.037783 209399 caffe.cpp:474] CuBLAS version: 9000
I0612 14:09:30.037787 209399 caffe.cpp:475] CUDA version: 9000
I0612 14:09:30.037791 209399 caffe.cpp:476] CUDA driver version: 9000
I0612 14:09:30.040922 209399 gpu_memory.cpp:159] GPUMemory::Manager initialized with Caching (CUB) GPU Allocator
I0612 14:09:30.041784 209399 gpu_memory.cpp:161] Total memory: 16936861696, Free: 16431448064, dev_info[0]: total=16936861696 free=16431448064
I0612 14:09:30.041806 209399 caffe.cpp:198] Using GPUs 0
I0612 14:09:30.042361 209399 caffe.cpp:203] GPU 0: Tesla V100-SXM2-16GB
I0612 14:09:30.042665 209399 solver.cpp:42] Solver data type: FLOAT
I0612 14:09:30.042739 209399 solver.cpp:45] Initializing solver from parameters: test_iter: 195 test_interval: 5000 base_lr: 0.02 display: 100 max_iter: 125000 lr_policy: "poly" power: 2 momentum: 0.9 weight_decay: 0.0005 snapshot: 500000 snapshot_prefix: "models/alexnet_owt/snapshots/alexnet_fp16" solver_mode: GPU device_id: 0 random_seed: 1 net: "models/alexnet_owt/train_val_fp16.prototxt" snapshot_after_train: true test_initialization: false min_lr: 5e-06 solver_data_type: FLOAT
I0612 14:09:30.070838 209399 solver.cpp:86] Creating training net from net file: models/alexnet_owt/train_val_fp16.prototxt
I0612 14:09:30.071346 209399 net.cpp:441] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0612 14:09:30.071377 209399 net.cpp:441] The NetState phase (0) differed from the phase (1) specified by a rule in layer top-5
I0612 14:09:30.071661 209399 net.cpp:70] Initializing net from parameters:
name: "AlexNet-OWT-fp16"
state { phase: TRAIN }
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT16
default_backward_math: FLOAT16
layer { name: "data" type: "Data" top: "data" top: "label" include { phase: TRAIN } transform_param { mirror: true crop_size: 227 mean_file: "data/ilsvrc12/imagenet_mean.binaryproto" } data_param { source: "examples/imagenet/ilsvrc12_train_lmdb" batch_size: 1024 backend: LMDB } }
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" convolution_param { num_output: 96 kernel_size: 11 stride: 4 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "relu1" type: "ReLU" bottom: "conv1" top: "conv1" }
layer { name: "pool1" type: "Pooling" bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "conv2" type: "Convolution" bottom: "pool1" top: "conv2" convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu2" type: "ReLU" bottom: "conv2" top: "conv2" }
layer { name: "pool2" type: "Pooling" bottom: "conv2" top: "pool2" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "conv3" type: "Convolution" bottom: "pool2" top: "conv3" convolution_param { num_output: 384 pad: 1 kernel_size: 3 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "relu3" type: "ReLU" bottom: "conv3" top: "conv3" }
layer { name: "conv4" type: "Convolution" bottom: "conv3" top: "conv4" convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu4" type: "ReLU" bottom: "conv4" top: "conv4" }
layer { name: "conv5" type: "Convolution" bottom: "conv4" top: "conv5" convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 1 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu5" type: "ReLU" bottom: "conv5" top: "conv5" }
layer { name: "pool5" type: "Pooling" bottom: "conv5" top: "pool5" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layer { name: "fc6" type: "InnerProduct" bottom: "pool5" top: "fc6" inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu6" type: "ReLU" bottom: "fc6" top: "fc6" }
layer { name: "drop6" type: "Dropout" bottom: "fc6" top: "fc6" dropout_param { dropout_ratio: 0.5 } }
layer { name: "fc7" type: "InnerProduct" bottom: "fc6" top: "fc7" inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 0.1 } } }
layer { name: "relu7" type: "ReLU" bottom: "fc7" top: "fc7" }
layer { name: "drop7" type: "Dropout" bottom: "fc7" top: "fc7" dropout_param { dropout_ratio: 0.5 } }
layer { name: "fc8" type: "InnerProduct" bottom: "fc7" top: "fc8" inner_product_param { num_output: 1000 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layer { name: "loss" type: "SoftmaxWithLoss" bottom: "fc8" bottom: "label" top: "loss" }
layer { name: "top-1" type: "Accuracy" bottom: "fc8" bottom: "label" top: "accuracy/top-1" accuracy_param { top_k: 1 } }
I0612 14:09:30.071903 209399 layer_factory.hpp:136] Creating layer 'data' of type 'Data'
I0612 14:09:30.071923 209399 layer_factory.hpp:148] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0612 14:09:30.074594 209399 net.cpp:182] Created Layer data (0)
I0612 14:09:30.074618 209399 net.cpp:528] data -> data
I0612 14:09:30.074648 209399 net.cpp:528] data -> label
I0612 14:09:30.074671 209399 data_transformer.cpp:26] Loading mean file from: data/ilsvrc12/imagenet_mean.binaryproto
I0612 14:09:30.085580 209399 data_reader.cpp:52] Sample Data Reader threads: 1, out queues: 1, depth: 1024
I0612 14:09:30.085893 209399 internal_thread.cpp:19] Starting 1 internal thread(s) on device 0
I0612 14:09:30.088255 209528 db_lmdb.cpp:35] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0612 14:09:30.090051 209399 data_layer.cpp:184] [0] ReshapePrefetch 1024, 3, 227, 227
I0612 14:09:30.090131 209399 data_layer.cpp:208] [0] Output data size: 1024, 3, 227, 227
I0612 14:09:30.090145 209399 internal_thread.cpp:19] Starting 1 internal thread(s) on device 0
I0612 14:09:30.090268 209399 net.cpp:243] Setting up data
I0612 14:09:30.090288 209399 net.cpp:250] TRAIN Top shape for layer 0 'data' 1024 3 227 227 (158297088)
I0612 14:09:30.090304 209399 net.cpp:250] TRAIN Top shape for layer 0 'data' 1024 (1024)
I0612 14:09:30.090320 209399 layer_factory.hpp:136] Creating layer 'label_data_1_split' of type 'Split'
I0612 14:09:30.090332 209399 layer_factory.hpp:148] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0612 14:09:30.090350 209399 net.cpp:182] Created Layer label_data_1_split (1)
I0612 14:09:30.090361 209399 net.cpp:559] label_data_1_split <- label
I0612 14:09:30.090384 209399 net.cpp:528] label_data_1_split -> label_data_1_split_0
I0612 14:09:30.090397 209399 net.cpp:528] label_data_1_split -> label_data_1_split_1
I0612 14:09:30.090451 209399 net.cpp:243] Setting up label_data_1_split
I0612 14:09:30.090463 209399 net.cpp:250] TRAIN Top shape for layer 1 'label_data_1_split' 1024 (1024)
I0612 14:09:30.090471 209399 net.cpp:250] TRAIN Top shape for layer 1 'label_data_1_split' 1024 (1024)
I0612 14:09:30.090476 209399 layer_factory.hpp:136] Creating layer 'conv1' of type 'Convolution'
I0612 14:09:30.090484 209399 layer_factory.hpp:148] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT16 Bmath:FLOAT16
I0612 14:09:30.090559 209399 net.cpp:182] Created Layer conv1 (2)
I0612 14:09:30.090600 209399 net.cpp:559] conv1 <- data
I0612 14:09:30.090607 209399 net.cpp:528] conv1 -> conv1
I0612 14:09:31.191092 209399 cudnn_conv_layer.cpp:1011] [0] Conv Algos (F,BD,BF): 'conv1' with space 0.01G/1 0 0 0 (limit 12.53G, req 0G)
F0612 14:09:31.191145 209399 cudnn_conv_layer.cpp:451] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED
*** Check failure stack trace: ***
    @ 0x7f71089645cd google::LogMessage::Fail()
    @ 0x7f7108966433 google::LogMessage::SendToLog()
    @ 0x7f710896415b google::LogMessage::Flush()
    @ 0x7f7108966e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f71093c5857 caffe::CuDNNConvolutionLayer<>::Reshape()
    @ 0x7f71091a0e7c caffe::Net::Init()
    @ 0x7f71091a2b78 caffe::Net::Net()
    @ 0x7f7109501878 caffe::Solver::InitTrainNet()
    @ 0x7f7109501e7d caffe::Solver::Init()
    @ 0x7f7109502352 caffe::Solver::Solver()
    @ 0x7f710920c246 caffe::Creator_SGDSolver()
    @ 0x413cf6 caffe::SolverRegistry::CreateSolver()
    @ 0x40d514 train()
    @ 0x40a908 main
    @ 0x7f71078d4830 __libc_start_main
    @ 0x40b0c9 _start
    @ (nil) (unknown)
Aborted (core dumped)
```

Hannah-xxl commented 6 years ago

@drnikolaev So, do you have any update on this issue?

drnikolaev commented 6 years ago

I'm working on a new release; this should be fixed there.

Hannah-xxl commented 6 years ago

Great, looking forward to it. Thank you very much.

drnikolaev commented 6 years ago

@brave-hannah Please verify release https://github.com/NVIDIA/caffe/tree/v0.17.1 and reopen the issue if needed.
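A sketch of how one might check out and build that tag for verification, assuming a standard CMake build (adjust paths and flags to your setup):

```
git clone -b v0.17.1 https://github.com/NVIDIA/caffe.git nvcaffe-0.17.1
cd nvcaffe-0.17.1
mkdir build && cd build
cmake ..
make -j"$(nproc)"
```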