Hi @abdelrahman-gaber, I need to reproduce this issue in order to fix it. May I have your model files and the dataset? Also, could you try:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
When I use all FLOAT16 as you said, I get the error: Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED. So I used the pseudo-fp16 settings mentioned here: https://github.com/NVIDIA/caffe/issues/198, and then I got the errors mentioned before.
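That is, I kept the data types in FLOAT16 but left the math in FLOAT, roughly:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT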
This is the train.prototxt file I am using: https://drive.google.com/open?id=1_5YS_XY_rPKsV_uXHn6hXDSIB8yEiVo9 This is the SFD model for face detection, and I am training on the WIDER_FACE dataset. Note that I used the exact same training file and LMDB files with the code from the SSD branch of the original Caffe, and the training completed correctly.
Thank you.
@abdelrahman-gaber what particular script did you use to create your LMDB?
I used the script provided by the original SSD implementation, with a minor modification to accept text files. My script can be found here: https://drive.google.com/file/d/1HBbGD4-G2mhqmeIW8aY5CeQBxTHwTRt1/view?usp=sharing
Note that the generated LMDB files were used to train the model with the original SSD implementation and it worked fine, so I used the same LMDB files when training the model with NVIDIA Caffe and ran into these problems.
@abdelrahman-gaber sorry, I need complete step-by-step instructions to re-build the LMDB. Your script uses a label map, which I would need to re-create too, etc.
I solved this problem by re-running the same LMDB script with the NVIDIA Caffe build, which generated new LMDB files (same scripts and same data, just re-run with the new Caffe).
However, the problem when using all FLOAT16 still exists, and it only works with:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
@abdelrahman-gaber seems like we are not synced yet. :) I don't have those scripts, and the SSD source does not provide scripts for the "wider faces" set. Please send me everything you have, with short instructions. Thank you.
I am sorry for that. I have uploaded all the necessary files; they are as follows:
Here you can find all the scripts and files used to generate the LMDB, including create_data.sh, the image/ground-truth list (train.txt), and the label map file (labelmap_wider.prototxt). This folder needs to exist under the path $CAFFE_ROOT/data/:
https://drive.google.com/drive/folders/18Hp9xGPQPKDx3Vu6lyssRf07kI_K3kHy?usp=sharing
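(For reference, labelmap_wider.prototxt follows the standard SSD label-map format, with a background entry and a single face class; the exact entries below are just a sketch of that format.)
item {
  name: "none_of_the_above"
  label: 0
  display_name: "background"
}
item {
  name: "face"
  label: 1
  display_name: "face"
}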
Since the ground-truth bounding boxes need to be converted to a certain format, here are the ground-truth files used for training: https://drive.google.com/file/d/1Iw48nhHIplZvBfpTFvmR7L1IXrCGyn02/view?usp=sharing
The training images can be downloaded from the website http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/. The folders containing the training images and the ground-truth text files should both be under $HOME/data/; you can change this path in the create_data.sh script.
Please tell me if I missed any step.
Hi @abdelrahman-gaber thank you for reporting this bug. It's reproduced and fixed now. Please read the note for https://github.com/NVIDIA/caffe/pull/493 about performance implications and SSD fp16 sample models. I'd appreciate your feedback.
Thank you so much. I will run the training again by the middle of this week and let you know if I face any problems.
@abdelrahman-gaber Please also do:
layer {
  # keep the data layer in fp32 even when the net-wide default is FLOAT16
  forward_type: FLOAT
  backward_type: FLOAT
  name: "data"
  type: "AnnotatedData"
  ...
I'll fix it later
Thank you @drnikolaev. The training is working now after fixing this bug. I just have one question: for inference with test.prototxt and deploy.prototxt, should I just add these lines at the beginning of the prototxt files, or are there other modifications to be done?
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
Hi @drnikolaev The training can run now without reporting any errors, but the training process itself is not working well. I read about the gradient scaling that is necessary for training in fp16 mode; as I understand it, I should tune the parameter global_grad_scale in train.prototxt. You set it to 32 in the example provided, and I tried 256, 1000, and even 20000, but the validation accuracy is still very low (around 0.002) after 1K to 2K iterations. For comparison, when training the model in fp32 mode the validation accuracy was 0.15 after 1K iterations and 0.31 after 2K iterations, and still increasing.
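For reference, this is roughly how I am setting it at the top of train.prototxt (placement follows the sample model; the values shown are the ones I tried):
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
global_grad_scale: 32   # also tried 256, 1000, and 20000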
My resources to learn about this mixed precision training are:
https://docs.nvidia.com/deeplearning/sdk/pdf/Training-Mixed-Precision-User-Guide.pdf
http://on-demand.gputechconf.com/gtc/2017/presentation/s7218-training-with-mixed-precision-boris-ginsburg.pdf
I also came across this issue: https://github.com/NVIDIA/caffe/issues/420
I am not sure what mistake I am making; is there any other change I should make to enable training with fp16? Also, the slides mention both gradient scaling and loss scaling, but here we only set the variable global_grad_scale and nothing about the loss scale. What is the valid range of this scale? Should I also try values less than 1, e.g. 0.1 or 0.01?
I also faced another problem when trying to use the VGG pretrained model: it works well in fp32 mode but not in fp16 mode. I reported this in a new issue: https://github.com/NVIDIA/caffe/issues/499
All the files I am using for training and testing are here; in the logs folder you can find the output logs for the different configurations I tried: https://drive.google.com/drive/folders/14z_oEB1gKsOP9B5JGme-tznte8aKa77Q?usp=sharing
Thank you.
@abdelrahman-gaber sorry about the delay. Please try switching back to:
convolution_param {
  engine: CAFFE   # use the native Caffe engine instead of cuDNN for this layer
  num_output: 1024
  pad: 6
  kernel_size: 3
  weight_filler {
    type: "xavier"
  }
  bias_filler {
    type: "constant"
    value: 0.0
  }
  dilation: 6
}
@drnikolaev Thank you for your reply. I made all the modifications you mentioned, but the problem is still the same. Now I am trying to train without any pretrained model, and when I set the I/O and math types to FLOAT like this:
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT
default_backward_math: FLOAT
Only in this case does the model train well: the validation accuracy increases and the training loss decreases.
However, when I set all of them to FLOAT16, or set the first two (forward_type, backward_type) to FLOAT16 and the other two to FLOAT, the model does not learn; even after 40K iterations the validation accuracy is around 0.004 and the training loss is around 10, which indicates that something is wrong in the training process.
I also tried many values of "global_grad_scale", but it is still not working.
Here are all the files I am using to train and test the model:
https://drive.google.com/drive/folders/13VV0V2v19A_ByLQ6L7oABEHTAA544VYL?usp=sharing
I would be more than thankful if you could try this training process yourself; all the files for preparing the dataset are as mentioned in the previous comments. Also, I hope you can give an estimated time for solving this issue.
Thank you.
@abdelrahman-gaber could you verify the https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?
@drnikolaev Thanks for the update, I will test it and tell you.
@drnikolaev Thank you, it seems the training is working now: the validation accuracy is increasing and the test loss is decreasing. However, I can only train from scratch and am still not able to use the pretrained model, as mentioned here: https://github.com/NVIDIA/caffe/issues/499
I will let the training run to the end and will let you know if I notice any weird behavior.
@abdelrahman-gaber Please verify https://github.com/NVIDIA/caffe/tree/v0.17.1 release and reopen the issue if needed.
Hi,
I am training a model with caffe-0.17 and want to use fp16 support. The training runs well when I use normal float, but once I add these lines to train.prototxt:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
some errors happen after 3 iterations of training, as follows:
The error also changes when I run the training again; it can appear as: Check failed: label >= 0 (-1 vs. 0) or Check failed: label < num_classes (3 vs. 2).
When I replace FLOAT16 with FLOAT, it works fine. I am using a Tesla V100-SXM2 GPU with 16 GB of memory, CUDA 9.0, and cuDNN 7.0. I want to make sure that fp16 is supported for this configuration (this GPU and these CUDA libraries). Also, the error is not the same each time, which indicates that something is not stable. Is there any modification I should make to enable fp16 support?
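I also noticed the per-layer override suggested earlier in this thread (keeping the data layer in FLOAT even when the net-wide default is FLOAT16), something like:
layer {
  forward_type: FLOAT
  backward_type: FLOAT
  name: "data"
  type: "AnnotatedData"
  ...
}
but I am not sure whether that is still needed here.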
Thank you.