Hi @abdelrahman-gaber, I need to reproduce this issue in order to fix it. May I have your model files and the dataset? Also, could you try:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
When I use all FLOAT16 as you said, I get the error: Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED. So I used the pseudo-fp16 settings mentioned here: https://github.com/NVIDIA/caffe/issues/198, and then I got the errors mentioned before.
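That is, I kept the data types in FLOAT16 but left the math in FLOAT, roughly:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT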
This is the train.prototxt file I am using: https://drive.google.com/open?id=1_5YS_XY_rPKsV_uXHn6hXDSIB8yEiVo9 This is the SFD model for face detection, and I am training on the WIDER_FACE dataset. Note that I used the exact same training file and LMDB files with the code from the SSD branch of the original Caffe, and the training completed correctly.
Thank you.
@abdelrahman-gaber what particular script did you use to create your LMDB?
I used the script provided by the original SSD implementation, with a minor modification to accept text files. My script can be found here: https://drive.google.com/file/d/1HBbGD4-G2mhqmeIW8aY5CeQBxTHwTRt1/view?usp=sharing
Note that the generated LMDB files were used to train the model with the original SSD implementation and it worked fine, so I used the same LMDB files when training the model with NVIDIA Caffe and ran into these problems.
@abdelrahman-gaber sorry, I need complete step-by-step instructions to re-build the LMDB. Your script uses a label map, which I would need to re-create too, etc.
I solved this problem by re-running the same LMDB script with the NVIDIA Caffe build, which generated new LMDB files (same scripts and same data, just re-run with the new Caffe).
However, the problem when using all FLOAT16 still exists, and it only works with:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
@abdelrahman-gaber seems like we are not synced yet. :) I don't have those scripts, and the SSD source does not provide scripts for the "wider faces" set. Please send me everything you have, with short instructions. Thank you.
I am sorry for that. I have uploaded all the necessary files; they are as follows:
Here you can find all the scripts and files used to generate the LMDB, including create_data.sh, the image/ground-truth list (train.txt), and the label map file (labelmap_wider.prototxt). This folder needs to exist under the path $CAFFE_ROOT/data/:
https://drive.google.com/drive/folders/18Hp9xGPQPKDx3Vu6lyssRf07kI_K3kHy?usp=sharing
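(For reference, labelmap_wider.prototxt follows the standard SSD label-map format, with a background entry and a single face class; the exact entries below are just a sketch of that format.)
item {
  name: "none_of_the_above"
  label: 0
  display_name: "background"
}
item {
  name: "face"
  label: 1
  display_name: "face"
}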
Since the ground-truth bounding boxes need to be converted to a certain format, here are the ground-truth files used for training: https://drive.google.com/file/d/1Iw48nhHIplZvBfpTFvmR7L1IXrCGyn02/view?usp=sharing
The training images can be downloaded from the website http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/. The folders containing the training images and the ground-truth text files should both be under $HOME/data/; you can change this path in the create_data.sh script.
Please tell me if I missed any step.
Hi @abdelrahman-gaber thank you for reporting this bug. It's reproduced and fixed now. Please read the note for https://github.com/NVIDIA/caffe/pull/493 about performance implications and SSD fp16 sample models. I'd appreciate your feedback.
Thank you so much. I will run the training again by the middle of this week and let you know if I face any problems.
@abdelrahman-gaber Please also do:
layer {
  # keep the data layer in fp32 even when the net-wide default is FLOAT16
  forward_type: FLOAT
  backward_type: FLOAT
  name: "data"
  type: "AnnotatedData"
  ...
I'll fix it later
Thank you @drnikolaev. The training is working now after fixing this bug. I just have one question: for inference with test.prototxt and deploy.prototxt, should I just add these lines at the beginning of the prototxt files, or are there other modifications to be done?
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
Hi @drnikolaev The training can run now without reporting any errors, but the training process itself is not working well. I read about the gradient scaling that is necessary for training in fp16 mode; as I understand it, I should tune the parameter global_grad_scale in train.prototxt. You set it to 32 in the example provided, and I tried 256, 1000, and even 20000, but the validation accuracy is still very low (around 0.002) after 1K to 2K iterations. For comparison, when training the model in fp32 mode the validation accuracy was 0.15 after 1K iterations and 0.31 after 2K iterations, and still increasing.
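For reference, this is roughly how I am setting it at the top of train.prototxt (placement follows the sample model; the values shown are the ones I tried):
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT16
default_backward_math: FLOAT16
global_grad_scale: 32   # also tried 256, 1000, and 20000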
My resources to learn about this mixed precision training are:
https://docs.nvidia.com/deeplearning/sdk/pdf/Training-Mixed-Precision-User-Guide.pdf
http://on-demand.gputechconf.com/gtc/2017/presentation/s7218-training-with-mixed-precision-boris-ginsburg.pdf
I also came across this issue: https://github.com/NVIDIA/caffe/issues/420
I am not sure what mistake I am making; is there any other change I should make to enable training with fp16? Also, the slides mention both gradient scaling and loss scaling, but here we only set the variable global_grad_scale and nothing about the loss scale. What is the valid range of this scale? Should I also try values less than 1, e.g. 0.1 or 0.01?
I also faced another problem when trying to use the VGG pretrained model: it works well in fp32 mode but not in fp16 mode. I reported this in a new issue: https://github.com/NVIDIA/caffe/issues/499
All the files I am using for training and testing are here; in the logs folder you can find the output logs for the different configurations I tried: https://drive.google.com/drive/folders/14z_oEB1gKsOP9B5JGme-tznte8aKa77Q?usp=sharing
Thank you.
@abdelrahman-gaber sorry about the delay. Please try switching back to:
convolution_param {
  engine: CAFFE   # use the native Caffe engine instead of cuDNN for this layer
  num_output: 1024
  pad: 6
  kernel_size: 3
  weight_filler {
    type: "xavier"
  }
  bias_filler {
    type: "constant"
    value: 0.0
  }
  dilation: 6
}
@drnikolaev Thank you for your reply. I made all the modifications you mentioned, but the problem is still the same. Now I am trying to train without any pretrained model, and when I set the I/O and math types to FLOAT like this:
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT
default_backward_math: FLOAT
Only in this case does the model train well: the validation accuracy increases and the training loss decreases.
However, when I set all of them to FLOAT16, or set the first two (forward_type, backward_type) to FLOAT16 and the other two to FLOAT, the model does not learn; even after 40K iterations the validation accuracy is around 0.004 and the training loss is around 10, which indicates that something is wrong in the training process.
I also tried many values of "global_grad_scale", but it is still not working.
Here are all the files I am using to train and test the model:
https://drive.google.com/drive/folders/13VV0V2v19A_ByLQ6L7oABEHTAA544VYL?usp=sharing
I would be more than thankful if you could try this training process yourself; all the files for preparing the dataset are as mentioned in the previous comments. Also, I hope you can give an estimated time for solving this issue.
Thank you.
@abdelrahman-gaber could you verify the https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?
@drnikolaev Thanks for the update, I will test it and tell you.
@drnikolaev Thank you, it seems the training is working now: the validation accuracy is increasing and the test loss is decreasing. However, I can only train from scratch and am still not able to use the pretrained model, as mentioned here: https://github.com/NVIDIA/caffe/issues/499
I will let the training run to the end and will let you know if I notice any weird behavior.
@abdelrahman-gaber Please verify https://github.com/NVIDIA/caffe/tree/v0.17.1 release and reopen the issue if needed.
Hi,
I am training a model with caffe-0.17 and want to use fp16 support. The training runs well when I use normal float, but once I add these lines to train.prototxt:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
some errors happen after 3 iterations of training, as follows:
The error also changes when I run the training again; it can appear as: Check failed: label >= 0 (-1 vs. 0) or Check failed: label < num_classes (3 vs. 2).
When I replace FLOAT16 with FLOAT, it works fine. I am using a Tesla V100-SXM2 GPU with 16 GB of memory, CUDA 9.0, and cuDNN 7.0. I want to make sure that fp16 is supported for this configuration (this GPU and these CUDA libraries). Also, the error is not the same each time, which indicates that something is not stable. Is there any modification I should make to enable fp16 support?
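I also noticed the per-layer override suggested earlier in this thread (keeping the data layer in FLOAT even when the net-wide default is FLOAT16), something like:
layer {
  forward_type: FLOAT
  backward_type: FLOAT
  name: "data"
  type: "AnnotatedData"
  ...
}
but I am not sure whether that is still needed here.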
Thank you.