Tongcheng / DN_CaffeScript


could you share your script for producing these protos and results of each model? #1

Open jiangxuehan opened 7 years ago

Tongcheng commented 7 years ago

@jiangxuehan Hi! Because my implementation can specify an entire DenseBlock (tens of transitions) as a single layer, the prototxt files were written by hand, and there is no script for generating them. Each prototxt has only about 10 layers, so reproducing them manually should be doable. I will update the results of each model soon.

jiangxuehan commented 7 years ago

@Tongcheng I have run the models in this repo; for k=12 and L=100 the accuracy on CIFAR-10+ is 94.8%, but it should be 95.5% according to the paper. Looking forward to your results.

Tongcheng commented 7 years ago

@jiangxuehan Thanks for pointing this out! I currently get the same result, about 0.8% lower than the Torch counterpart. This is a known issue: https://github.com/liuzhuang13/DenseNet/issues/10 . Based on that issue I made one fix in my Caffe: use the cuDNN version of BatchNormalization (Torch sets the EMA smoothing factor to 0.1, while stock Caffe estimates the running statistics in an entirely different way). This makes the accuracy curve look similar to the one in the paper (Figure 4, right), but it does not seem to close the final accuracy gap. I am still investigating the cause, and this is one reason the per-model results have not been updated yet.
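
For anyone unfamiliar with the two schemes, here is a minimal, self-contained C++ sketch (not code from this repo; the loop and values are purely illustrative) contrasting a Torch/cuDNN-style momentum update with stock Caffe's decayed-sum-plus-counter scheme for the running mean:

#include <cstdio>

int main() {
  double running_torch = 0.0;  // Torch/cuDNN style: fixed smoothing factor
  double running_caffe = 0.0;  // stock Caffe style: decayed sum of batch means
  double caffe_scale   = 0.0;  // stock Caffe style: decayed sample counter
  const double momentum = 0.1;    // Torch's default BN smoothing factor
  const double fraction = 0.999;  // Caffe's default moving_average_fraction

  for (int iter = 0; iter < 2000; ++iter) {
    const double batch_mean = 1.0;  // pretend every mini-batch has mean 1.0

    // Torch / cuDNN: running = (1 - momentum) * running + momentum * batch
    running_torch = (1.0 - momentum) * running_torch + momentum * batch_mean;

    // Stock Caffe BatchNorm: accumulate a decayed sum and a decayed count,
    // then normalize at use time.
    running_caffe = fraction * running_caffe + batch_mean;
    caffe_scale   = fraction * caffe_scale + 1.0;
  }
  printf("torch-style running mean: %f\n", running_torch);
  printf("caffe-style running mean: %f\n", running_caffe / caffe_scale);
  return 0;
}

Both estimates converge to the same value on a stationary stream; the difference is how quickly each one tracks the statistics while the network is still changing, which is what reshapes the accuracy curve.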

Tongcheng commented 7 years ago

@jiangxuehan It turns out Caffe's DataLayer was feeding the data without permutation. I have now added a flag to permute the data, which brings the accuracy up to 95.2%.

jiangxuehan commented 7 years ago

@Tongcheng Thanks for your reply. Using ImageDataLayer with the shuffle option gives the same result (95.2%) as your modified DataLayer. Do you think there are other differences between Torch and Caffe that could affect model performance?

Tongcheng commented 7 years ago

@jiangxuehan I have no definitive explanation yet for the remaining 0.3% gap, but there are several hypotheses:

(1) Sources of randomness: besides the different random seeds, another source is the choice of convolution algorithm. The Torch version uses deterministic convolution algorithms, which corresponds to (1,1,1) in cuDNN (see the sketch after this list); however, when I keep my random seed and use deterministic convolutions, the result is somewhat lower (95.1%).

(2) We made a space-time tradeoff for memory efficiency: in the backward phase of the BC networks I have to rerun the 1x1 convolution forward to recompute/overwrite the buffer holding the intermediate channels. That extra convolution, combined with its BN forward, might introduce some numerical instability and make the overall performance slightly worse; it is unavoidable if we want the space savings.

There may also be subtleties I have not spotted yet; I would welcome any constructive idea that might work. Thanks!
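
One reading of the "(1,1,1)" triple, and it is only an assumed interpretation here, is the forward / backward-data / backward-filter algorithm indices in cuDNN. Under that assumption, the corresponding enum values would be the following (an illustrative sketch, not code from this repo):

#include <cudnn.h>

// Algorithm index 1 for each of the three convolution passes. The two
// backward ALGO_1 variants are the deterministic choices; the ALGO_0
// variants can be non-deterministic because they use atomic additions.
const cudnnConvolutionFwdAlgo_t       fwd_algo        = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;  // index 1
const cudnnConvolutionBwdDataAlgo_t   bwd_data_algo   = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
const cudnnConvolutionBwdFilterAlgo_t bwd_filter_algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;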

Tongcheng commented 7 years ago

@jiangxuehan Also, I think my DataLayer with the random option should be more efficient than the default ImageDataLayer implementation: ImageDataLayer shuffles a vector of Datum, which are fairly large objects, whereas I shuffle only the indices into the Datum vector, which are much smaller. So my implementation should save some time.
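
A minimal sketch of the idea (illustrative only, not the repo's actual DataLayer code, and Datum here is a stand-in struct): shuffle a vector of integer indices instead of the heavyweight records themselves, then read records through the permutation.

#include <algorithm>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Stand-in for caffe::Datum: a record large enough that swapping whole
// objects during a shuffle would be wasteful.
struct Datum { std::string encoded_image; int label; };

int main() {
  std::vector<Datum> records(50000);          // e.g. CIFAR-10 training set
  std::vector<size_t> order(records.size());
  std::iota(order.begin(), order.end(), 0);   // 0, 1, 2, ...

  std::mt19937 rng(1234);                         // any fixed seed
  std::shuffle(order.begin(), order.end(), rng);  // permute cheap indices only

  // Feed the data by following the permutation.
  for (size_t i = 0; i < order.size(); ++i) {
    const Datum& d = records[order[i]];
    (void)d;  // hand d to the prefetch/transform pipeline here
  }
  return 0;
}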

John1231983 commented 7 years ago

Hi, I found your explanation of the difference between the Caffe and Torch implementations of BN. I guess you modified Caffe's BatchNorm layer to make it behave like Torch's. How much does the modification change performance? I would also like to use it in my Caffe version, so can I just copy your batch_norm_layer.cu (and the header file) over my current ones?

Finally, BN is usually followed by a Scale layer, but I do not see a Scale layer after BN in your prototxt, e.g. https://github.com/Tongcheng/DN_CaffeScript/blob/master/train_test_BCBN_C10plus.prototxt. Or did you already integrate the two?

layer {
  name: "BatchNorm1"
  type: "BatchNorm"
  bottom: "DenseBlock1"
  top: "BatchNorm1"
  batch_norm_param {
    moving_average_fraction : 0.1
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }  
}
Tongcheng commented 7 years ago

Hi @John1231983, the Torch version uses the cuDNN BatchNormalization, which already applies the scale and bias inside the same call, so in my modified Caffe BatchNorm there is no need to put an additional ScaleLayer after BatchNorm. The main difference from stock Caffe is the EMA smoothing factor, which is why the two BatchNorm versions produce differently shaped training curves.
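
To make that concrete, here is a hedged C++ sketch (not the repo's actual batch_norm_layer.cu; the wrapper function and names are invented for illustration) of the single cuDNN training-mode call: the learned scale and bias are arguments of the call itself, and the exponential-average factor is the 0.1 smoothing factor discussed above, so no separate ScaleLayer is required.

#include <cudnn.h>

// All descriptors and device pointers are assumed to be set up by the caller.
cudnnStatus_t fused_bn_scale_forward(cudnnHandle_t handle,
                                     cudnnTensorDescriptor_t io_desc,
                                     const float* bottom, float* top,
                                     cudnnTensorDescriptor_t param_desc,
                                     const float* gamma,   // learned scale
                                     const float* beta,    // learned bias
                                     float* running_mean,
                                     float* running_var,
                                     float* save_mean,
                                     float* save_inv_var) {
  const float one = 1.0f, zero = 0.0f;
  return cudnnBatchNormalizationForwardTraining(
      handle, CUDNN_BATCHNORM_SPATIAL,
      &one, &zero,              // blending factors
      io_desc, bottom,          // input activations
      io_desc, top,             // output: already normalized, scaled, shifted
      param_desc, gamma, beta,  // gamma/beta are handled inside this call
      0.1,                      // EMA smoothing factor, matching Torch
      running_mean, running_var,
      1e-5,                     // epsilon (>= CUDNN_BN_MIN_EPSILON)
      save_mean, save_inv_var);
}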

John1231983 commented 7 years ago

Thanks for pointing this out. Could you tell me which files you changed for the BatchNorm layer? I would like to test it in my Caffe version by swapping in those files. I checked your batch_norm_layer.cu, but it looks the same as the current batch_norm_layer.cu in Caffe, apart from some different log prints.