This is a question, not a bug. @AlexeyAB is there a way to incorporate the "Visually Coherent Image Mixup" augmentation strategy? Amazon recently published a paper that claims that this approach results in a massive mAP improvement for YOLOv3: https://arxiv.org/abs/1902.04103 Needless to say, I would love a chance to try this idea and I'd be happy to share my results.
[net] mixup=1 can be used in the Classifier and the Detector.
BoF (Bag of Freebies) includes 5 features (a combined cfg sketch is shown right after this list):
- synchronize BN - +0.5 mAP - requires a large number of GPUs and a long training time, since it increases mini_batch_size at the cost of training speed. Instead we can use GPU-processing + CPU-RAM for a large mini_batch_size - it also reduces training speed, but doesn't require many GPUs: https://github.com/AlexeyAB/darknet/issues/4386
- random training shapes - +1.0 mAP - is implemented as random=1: https://github.com/AlexeyAB/darknet/blob/3aa2e45ad369c72622f9458b6ebc7abb24226879/cfg/yolov3-spp.cfg#L821
- cosine lr schedule - +0.5 mAP - is implemented as SGDR (https://github.com/AlexeyAB/darknet/pull/2651):
  policy=sgdr
  #sgdr_cycle=1000 # you can set 1000 or just comment this line
  #sgdr_mult=2
- class label smoothing - +0.5 mAP - use label_smooth_eps=0.1 or 0.01 in the [yolo] or [Gaussian_yolo] layers - added in https://github.com/AlexeyAB/darknet/commit/318919e1cbb362aac6cb5c3d9388735f9ab594b6 - and [net] label_smooth_eps=0.1 for the Classifier - added in https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15
- mixup - +1.5 mAP - is implemented as [net] mixup=1 - see this thread: https://github.com/AlexeyAB/darknet/issues/3272 - or even better [net] mosaic=1: https://github.com/AlexeyAB/darknet/issues/4432 and https://github.com/AlexeyAB/darknet/issues/4264
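Putting the Detector-side cfg options from this list together, a minimal sketch could look like the following (only the parameters mentioned above, not a tuned configuration):

  [net]
  mixup=1              # or mosaic=1, see the links above
  policy=sgdr
  #sgdr_cycle=1000     # you can set 1000 or just comment this line
  #sgdr_mult=2

  [yolo]
  random=1             # random training shapes
  label_smooth_eps=0.1 # class label smoothing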
@dkashkin Hi, it looks like a good and universal solution.
Mixup should also be used for the darknet53.conv pre-trained weights:
@AlexeyAB yes this sounds great. I like the simplicity of this idea - it's just an additional augmentation strategy that can be implemented in a few lines of code. The question is - can it be easily supported in Darknet training? Ideally, this should be an optional line in the config file...
@dkashkin Yes, maybe only 2 lines are required in the cfg-file:
mixup=1
freebies_alpha_beta=1.5
for B(1.5, 1.5) and a weighted loss. But it can be implemented in just a few lines only in Python - I am trying to understand what they do for mixup (bag of freebies), so as not to miss important details and points: https://arxiv.org/pdf/1710.09412v2.pdf
Here is numpy.random.beta(): https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html
Beta distribution: https://www.astroml.org/book_figures/chapter3/fig_beta_distribution.html And wiki: https://en.wikipedia.org/wiki/Beta_distribution
A Beta distribution in C would be something like this - it's not even the most complex implementation :) https://github.com/numpy/numpy/issues/688
Beta distribution in C++: https://gist.github.com/sftrabbit/5068941 and SO: https://stackoverflow.com/questions/15165202/random-number-generator-with-beta-distribution
#include <iostream>
#include <sstream>
#include <string>
#include <random>
namespace sftrabbit {
template <typename RealType = double>
class beta_distribution
{
public:
typedef RealType result_type;
class param_type
{
public:
typedef beta_distribution distribution_type;
explicit param_type(RealType a = 2.0, RealType b = 2.0)
: a_param(a), b_param(b) { }
RealType a() const { return a_param; }
RealType b() const { return b_param; }
bool operator==(const param_type& other) const
{
return (a_param == other.a_param &&
b_param == other.b_param);
}
bool operator!=(const param_type& other) const
{
return !(*this == other);
}
private:
RealType a_param, b_param;
};
explicit beta_distribution(RealType a = 2.0, RealType b = 2.0)
: a_gamma(a), b_gamma(b) { }
explicit beta_distribution(const param_type& param)
: a_gamma(param.a()), b_gamma(param.b()) { }
void reset() { }
param_type param() const
{
return param_type(a(), b());
}
void param(const param_type& param)
{
a_gamma = gamma_dist_type(param.a());
b_gamma = gamma_dist_type(param.b());
}
template <typename URNG>
result_type operator()(URNG& engine)
{
return generate(engine, a_gamma, b_gamma);
}
template <typename URNG>
result_type operator()(URNG& engine, const param_type& param)
{
gamma_dist_type a_param_gamma(param.a()),
b_param_gamma(param.b());
return generate(engine, a_param_gamma, b_param_gamma);
}
result_type min() const { return 0.0; }
result_type max() const { return 1.0; }
RealType a() const { return a_gamma.alpha(); }
RealType b() const { return b_gamma.alpha(); }
bool operator==(const beta_distribution<result_type>& other) const
{
return (param() == other.param() &&
a_gamma == other.a_gamma &&
b_gamma == other.b_gamma);
}
bool operator!=(const beta_distribution<result_type>& other) const
{
return !(*this == other);
}
private:
typedef std::gamma_distribution<result_type> gamma_dist_type;
gamma_dist_type a_gamma, b_gamma;
template <typename URNG>
result_type generate(URNG& engine,
gamma_dist_type& x_gamma,
gamma_dist_type& y_gamma)
{
result_type x = x_gamma(engine);
return x / (x + y_gamma(engine));
}
};
template <typename CharT, typename RealType>
std::basic_ostream<CharT>& operator<<(std::basic_ostream<CharT>& os,
const beta_distribution<RealType>& beta)
{
os << "~Beta(" << beta.a() << "," << beta.b() << ")";
return os;
}
template <typename CharT, typename RealType>
std::basic_istream<CharT>& operator>>(std::basic_istream<CharT>& is,
beta_distribution<RealType>& beta)
{
std::string str;
RealType a, b;
if (std::getline(is, str, '(') && str == "~Beta" &&
is >> a && is.get() == ',' && is >> b && is.get() == ')') {
beta = beta_distribution<RealType>(a, b);
} else {
is.setstate(std::ios::failbit);
}
return is;
}
}
void data_augmentation(...) {
    std::random_device rd;
    std::mt19937 gen(rd());
    // beta_val1 = 1.5, beta_val2 = 1.5 for B(1.5, 1.5)
    sftrabbit::beta_distribution<double> beta_distr_obj(beta_val1, beta_val2);
    double beta_distribution = beta_distr_obj(gen);
    float alpha_blend = beta_distribution;
    float beta_blend = 1 - beta_distribution;
    cv::addWeighted(src1, alpha_blend, src2, beta_blend, 0.0, dst); // mixup images
    fuse_labels(src_label1, alpha_blend, src_label2, beta_blend, new_label); // mixup labels
}
The implementation of mixup training is straightforward, and introduces a minimal computation overhead. Figure 1a shows the few lines of code necessary to implement mixup training in PyTorch. Finally, we mention alternative design choices. First, in preliminary experiments we find that convex combinations of three or more examples with weights sampled from a Dirichlet distribution does not provide further gain, but increases the computation cost of mixup. Second, our current implementation uses a single data loader to obtain one minibatch, and then mixup is applied to the same minibatch after random shuffling. We found this strategy works equally well, while reducing I/O requirements. Third, interpolating only between inputs with equal label did not lead to the performance gains of mixup discussed in the sequel. More empirical comparison can be found in Section 3.8.
What is mixup doing? The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model f to behave linearly in-between training examples. We argue that this linear behaviour reduces the amount of undesirable oscillations when predicting outside the training examples. Also, linearity is a good inductive bias from the perspective of Occam’s razor, since it is one of the simplest possible behaviors. Figure 1b shows that mixup leads to decision boundaries that transition linearly from class to class, providing a smoother estimate of uncertainty. Figure 2 illustrate the average behaviors of two neural network models trained on the CIFAR-10 dataset using ERM and mixup. Both models have the same architecture, are trained with the same procedure, and are evaluated at the same points in-between randomly sampled training data. The model trained with mixup is more stable in terms of model predictions and gradient norms in-between training samples.
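As far as I understand, the core of the method is just this interpolation, where alpha_blend and beta_blend in the sketch above play the role of lambda and (1 - lambda):

  x_mix = lambda * x_i + (1 - lambda) * x_j
  y_mix = lambda * y_i + (1 - lambda) * y_j,   with lambda ~ Beta(alpha, alpha), alpha > 0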
Also, what they do for LSR (class label smoothing) is described very academically here: https://arxiv.org/pdf/1512.00567v3.pdf
@AlexeyAB sorry for the delay! I missed your reply. It might be more reliable to discuss this over email (kashkin at gmail). I agree with you - some papers describe mixup via heavy math, resulting in unnecessary complexity. I like the following visual explanation much better: Assuming that this infographic captures all the important concepts, this augmentation should be easy to implement. If darknet uses OpenCV, the alpha blending can be done by calling cv2.addWeighted...
@dkashkin Yes, maybe we can try to implement mixup with a fixed alpha_blending = 1 - alpha_blending = 0.5
and without weighted loss. It will be much simpler:
@AlexeyAB I think this would be a great starting point! P.S. I have one YOLO-specific idea that might also be interesting to test. Since YOLO training datasets usually include a lot of unlabeled images, we could restrict the mixup algorithm to blending each "labeled" image with a randomly selected "unlabeled" image. This approach still adds noise to the training images without creating any new overlapping bounding boxes. My assumption is that the classic mixup strategy can cause some classification problems. For example, if you blend a cat photo with a dog photo, you might end up with one training image that has two identical bounding boxes with different labels (cat and dog). I am afraid such images can make it harder for the neural network to differentiate cats and dogs. I would not be surprised if a mixup strategy that avoids such overlaps can outperform the original mixup algorithm from the whitepaper...
@dkashkin Hi,
I added Mixup data augmentation.
You should add just 1 parameter mixup=1 in the [net] section, and Mixup with alpha=beta=0.5 will be applied to 50% of the images, without weighted loss.
If you want to see the result of data augmentation, use the flag -show_imgs in the training command:
Thanks @AlexeyAB ! I'm running a test now, I'll report back on the results shortly. Question: Presumably it would be a bad idea to set this option with the LSTM models? As I imagine this would mess up the frame to frame continuity which the model depends on?
Running the latest repo with the mixup option is causing the process to be "killed":
This happened twice; the last time was on iteration 345.
2080Ti, compiled with OpenCV, mixed precision.
my_stuff/train.sh: line 1: 13095 Killed ./darknet detector train my_stuff/obj.data my_stuff/yolov3-tiny_3l.cfg my_stuff/yolov3-tiny.conv.15 -dont_show -mjpeg_port 8090 -map -letter_box
train.sh ./darknet detector train my_stuff/obj.data my_stuff/yolov3-tiny_3l.cfg my_stuff/yolov3-tiny.conv.15 -dont_show -mjpeg_port 8090 -map -letter_box
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=4
width=544
height=544
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
mixup=1
learning_rate=0.001
burn_in=1000
max_batches = 16000
policy=steps
steps=12800,14400
scales=.1,.1
An identical model that I trained yesterday was fine; the only difference is that yesterday's model used the repo as it existed yesterday and didn't have the mixup flag set.
I just tried running the same again without -letter_box and it crashed in the same way. Running again now without mixup=1 and it seems to be going fine so far.
@LukeAI Hi, can you share your cfg-file and dataset if it isn't private?
How much CPU-RAM do you have?
Question: Presumably it would be a bad idea to set this option with the LSTM models? As I imagine this would mess up the frame to frame continuity which the model depends on?
In general, I think it should work, since two sequences will be mixed up (not just random images).
The dataset is a subset of Google OpenImages yolov3-tiny_3l.cfg.txt train.txt
@LukeAI
Running the latest repo with the mixup option is causing the process to be "killed":
I fixed it.
Hey all - just to give some feedback, I found that mixup slightly hurt my AP in all classes when using the above cfg and dataset at the final validation. (I didn't try older weights, but the chart.png looked fairly flat.)
For some reason, mixup led to a very similar validation accuracy but a higher, noisier loss.
Just to share my results from the KITTI dataset trained on yolov3_tiny_3l: the curve with the higher loss and lower validation is the one with mixup=1.
@LukeAI
As I see, there are only training (labeled) and testing (un-labeled) folders in data_object_image_2.zip.
Did you use the Training dataset for validation?
What script did you use to convert labels to Yolo format?
Did you try to train the LSTM-model on KITTI tracking? http://www.cvlibs.net/datasets/kitti/eval_mots.php
I randomly split the labelled KITTI OBJECT2D dataset into 85% training, 15% testing.
I wrote my own script to do the conversions: kitti2yolo.py.txt
@LukeAI So it seems that Mixup doesn't increase mAP in most cases, or it requires more iterations. Or mixup should be used for training the Classification model that will then be used as pre-trained weights for training the Detector.
yolov3-tiny_3l.cfg.txt This is the config I was using; I don't know if it's relevant, but I was never able to correctly set the anchors as described: https://github.com/AlexeyAB/darknet/issues/3372#issuecomment-500391029
Maybe - I notice in the paper that gains were greater for deeper models - maybe yolov3-tiny is too small to be able to extract the latent info from the mixups without just being confused by them.
@LukeAI
maybe yolov3-tiny is too small to be able to extract the latent info from the mixups without just being confused by them.
Yes, so maybe it will increase mAP for yolo_v3_spp_pan_scale.cfg.txt or yolo_v3_spp_pan.cfg.txt
Yes, so maybe it will increase mAP for yolo_v3_spp_pan_scale.cfg.txt or yolo_v3_spp_pan.cfg.txt
If I get the GPU time to try it, I will do so and report back here.
@AlexeyAB
@WongKinYiu Can you show the paper?
Thanks for sharing.
@WongKinYiu I added MixUp and CutMix for Classifier training: https://github.com/AlexeyAB/darknet/issues/4419
@AlexeyAB Great!
I need about 2~3 weeks to train a classifier.
@WongKinYiu
I added all 5 features from the BoF (Bag of Freebies). Have you tested them?
Also there is a new implementation of mosaic=1
for the Detector that should be better: https://github.com/AlexeyAB/darknet/issues/4264#issuecomment-562934711
@AlexeyAB Hello, I'm still on holiday.
Could you please give an example of the [net] and [yolo] layers for the suggested data aug and ciou norm hyper-parameters? (I'm training a model with [net] mosaic=1 and [yolo] ciou_loss, iou_n=0.07, uc_n=0.07.) Thanks.
By the way, cutmix currently performs much better than mixup for training a classifier.
@WongKinYiu
[net]
label_smooth_eps=0.1 # for training classifier
mosaic=1 # for Detector
learning_rate=0.112
momentum=0.949
policy=sgdr
sgdr_cycle=1000 # set the same as max_batches=
sgdr_mult=2
[yolo]
nms_kind = diounms
beta_nms = 0.6
scale_x_y = 1.05
label_smooth_eps=0.1
iou_thresh=0.213
iou_normalizer = 0.1
uc_normalizer = 0.1
iou_loss=ciou
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
@AlexeyAB thanks!
@AlexeyAB
For training the classifier, the loss becomes very large when setting label_smooth_eps=0.1: 7xxx.x avg. Is it normal? (Without label_smooth_eps=0.1 it is 6.x avg.)
And do you mean I should change the 1000 in sgdr_cycle=1000 # set the same as max_batches= to the same value as max_batches, for example sgdr_cycle=800000?
@WongKinYiu
For training the classifier, the loss becomes very large when setting label_smooth_eps=0.1: 7xxx.x avg. Is it normal? (Without label_smooth_eps=0.1 it is 6.x avg.)
I don't know.
Theoretically Loss should be the same, since (label_smooth_eps / (classes - 1)) * (classes - 1) + (1 - label_smooth_eps) = 1
in https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15#diff-2ceac7e68fdac00b370188285ab286f7R526
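For example, with classes = 80 and label_smooth_eps = 0.1, the true class gets 0.9 and each of the other 79 classes gets 0.1/79 ≈ 0.00127, so the targets still sum to 79 * (0.1/79) + 0.9 = 1.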
The only thing I did not fully understand from the label-smoothing paper is whether we just have to set the probabilities (1 - label_smooth_eps) for the truth class and (label_smooth_eps / (classes - 1)) for the other classes, as I did here: https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15#diff-2ceac7e68fdac00b370188285ab286f7R526
or whether we should use the thresholds (1 - label_smooth_eps) for the truth class and (label_smooth_eps / (classes - 1)) for the other classes, as there: https://github.com/AlexeyAB/darknet/blob/7ae1ae5641b549ebaa5c816701c4b9ca73247a65/src/blas_kernels.cu#L781-L790
if(truth && p < (1 - label_smooth_eps)) delta = 1 - p;
else if (!truth && p < (label_smooth_eps / (classes - 1))) delta = 0 - p;
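Just to make the first alternative concrete, here is a minimal sketch (my own illustration, not the darknet code), assuming a one-hot truth[] vector of 0s and 1s:

  // Alternative 1: replace the hard one-hot targets by smoothed probabilities.
  // The usual delta[i] = truth[i] - p[i] is then computed from the smoothed targets.
  void smooth_labels(float *truth, int classes, float label_smooth_eps)
  {
      for (int i = 0; i < classes; ++i) {
          truth[i] = truth[i] ? (1.0f - label_smooth_eps)
                              : (label_smooth_eps / (classes - 1));
      }
  }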
And do you mean I should change 1000 to the same value as max_batches, for example sgdr_cycle=800000?
Yes, so it will have only 1 cycle; therefore, you don't have to calculate max_batches so that the training ends exactly at the end of one of the cycles.
Also I fixed it so that sgdr_cycle=max_batches by default, if you don't specify sgdr_cycle= in the cfg:
https://github.com/AlexeyAB/darknet/commit/764872a190d83b6f149220c6ec8aa0df1d2d5e49#diff-bfbbcdf73459e9ea8fb4afa8455ce74dR909
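So with your numbers, a single-cycle SGDR schedule could simply look like this (using max_batches=800000 from your example):

  [net]
  policy=sgdr
  max_batches=800000
  sgdr_cycle=800000   # one cosine cycle over the whole training run
                      # (or just omit sgdr_cycle= and it now defaults to max_batches)
  #sgdr_mult=2        # cycle-length multiplier; with a single cycle it should have no effect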
@AlexeyAB
Thanks, I'll check the training stats tomorrow and get back to you.
@AlexeyAB
Although the loss continuously increases to 9xxx, the behavior seems normal.
@WongKinYiu
the behavior seems normal.
Do you mean that accuracy is increasing?
@AlexeyAB yes.
@WongKinYiu @AlexeyAB FYI I found two different implementations of label smoothing in the TensorFlow code:
(1) y_true * (1.0 - label_smoothing) + (label_smoothing / num_classes)
(2) y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing
where I assume label_smoothing=0.1 is a typical smoothing value.
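For example, with label_smoothing=0.1 and num_classes=80 the two formulas give:

  (1) true class: 0.9 + 0.1/80 = 0.90125,  other classes: 0.1/80 = 0.00125
  (2) true class: 0.9 + 0.05 = 0.95,       other classes: 0.05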
@glenn-jocher @WongKinYiu
Yes, previously we used (1) in both the Classifier and the Detector.
Now we use (1) for the Classifier (Softmax) and (2) for the Detector (Logistic).
@AlexeyAB ah perfect then!
I get NaN when setting label_smooth_eps=0.1 in [yolo] layers.
Also, the new cfg-file csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt seems unable to converge.
Does anyone know how to use CmBN in cspdarknet53?
And another question about label smoothing: should I add label_smooth_eps in both [net] and all [yolo] layers?
class label smoothing - +0.5 mAP - use label_smooth_eps=0.1 or 0.01 in the [yolo] or [Gaussian_yolo] layers - is added: 318919e and [net] label_smooth_eps=0.1 for Classifier - is added: 2a873f3
Does anyone knows how to use CmBN in cspdarknet53?
Change all batch_normalization=1 to batch_normalization=2.
Change all batch_normalization=1 to batch_normalization=2.
Thanks for the reply. I did this using yolov4-custom.cfg,
but as shown in the picture, the IoU and GIoU are 0, and Class, Obj, No Obj are 0 as well.