This is a question, not a bug. @AlexeyAB is there a way to incorporate the "Visually Coherent Image Mixup" augmentation strategy? Amazon recently published a paper that claims that this approach results in a massive mAP improvement for YOLOv3: https://arxiv.org/abs/1902.04103 Needless to say, I would love a chance to try this idea and I'd be happy to share my results.
[net] mixup=1 can be used in the Classifier and the Detector.
BoF (Bag of Freebies) includes 5 features (a combined cfg sketch is shown right after this list):
- synchronize BN - +0.5 mAP - requires a large number of GPUs and a long training time, since it increases mini_batch_size at the cost of training speed. Instead we can use GPU-processing + CPU-RAM for a large mini_batch_size - it also reduces training speed, but doesn't require many GPUs: https://github.com/AlexeyAB/darknet/issues/4386
- random training shapes - +1.0 mAP - is implemented as random=1: https://github.com/AlexeyAB/darknet/blob/3aa2e45ad369c72622f9458b6ebc7abb24226879/cfg/yolov3-spp.cfg#L821
- cosine lr schedule - +0.5 mAP - is implemented as SGDR (https://github.com/AlexeyAB/darknet/pull/2651):
  policy=sgdr
  #sgdr_cycle=1000 # you can set 1000 or just comment this line
  #sgdr_mult=2
- class label smoothing - +0.5 mAP - use label_smooth_eps=0.1 or 0.01 in the [yolo] or [Gaussian_yolo] layers - added in https://github.com/AlexeyAB/darknet/commit/318919e1cbb362aac6cb5c3d9388735f9ab594b6 - and [net] label_smooth_eps=0.1 for the Classifier - added in https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15
- mixup - +1.5 mAP - is implemented as [net] mixup=1 - see this thread: https://github.com/AlexeyAB/darknet/issues/3272 - or even better [net] mosaic=1: https://github.com/AlexeyAB/darknet/issues/4432 and https://github.com/AlexeyAB/darknet/issues/4264
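Putting the Detector-side cfg options from this list together, a minimal sketch could look like the following (only the parameters mentioned above, not a tuned configuration):

  [net]
  mixup=1              # or mosaic=1, see the links above
  policy=sgdr
  #sgdr_cycle=1000     # you can set 1000 or just comment this line
  #sgdr_mult=2

  [yolo]
  random=1             # random training shapes
  label_smooth_eps=0.1 # class label smoothing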
@dkashkin Hi, it looks like a good and universal solution.
Mixup should also be used for the darknet53.conv pre-trained weights:
@AlexeyAB yes this sounds great. I like the simplicity of this idea - it's just an additional augmentation strategy that can be implemented in a few lines of code. The question is - can it be easily supported in Darknet training? Ideally, this should be an optional line in the config file...
@dkashkin Yes, maybe only 2 lines are required in the cfg-file:
mixup=1
freebies_alpha_beta=1.5
for B(1.5, 1.5) and a weighted loss. But it can be implemented in just a few lines only in Python - I am trying to understand what they do for mixup (bag of freebies), so as not to miss important details and points: https://arxiv.org/pdf/1710.09412v2.pdf
Here is numpy.random.beta(): https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html
Beta distribution: https://www.astroml.org/book_figures/chapter3/fig_beta_distribution.html And wiki: https://en.wikipedia.org/wiki/Beta_distribution
A Beta distribution in C would be something like this - it's not even the most complex implementation :) https://github.com/numpy/numpy/issues/688
Beta distribution in C++: https://gist.github.com/sftrabbit/5068941 and SO: https://stackoverflow.com/questions/15165202/random-number-generator-with-beta-distribution
#include <iostream>
#include <sstream>
#include <string>
#include <random>
namespace sftrabbit {
template <typename RealType = double>
class beta_distribution
{
public:
typedef RealType result_type;
class param_type
{
public:
typedef beta_distribution distribution_type;
explicit param_type(RealType a = 2.0, RealType b = 2.0)
: a_param(a), b_param(b) { }
RealType a() const { return a_param; }
RealType b() const { return b_param; }
bool operator==(const param_type& other) const
{
return (a_param == other.a_param &&
b_param == other.b_param);
}
bool operator!=(const param_type& other) const
{
return !(*this == other);
}
private:
RealType a_param, b_param;
};
explicit beta_distribution(RealType a = 2.0, RealType b = 2.0)
: a_gamma(a), b_gamma(b) { }
explicit beta_distribution(const param_type& param)
: a_gamma(param.a()), b_gamma(param.b()) { }
void reset() { }
param_type param() const
{
return param_type(a(), b());
}
void param(const param_type& param)
{
a_gamma = gamma_dist_type(param.a());
b_gamma = gamma_dist_type(param.b());
}
template <typename URNG>
result_type operator()(URNG& engine)
{
return generate(engine, a_gamma, b_gamma);
}
template <typename URNG>
result_type operator()(URNG& engine, const param_type& param)
{
gamma_dist_type a_param_gamma(param.a()),
b_param_gamma(param.b());
return generate(engine, a_param_gamma, b_param_gamma);
}
result_type min() const { return 0.0; }
result_type max() const { return 1.0; }
RealType a() const { return a_gamma.alpha(); }
RealType b() const { return b_gamma.alpha(); }
bool operator==(const beta_distribution<result_type>& other) const
{
return (param() == other.param() &&
a_gamma == other.a_gamma &&
b_gamma == other.b_gamma);
}
bool operator!=(const beta_distribution<result_type>& other) const
{
return !(*this == other);
}
private:
typedef std::gamma_distribution<result_type> gamma_dist_type;
gamma_dist_type a_gamma, b_gamma;
template <typename URNG>
result_type generate(URNG& engine,
gamma_dist_type& x_gamma,
gamma_dist_type& y_gamma)
{
result_type x = x_gamma(engine);
return x / (x + y_gamma(engine));
}
};
template <typename CharT, typename RealType>
std::basic_ostream<CharT>& operator<<(std::basic_ostream<CharT>& os,
const beta_distribution<RealType>& beta)
{
os << "~Beta(" << beta.a() << "," << beta.b() << ")";
return os;
}
template <typename CharT, typename RealType>
std::basic_istream<CharT>& operator>>(std::basic_istream<CharT>& is,
beta_distribution<RealType>& beta)
{
std::string str;
RealType a, b;
if (std::getline(is, str, '(') && str == "~Beta" &&
is >> a && is.get() == ',' && is >> b && is.get() == ')') {
beta = beta_distribution<RealType>(a, b);
} else {
is.setstate(std::ios::failbit);
}
return is;
}
}
void data_augmentation(...) {
    std::random_device rd;
    std::mt19937 gen(rd());
    // beta_val1 = 1.5, beta_val2 = 1.5 for B(1.5, 1.5)
    sftrabbit::beta_distribution<double> beta_distr_obj(beta_val1, beta_val2);
    double beta_distribution = beta_distr_obj(gen);
    float alpha_blend = beta_distribution;
    float beta_blend = 1 - beta_distribution;
    cv::addWeighted(src1, alpha_blend, src2, beta_blend, 0.0, dst); // mixup images
    fuse_labels(src_label1, alpha_blend, src_label2, beta_blend, new_label); // mixup labels
}
The implementation of mixup training is straightforward, and introduces a minimal computation overhead. Figure 1a shows the few lines of code necessary to implement mixup training in PyTorch. Finally, we mention alternative design choices. First, in preliminary experiments we find that convex combinations of three or more examples with weights sampled from a Dirichlet distribution does not provide further gain, but increases the computation cost of mixup. Second, our current implementation uses a single data loader to obtain one minibatch, and then mixup is applied to the same minibatch after random shuffling. We found this strategy works equally well, while reducing I/O requirements. Third, interpolating only between inputs with equal label did not lead to the performance gains of mixup discussed in the sequel. More empirical comparison can be found in Section 3.8.
What is mixup doing? The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model f to behave linearly in-between training examples. We argue that this linear behaviour reduces the amount of undesirable oscillations when predicting outside the training examples. Also, linearity is a good inductive bias from the perspective of Occam’s razor, since it is one of the simplest possible behaviors. Figure 1b shows that mixup leads to decision boundaries that transition linearly from class to class, providing a smoother estimate of uncertainty. Figure 2 illustrate the average behaviors of two neural network models trained on the CIFAR-10 dataset using ERM and mixup. Both models have the same architecture, are trained with the same procedure, and are evaluated at the same points in-between randomly sampled training data. The model trained with mixup is more stable in terms of model predictions and gradient norms in-between training samples.
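As far as I understand, the core of the method is just this interpolation, where alpha_blend and beta_blend in the sketch above play the role of lambda and (1 - lambda):

  x_mix = lambda * x_i + (1 - lambda) * x_j
  y_mix = lambda * y_i + (1 - lambda) * y_j,   with lambda ~ Beta(alpha, alpha), alpha > 0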
Also, what they do for LSR (class label smoothing) is described very academically here: https://arxiv.org/pdf/1512.00567v3.pdf
@AlexeyAB sorry for the delay! I missed your reply. It might be more reliable to discuss this over email (kashkin at gmail). I agree with you - some papers describe mixup via heavy math, resulting in unnecessary complexity. I like the following visual explanation much better: Assuming that this infographic captures all the important concepts, this augmentation should be easy to implement. If darknet uses OpenCV, the alpha blending can be done by calling cv2.addWeighted...
@dkashkin Yes, maybe we can try to implement mixup with a fixed alpha_blending = 1 - alpha_blending = 0.5
and without weighted loss. It will be much simpler:
@AlexeyAB I think this would be a great starting point! P.S. I have one YOLO-specific idea that might also be interesting to test. Since YOLO training datasets usually include a lot of unlabeled images, we could restrict the mixup algorithm to blending each "labeled" image with a randomly selected "unlabeled" image. This approach still adds noise to the training images without creating any new overlapping bounding boxes. My assumption is that the classic mixup strategy can cause some classification problems. For example, if you blend a cat photo with a dog photo, you might end up with one training image that has two identical bounding boxes with different labels (cat and dog). I am afraid such images can make it harder for the neural network to differentiate cats and dogs. I would not be surprised if a mixup strategy that avoids such overlaps can outperform the original mixup algorithm from the whitepaper...
@dkashkin Hi,
I added Mixup data augmentation.
You should add just 1 parameter mixup=1 in the [net] section, and Mixup with alpha=beta=0.5 will be applied to 50% of the images, without weighted loss.
If you want to see the result of data augmentation, use the flag -show_imgs in the training command:
Thanks @AlexeyAB ! I'm running a test now, I'll report back on the results shortly. Question: Presumably it would be a bad idea to set this option with the LSTM models? As I imagine this would mess up the frame to frame continuity which the model depends on?
Running the latest repo with the mixup option is causing the process to be "killed":
This happened twice; the last time was on iteration 345.
2080Ti, compiled with OpenCV, mixed precision.
my_stuff/train.sh: line 1: 13095 Killed ./darknet detector train my_stuff/obj.data my_stuff/yolov3-tiny_3l.cfg my_stuff/yolov3-tiny.conv.15 -dont_show -mjpeg_port 8090 -map -letter_box
train.sh ./darknet detector train my_stuff/obj.data my_stuff/yolov3-tiny_3l.cfg my_stuff/yolov3-tiny.conv.15 -dont_show -mjpeg_port 8090 -map -letter_box
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=4
width=544
height=544
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
mixup=1
learning_rate=0.001
burn_in=1000
max_batches = 16000
policy=steps
steps=12800,14400
scales=.1,.1
An identical model that I trained yesterday was fine; the only difference is that yesterday's model used the repo as it existed yesterday and didn't have the mixup flag set.
I just tried running the same again without -letter_box and it crashed in the same way. Running again now without mixup=1 and it seems to be going fine so far.
@LukeAI Hi, can you share your cfg-file and dataset if it isn't private?
How much CPU-RAM do you have?
Question: Presumably it would be a bad idea to set this option with the LSTM models? As I imagine this would mess up the frame to frame continuity which the model depends on?
In general, I think it should work, since two sequences will be mixed up (not just random images).
The dataset is a subset of Google OpenImages yolov3-tiny_3l.cfg.txt train.txt
@LukeAI
Running the latest repo with the mixup option is causing the process to be "killed":
I fixed it.
Hey all - just to give some feedback, I found that mixup slightly hurt my AP in all classes when using the above cfg and dataset at the final validation. (I didn't try older weights, but the chart.png looked fairly flat.)
For some reason, mixup led to a very similar validation accuracy but a higher, noisier loss.
Just to share my results from the KITTI dataset trained on yolov3_tiny_3l: the curve with the higher loss and lower validation is the one with mixup=1.
@LukeAI
As I see, there are only training (labeled) and testing (un-labeled) folders in data_object_image_2.zip.
Did you use the Training dataset for validation?
What script did you use to convert labels to Yolo format?
Did you try to train the LSTM-model on KITTI tracking? http://www.cvlibs.net/datasets/kitti/eval_mots.php
I randomly split the labelled KITTI OBJECT2D dataset into 85% training, 15% testing.
I wrote my own script to do the conversions: kitti2yolo.py.txt
@LukeAI So it seems that Mixup doesn't increase mAP in most cases, or it requires more iterations. Or mixup should be used for training the Classification model that will then be used as pre-trained weights for training the Detector.
yolov3-tiny_3l.cfg.txt This is the config I was using; I don't know if it's relevant, but I was never able to correctly set the anchors as described: https://github.com/AlexeyAB/darknet/issues/3372#issuecomment-500391029
Maybe - I notice in the paper that gains were greater for deeper models - maybe yolov3-tiny is too small to be able to extract the latent info from the mixups without just being confused by them.
@LukeAI
maybe yolov3-tiny is too small to be able to extract the latent info from the mixups without just being confused by them.
Yes, so maybe it will increase mAP for yolo_v3_spp_pan_scale.cfg.txt or yolo_v3_spp_pan.cfg.txt
Yes, so maybe it will increase mAP for yolo_v3_spp_pan_scale.cfg.txt or yolo_v3_spp_pan.cfg.txt
If I get the GPU time to try it, I will do so and report back here.
@AlexeyAB
@WongKinYiu Can you show the paper?
Thanks for sharing.
@WongKinYiu I added MixUp and CutMix for Classifier training: https://github.com/AlexeyAB/darknet/issues/4419
@AlexeyAB Great!
I need about 2~3 weeks to train a classifier.
@WongKinYiu
I added all 5 features from the BoF (Bag of Freebies). Have you tested them?
Also there is a new implementation of mosaic=1
for the Detector that should be better: https://github.com/AlexeyAB/darknet/issues/4264#issuecomment-562934711
@AlexeyAB Hello, I'm still on holiday.
Could you please give an example of the [net] and [yolo] layers for the suggested data aug and ciou norm hyper-parameters? (I'm training a model with [net] mosaic=1 and [yolo] ciou_loss, iou_n=0.07, uc_n=0.07.) Thanks.
By the way, cutmix currently performs much better than mixup for training a classifier.
@WongKinYiu
[net]
label_smooth_eps=0.1 # for training classifier
mosaic=1 # for Detector
learning_rate=0.112
momentum=0.949
policy=sgdr
sgdr_cycle=1000 # set the same as max_batches=
sgdr_mult=2
[yolo]
nms_kind = diounms
beta_nms = 0.6
scale_x_y = 1.05
label_smooth_eps=0.1
iou_thresh=0.213
iou_normalizer = 0.1
uc_normalizer = 0.1
iou_loss=ciou
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
@AlexeyAB thanks!
@AlexeyAB
For training the classifier, the loss becomes very large when setting label_smooth_eps=0.1: 7xxx.x avg. Is it normal? (Without label_smooth_eps=0.1 it is 6.x avg.)
And do you mean I should change the 1000 in sgdr_cycle=1000 # set the same as max_batches= to the same value as max_batches, for example sgdr_cycle=800000?
@WongKinYiu
For training the classifier, the loss becomes very large when setting label_smooth_eps=0.1: 7xxx.x avg. Is it normal? (Without label_smooth_eps=0.1 it is 6.x avg.)
I don't know.
Theoretically Loss should be the same, since (label_smooth_eps / (classes - 1)) * (classes - 1) + (1 - label_smooth_eps) = 1
in https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15#diff-2ceac7e68fdac00b370188285ab286f7R526
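For example, with classes = 80 and label_smooth_eps = 0.1, the true class gets 0.9 and each of the other 79 classes gets 0.1/79 ≈ 0.00127, so the targets still sum to 79 * (0.1/79) + 0.9 = 1.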
The only thing I did not fully understand from the label-smoothing paper is whether we just have to set the probabilities (1 - label_smooth_eps) for the truth class and (label_smooth_eps / (classes - 1)) for the other classes, as I did here: https://github.com/AlexeyAB/darknet/commit/2a873f34485c75d44a346f92ba7dcf2e2aa57a15#diff-2ceac7e68fdac00b370188285ab286f7R526
or whether we should use the thresholds (1 - label_smooth_eps) for the truth class and (label_smooth_eps / (classes - 1)) for the other classes, as there: https://github.com/AlexeyAB/darknet/blob/7ae1ae5641b549ebaa5c816701c4b9ca73247a65/src/blas_kernels.cu#L781-L790
if(truth && p < (1 - label_smooth_eps)) delta = 1 - p;
else if (!truth && p < (label_smooth_eps / (classes - 1))) delta = 0 - p;
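Just to make the first alternative concrete, here is a minimal sketch (my own illustration, not the darknet code), assuming a one-hot truth[] vector of 0s and 1s:

  // Alternative 1: replace the hard one-hot targets by smoothed probabilities.
  // The usual delta[i] = truth[i] - p[i] is then computed from the smoothed targets.
  void smooth_labels(float *truth, int classes, float label_smooth_eps)
  {
      for (int i = 0; i < classes; ++i) {
          truth[i] = truth[i] ? (1.0f - label_smooth_eps)
                              : (label_smooth_eps / (classes - 1));
      }
  }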
And do you mean I should change 1000 to the same value as max_batches, for example sgdr_cycle=800000?
Yes, so it will have only 1 cycle; therefore, you don't have to calculate max_batches so that the training ends exactly at the end of one of the cycles.
Also I fixed it so that sgdr_cycle=max_batches by default, if you don't specify sgdr_cycle= in the cfg:
https://github.com/AlexeyAB/darknet/commit/764872a190d83b6f149220c6ec8aa0df1d2d5e49#diff-bfbbcdf73459e9ea8fb4afa8455ce74dR909
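So with your numbers, a single-cycle SGDR schedule could simply look like this (using max_batches=800000 from your example):

  [net]
  policy=sgdr
  max_batches=800000
  sgdr_cycle=800000   # one cosine cycle over the whole training run
                      # (or just omit sgdr_cycle= and it now defaults to max_batches)
  #sgdr_mult=2        # cycle-length multiplier; with a single cycle it should have no effect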
@AlexeyAB
Thanks, I'll check the training stats tomorrow and get back to you.
@AlexeyAB
Although the loss continuously increases to 9xxx, the behavior seems normal.
@WongKinYiu
the behavior seems normal.
Do you mean that accuracy is increasing?
@AlexeyAB yes.
@WongKinYiu @AlexeyAB FYI I found two different implementations of label smoothing in the TensorFlow code:
(1) y_true * (1.0 - label_smoothing) + (label_smoothing / num_classes)
(2) y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing
where I assume label_smoothing=0.1 is a typical smoothing value.
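For example, with label_smoothing=0.1 and num_classes=80 the two formulas give:

  (1) true class: 0.9 + 0.1/80 = 0.90125,  other classes: 0.1/80 = 0.00125
  (2) true class: 0.9 + 0.05 = 0.95,       other classes: 0.05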
@glenn-jocher @WongKinYiu
Yes, previously we used (1) in both the Classifier and the Detector.
Now we use (1) for the Classifier (Softmax) and (2) for the Detector (Logistic).
@AlexeyAB ah perfect then!
I get NaN when setting label_smooth_eps=0.1 in [yolo] layers.
Also, the new cfg-file csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt seems unable to converge.
Does anyone know how to use CmBN in cspdarknet53?
And another question about label smoothing: should I add label_smooth_eps in both [net] and all [yolo] layers?
class label smoothing - +0.5 mAP - use label_smooth_eps=0.1 or 0.01 in the [yolo] or [Gaussian_yolo] layers - is added: 318919e and [net] label_smooth_eps=0.1 for Classifier - is added: 2a873f3
Does anyone knows how to use CmBN in cspdarknet53?
Change all batch_normalization=1 to batch_normalization=2.
Change all batch_normalization=1 to batch_normalization=2.
Thanks for the reply. I did this using yolov4-custom.cfg,
but as shown in the picture, the IoU and GIoU are 0, and Class, Obj, No Obj are 0 as well.