Swish activation instead of Leaky in darknet +1% Top1 (or + ~1% mAP@0.5)

primepake commented 5 years ago

I read https://arxiv.org/pdf/1710.05941.pdf and found that swish activation works well in EfficienetNet so replace Leaky with Swish ?

AlexeyAB commented 5 years ago

For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2

May be will be added

LukeAI commented 5 years ago

I notice that you added Swish! How can I try it out? If I just substitute all occurences of "leaky" with "swish" in yolo-v3-spp.cfg by running sed 's/leaky/swish/' yolo_v3_spp.cfg will that do it?

AlexeyAB commented 5 years ago

@LukeAI

If I just substitute all occurences of "leaky" with "swish" in yolo-v3-spp.cfg by running sed 's/leaky/swish/' yolo_v3_spp.cfg will that do it?

Yes

AlexeyAB commented 5 years ago

@LukeAI Hi,

What does it mean Full Prescision ?

What is the difference between Full Prescision and Swish?

There is no difference between attached files:

yolo_v3_spp.cfg.txt yolo_v3_spp_swish.cfg.txt

LukeAI commented 5 years ago

By full prescision I mean trained without mixed precision ie. using a recent version of the repo. ok it looks like I missed out the all important "-i" in my sed command so yes, I didn't actually train a swish network I just trained a copy of the baseline... I'll report back here in a couple of days... Not sure why there is that .21% difference.

AlexeyAB commented 5 years ago

@LukeAI

Not sure why there is that .21% difference.

Its random fluctuation.

By full prescision I mean trained without mixed precision ie. using a recent version of the repo.

Did you get the mAP lower with FP32 (61%) than with FP16 (64%)?

LukeAI commented 5 years ago

The difference in the two graphs above is mostly because one only shows 10000 iterations and the other shows 25000 iterations. It was about 4% lower when I trained with cudnn_half: chart-9-16

LukeAI commented 5 years ago

swish

Swish gave me a 1% AP improvement! Nice. dunno if it comes with a higher computational cost or whatever but probably worth it!

LukeAI commented 5 years ago

regular yolov3-spp with leaky for comparison: full_prescision (I deleted my previous post where I made a mistake to avoid confusion.)

AlexeyAB commented 5 years ago

@LukeAI Thanks! It seems swish works weel and give + ~1%mAP and + ~1% Top1 accuracy as stated.

ChenCong7375 commented 5 years ago

so I can change [leaky] to [swish] in conv layers to get improvement ?

AlexeyAB commented 5 years ago

@ChenCong7375 Yes, and train from scratch.

LukeAI commented 5 years ago

what do you mean train from scratch? I trained from the regular darknet53.conv.74 - do you mean not using pretrained weights at all?

AlexeyAB commented 5 years ago

@LukeAI

I think yes, you should train withtout pre-trained weights, because no one pre-trained weights is trained with swish activation. But you can try both cases.

LukeAI commented 5 years ago

Preliminary results for pan2_tiny with "leaky" replaced with "swish". I started with the pretrained weights (using leaky relu) so It looks like I need to train for longer to realise the potential of swish with this network.

pan2: pan2

pan2 with swish pan2_swish

AlexeyAB commented 5 years ago

@LukeAI

I started with the pretrained weights (using leaky relu) so It looks like I need to train for longer to realise the potential of swish with this network.

Yes. Or may be even you should train Classifier with swish on ImageNet, and only then use this pre-trained weights to train Detector with swish.

glenn-jocher commented 5 years ago

@LukeAI @AlexeyAB I was considering adding Swish or PRELU activations to our PyTorch YOLOv3 repo (https://github.com/ultralytics/yolov3). It seems like the conclusion on the topic is that Swish improves ~1% mAP when training from randomly initialized weights, but actually hurts results significantly when training from a darknet53.conv.74 backbone. Is this correct?

Do you have any experience applying PRELU(0.1) in place of LeakyRELU(0.1)? PRELU would allow the use of the existing darknet53.conv.74 backbone when all the slope parameters are initialized to 0.1 values.

AlexeyAB commented 5 years ago

@glenn-jocher Just because darknet53.conv.74 is trained with ReLU.

So you should

train darknet53.cfg (224x224 (Swish instead of ReLU)) on ImageNet,
then by using darknet53_last.weights train darknet53_448.cfg (448x448 (Swish instead of ReLU)) on ImageNet
then do partial to get darknet53.conv.74
and then train yolov3-spp.cfg with (Swish instead of ReLU)

glenn-jocher commented 5 years ago

@AlexeyAB thanks for the ideas. So darknet yolov3 is trained this way, from imagenet twice at 224 and 448 and then using that to create darknet53.conv.74? This works well even though imagenet is a classification only dataset?

In my recent tests on ultralytics/yolov3, only swish seemed to provide any sort of performance improvement. I'm not sure the performance improvement justifies the extra memory requirements though. Inference speed is unaffected, but since swish is not an in-place operation I observed GPU ram increase by about 30% after replacing in-place LRELU with it. See https://github.com/ultralytics/yolov3/issues/441#issuecomment-520229791

LukeAI commented 5 years ago

@glenn-jocher swish didn't hurt performance when using pretrained weights, it gave an extra +1%, it just took a lot longer to converge because the pretrained weights were trained with leaky-relu. I believe that pretrained weights from imagenet, using swish would converge much faster and have higher AP because when you have to train for so long to catch up with the new activations there are probably some aspects of the network that are overfitting.

LukeAI commented 5 years ago

something else that may help in getting better pretrained weights for transferring from is the procedure described here: https://openreview.net/forum?id=Bygh9j09KX I'm currently running an experiment to see if this improves tiny-yolo - even with tiny it takes a really really long time tho. I'll share the results and weights when I have them.

AlexeyAB commented 5 years ago

@glenn-jocher

@AlexeyAB thanks for the ideas. So darknet yolov3 is trained this way, from imagenet twice at 224 and 448 and then using that to create darknet53.conv.74? This works well even though imagenet is a classification only dataset?

Yes.

In my recent tests on ultralytics/yolov3, only swish seemed to provide any sort of performance improvement. I'm not sure the performance improvement justifies the extra memory requirements though. Inference speed is unaffected, but since swish is not an in-place operation I observed GPU ram increase by about 30% after replacing in-place LRELU with it. See ultralytics/yolov3#441 (comment)

So

swish requires +50% more GPU
PRELU requires +30% more GPU

https://github.com/ultralytics/yolov3/issues/441

scale_xy=1.2: 42.3 > 43.4 > 44.3

What does it mean, these 3 values?

glenn-jocher commented 5 years ago

@AlexeyAB yes, in pytorch LeakyReLU can run 'in-place'. PReLU and Swish can not run in-place, a copy of the data is created for the operation, increasing peak RAM usage.

The 42.3 > 43.4 > 44.3 values are (--conf-thres 0.1 mAP) > (--conf-thres 0.001 mAP) > (--conf-thres 0.001 mAP with pycocotools). Basically only use the last value (44.3 mAP) for comparison. The others are lower because our own ultralytics/yolov3 mAP code doesn't exactly match up to pycocotools (not sure why), and we test on 0.1 conf-thres to speed up testing during training.

@LukeAI yeah I saw that effect in the AlexeyAB Swish plots, that the swish benefit took a long time to realize. Its funny, because on ultralytics/yolov3 the swish gain is largest at the beginning, but then seems to disappear over time (trending towards the default LeakyReLU mAP as epochs increase).

BTW, independent of this discussion, it seems that full training (default settings, LeakyReLU etc) shows no benefit of using a backbone on ultralytics/yolov3. Two users trained yolov3-320 fully to 273 epochs and ended up within 1% mAP of each other, one with the darknet53.conv.74 backbone (49.3 mAP) and one without (50.5 mAP). Clearly the early results show a huge backbone boost, but the final results converged similarly.

LukeAI commented 5 years ago

Yeah, this paper suggests that imagenet pretraining might not be worth it: https://arxiv.org/abs/1811.08883 but there is a response paper which i can't find that suggests pretraining does help in some ways, in some cases. Faster convergence is obviously good but beyond that, better AP if your dataset is very small and supposedly better tolerance to dataset imbalance. (possibly related to the notion of overfitting in some aspects whilst still converging in others?) It's interesting that somebody actually got worse results using pretrained weights. was everything else identical? dataset etc.

glenn-jocher commented 5 years ago

@AlexeyAB the exact GPU memory requirements on ultralytics/yolov3 are:

nn.LeakyReLU(0.1, inplace=True) 10.7G
nn.LeakyReLU(0.1, inplace=False) 13.0G (+22%)
nn.PReLU(num_parameters=1, init=0.10) 13.0G
def forward(self, x): return x * torch.sigmoid(x) swish 14.2G (+33%)
def forward(self, x): x *= torch.sigmoid(x); return x swish 14.5G (+36%)

python3 train.py --img-size 320 --batch-size 64 --accumulate 1  # test command

glenn-jocher commented 5 years ago

@LukeAI I suspect he actually got nearly the exact same results. He may not have used the actual pycocotools mAP, which generally reports about 1% higher than our in-house mAP. So rather than 49.3 his result is probably 50.3 in an apples to apples comparison, which is basically the same as 50.5. I asked him to check this but he has not responded yet.

These mAP intricacies of ours are creating confusion unfortunately. I think I am going to add code to our repo to automatically try pycocotools mAP if training completes, and to warn if not pycocotools is not installed.

glenn-jocher commented 5 years ago

@LukeAI ok can confirm now that the proper comparison is 50.2 with backbone (not 49.3) (https://github.com/ultralytics/yolov3/issues/310#issuecomment-521239832) vs 50.5 without backbone.

glenn-jocher commented 5 years ago

@LukeAI in my general experience, as long as the LR remains the same, a network will adapt from it's starting point to whatever minimizes the loss function, so I would not be surprised if a wide variety of initial conditions all converge the similar final results after enough epochs (273 epochs certainly seems like enough).

The greatest issue to me seems to be the dependence on hyperparameters, as they seem to take on rather arbitrary values, and evolving using a short-baseline fitness (i.e. mAP after epoch 0) is fast, but may not produce the best final results, since I've also observed a general negative correlation between short term gains and long term gains (i.e. what works best for the first few epochs may end up overtraining after 200 epochs). The alternative is to evolve hyps based on longer-baseline fitnesses (i.e. mAP after 27 epochs as I'm doing now), but this is painfully slow. As a rule of thumb our current evolution code needs at least 100 generations to converge to a minimum.

Also I see no reason why intial hyperparameters should not change over time. Currently LR seems to be the only hyp that is generally scheduled, but ideally we'd evolve hyp0 and hypFinal and schedule all of them over the course of training.

The number of possibilities one might consider is enough to give you a headache. Only companies like Google and Nvidia are in a position to actually test them all, the rest of us are working with hardware scraps that make it hard to get ahead :(

glenn-jocher commented 5 years ago

@LukeAI @AlexeyAB I just made a bit of a discovery today. I was trying to implement focal loss, and in the process I discovered that signals going to YOLO layer detection neurons should be initialized around -5 to -10, to represent intiali probabilities of 0.01 - 0.0001 . PyTorch defaults bias initializations to zero mean gaussians, which means that the first batch the mean probability of a YOLO detection neuron will be sigmoid(0) = 0.5 probability of a detection/classification, which is extremely high for all 10,000 objectness neurons to emit all at once. This causes instability in the early stages of training, which can be fixed by initializing to about -5 bias, giving sigmoid(-5) = 0.01, or 1% detection probability. I tested this experiment on our tiny 16-image dataset and noticed dramatic improvements in nearly all aspects of training, not just at the beginning but in the final results as well. I'm going to implement this as a permanent change in the next few days.

Alexey, I don't know if you are doing this already, but if not you should definitely try to implement. The relevant section is 3.3 in the Focal Loss paper. See https://github.com/ultralytics/yolov3/issues/460

AlexeyAB commented 5 years ago

I tested this experiment on our tiny 16-image dataset and noticed dramatic improvements in nearly all aspects of training, not just at the beginning but in the final results as well. I'm going to implement this as a permanent change in the next few days.

@glenn-jocher Hi,

Did you get these charts for 0 bias and -5 bias, both with focal_loss ?
What increase in accuracy (mAP) did you get with focal_loss and without focal_loss?

glenn-jocher commented 5 years ago

@AlexeyAB ha, well focal loss itself is a different story. The two runs above don't use focal loss, they use the default losses (BCE for obj and cls). The only difference is the bias initialization going into the YOLO layer detection neurons. The orange 'experiment' subtracts 5 from the default bias initialization of 0 (+/- some white noise). This leads all the detection neurons to output about 0.01 in the first batch rather than 0.50 on average.

Focal loss itself I need to work on some more. The initial results look good (maybe +0.5 to 1.0 mAP@50 on full COCO 320 at 27 epochs), but it's driving my precision down very low, flooding the output with FPs, making the results hard to use. The problem is that once you start weighing outputs differently, like class weights or focal loss, then the confidences coming from the objectness neurons are no longer representative. So for example Focal Loss is causing all the objectness values for detections to spike to 0.99 etc, making it very hard to establish a conf_thresh for testing and inference etc. BCE positive sample weighting has the same effect BTW. So it looks like it has potential, but I'm struggling to exploit it properly at the moment I think.

glenn-jocher commented 5 years ago

@AlexeyAB these are the top 3 results I have. I did not combine swish with focal loss below. These are for 320 size training for 27 epochs on full COCO starting from darknet.53.conv.74. You can see the focal loss run has terrible precision (even though it has the highest mAP), I need to work on it more.

default: 42.5
default + swish: 43.0
default + focal loss: 43.3

results

AlexeyAB commented 5 years ago

@glenn-jocher

So do you get +1% mAP without swish/focall loss, just by adding bias -5 to BCE for classes (not for objectness)?

What PAN version do you use, 1 or 2?

If you get high mAP, high Recall and low Precision, then just use high conf_thresh.

So for example Focal Loss is causing all the objectness values for detections to spike to 0.99 etc

focall_loss shouldn't affect on objectness at all:

there is focal loss: https://github.com/AlexeyAB/darknet/blob/4c315ea26b56c2bf20ebc240d94386c6e3cc83db/src/yolo_layer.c#L190-L208
in other place objectness: https://github.com/AlexeyAB/darknet/blob/4c315ea26b56c2bf20ebc240d94386c6e3cc83db/src/yolo_layer.c#L350

glenn-jocher commented 5 years ago

@AlexeyAB I used yolov3-spp-pan-scale.cfg, which is think is panv1, as @LukeAI said there was no panv2 available for full yolov3-spp.cfg, only tiny.

Yes, with low precision I need to increase my conf-thresh during detection.

My focal loss experiments just started the other day, so they are in the very early stages, but to be honest there are so many options that I don't exactly have a clear picture of how best to apply it. In my repo I can apply it to objectness BCE, and classification BCE both, one one or the other, or not at all, with varying combinations of BCE positive weight applied.

In many cases applying positive sample weights to BCE seems to produce similar results to FocalLoss, and I'm not sure the extent to which I should apply them together (or not). The BCE positive weights seem similar to the Focal Loss alpha parameter, but my evolved positives weights are > 1.0 (< 1.0 would penalize detections).

Essentially my two loss functions for objectness and classification are defined as:

BCEcls = nn.BCEWithLogitsLoss(pos_weight=ft([h['cls_pw']]))  # classification loss with positive weight
BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]))  # objectness loss with positive weight

and I can wrap these in a Focal Loss class that changes them to Focal Loss (with default alpha 1.0 gamma 2.0):

BCEcls = FocalLoss(nn.BCEWithLogitsLoss(pos_weight=ft([h['cls_pw']]))) # Focal loss with positive weight
BCEobj = FocalLoss(nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']])))  # Focal loss with positive weight

glenn-jocher commented 5 years ago

@AlexeyAB the long term effect (27 epochs result) of updating the classification input biases to -5 I don't know, I haven't run that yet. I think I will do that today. I'd expect it to help slightly or return the same results. In general the theory is very sound though, that you want all your classification and objectness neurons to output near zero (i.e. 0.01) initially to not create a gradient spike in the first batches.

glenn-jocher commented 5 years ago

@AlexeyAB ah!! I have new information!! I loaded up yolov3-spp.weights and looked at the Conv2d() layer before the last YOLO layer to analyze the pretrained network biases in this layer. It turns out I was half right before. The darknet-trained biases into the 3 YOLO layer are:

for l in model.yolo_layers:
    b = model.module_list[l - 1][0].bias.view(3, -1)  # bias 3x85
    print('regression: %.2f+/-%.2f, ' % (b[:, :4].mean(), b[:, :4].std()),
          'objectness: %.2f+/-%.2f, ' % (b[:, 4].mean(), b[:, 4].std()),
          'classification: %.2f+/-%.2f' % (b[:, 5:].mean(), b[:, 5:].std()))

# returns:
regression: 0.00+/-0.08,  objectness: -2.39+/-0.42,  classification: -0.14+/-0.03
regression: -0.00+/-0.06,  objectness: -2.44+/-0.20,  classification: -0.19+/-0.04
regression: -0.06+/-0.10,  objectness: -3.32+/-0.18,  classification: -0.43+/-0.10

This makes perfect sense, as the final layer has a lower probability of objectness per neuron since it has more neurons. So it appears a proper bias initialization for custom networks (using similar yolo loss functions) is about -3 for objectness and -0.3 for classification.

glenn-jocher commented 5 years ago

@AlexeyAB Ok I have epoch 0 results on yolov3-spp 320 (full coco, starting from darknet.53.conv.74 backbone). Default (no bias adjustment) achieves 0.10 mAP. Initializing all obj and cls neurons to -5 achieves 0.168 mAP (!).

Furthermore, if I analyze the biases of the two models after epoch 0, I see that the -5 neurons have trained to -3.8/-3.7 obj/cls, and the 0 neurons have trained to -0.4/-0.1, so they both seem to be trending from their starting points towards the observed trained values of -3/-0.3, albeit very slowly. One epoch is 1833 batches, and 1833 optimizer updates, so these biases appear to take a very long time to converge to their proper values (maybe 10+ coco epochs). So it appears initializing them correctly is more important than I'd thought before.

AlexeyAB commented 5 years ago

@glenn-jocher Thanks! Try to compare final mAP, and if final mAP for bias -5 will be higher, then it seems we should initialize objectness and class_probabilities to -5.

glenn-jocher commented 5 years ago

@AlexeyAB hmm ok, I ran to 27 epochs. I get 44.9 with the -5 bias, the same +0.3 improvement as swish (default is 44.6). So a small change, but it seems worthwhile.

Unfortunately it's the largest change in my https://github.com/ultralytics/yolov3/issues/441 experiments along with swish. Focalloss I need to examine more.

Did you ever make a panv2 for full yolov3?

AlexeyAB commented 5 years ago

@glenn-jocher

So +0.3 mAP without any overhead is a good improvement.

Did you ever make a panv2 for full yolov3?

Not yet. I just don't have a time.

AlexeyAB / darknet

Swish activation instead of Leaky in darknet +1% Top1 (or + ~1% mAP@0.5) #3464