fizyr / keras-retinanet

Keras implementation of RetinaNet object detection.

Initializing Conv Weights with Zero #712

Closed penguinmenac3 closed 5 years ago

penguinmenac3 commented 5 years ago

Where and What?

In the code for default_classification_model (retinanet.py#L66), the kernel weights are initialized with the zeros initializer; according to the Keras implementation (source code here), that really does set every weight to exactly 0.

Changing this might improve model performance.

Why is Zero Initialization a Big Deal?

As explained here: "Why random? Why not initialize them all to 0? An important concept here is called symmetry breaking. If all the neurons have the same weights, they will produce the same outputs and we won't be learning different features. We won't learn different features because during the backpropagation step, all the weight updates will be exactly the same. So starting with a randomized distribution allows us to initialize the neurons to be different (with very high probability) and allows us to learn a rich and diverse feature hierarchy.

Why mean zero? A common practice in machine learning is to zero-center or normalize the input data, such that the raw input features (for image data these would be pixels) average to zero."
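
To make the symmetry-breaking argument concrete, here is a minimal sketch (an illustrative toy example using tf.keras, not keras-retinanet code): when every unit of a hidden layer starts from identical weights, every unit receives an identical gradient update, so the units stay identical and never learn different features.

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(8, 4).astype("float32"))
y = tf.constant(np.random.rand(8, 1).astype("float32"))

# "ones" gives every hidden unit identical incoming weights (same symmetry
# problem as zeros, but avoids the gradients vanishing entirely in this toy setup).
hidden = tf.keras.layers.Dense(3, activation="tanh",
                               kernel_initializer="ones", bias_initializer="zeros")
out = tf.keras.layers.Dense(1, kernel_initializer="ones", bias_initializer="zeros")

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(out(hidden(x)) - y))

# Every column of the gradient (one column per hidden unit) is identical,
# so all hidden units get the same update and the symmetry is never broken.
print(tape.gradient(loss, hidden.kernel).numpy())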

Suggested Solution

Use a normal initializer such as:

keras.initializers.normal(mean=0.0, stddev=0.01, seed=None)

So the whole Conv2D statement should be as follows:

outputs = keras.layers.Conv2D(
        filters=num_classes * num_anchors,
        kernel_initializer=keras.initializers.normal(mean=0.0, stddev=0.01, seed=None),
        bias_initializer=initializers.PriorProbability(probability=prior_probability),
        name='pyramid_classification',
        **options
    )(outputs)

Did I Test my Solution?

No, I do not have the resources to test it at the moment.

hgaiser commented 5 years ago

That initialization is taken from their paper [1]. They initialize the layer with zero weights and non-zero bias values because the bias encodes a prior probability for the classification distribution (section 4.1, "Initialization" paragraph).

If you set the kernel values to non-zero, you wouldn't encode this prior probability. Whether what they're doing is the best solution, I don't know, but that is what the code is based on.

[1] https://arxiv.org/pdf/1708.02002.pdf

yhenon commented 5 years ago

Here is what the paper says:

All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = −log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of ∼π.

Re-reading it, I think we misunderstood things, and this issue is correct. It does indeed seem like the weights should not be set to 0 (note that it says a confidence of ∼π, not exactly π like we do). I don't believe this will make much of a difference (floating-point calculations are good enough for symmetry breaking in practice).
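
For reference, the bias value from the quoted passage is straightforward to compute; a minimal sketch, assuming the paper's default prior π = 0.01:

import math

pi = 0.01                          # prior probability pi from the paper
b = -math.log((1.0 - pi) / pi)     # bias init for the final classification conv
print(b)                           # ~ -4.595

# With zero kernel weights every logit starts at exactly b, so the initial
# foreground confidence sigmoid(b) is exactly pi for every anchor
# (with Gaussian-initialized weights it would only be ~pi, as the paper says).
print(1.0 / (1.0 + math.exp(-b)))  # ~ 0.01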

hgaiser commented 5 years ago

Ugh, I hate it when you read a paper 10 times and still miss small details >.>

Looking at Detectron (which I will assume is the correct implementation), it seems they do indeed initialize the weights with non-zero values:

https://github.com/facebookresearch/Detectron/blob/master/detectron/modeling/retinanet_heads.py#L136-L139

Would be nice to have a comparison to see if this makes any difference.

ps. @penguinmenac3 nice find!

penguinmenac3 commented 5 years ago

Yep, I just read the Detectron code too.

retnet_cls_pred = model.Conv(
    bl_feat,
    'retnet_cls_pred_fpn{}'.format(lvl),
    dim_in,
    cls_pred_dim * A,
    3,
    pad=1,
    stride=1,
    weight_init=('GaussianFill', {'std': 0.01}),
    bias_init=bias_init
)

hgaiser commented 5 years ago

Closed, thanks for looking into it!