…er initialization. See PDF page 251 in https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf. With the seed set as before, the bias may be initialized to a large negative value, leading to a negative input to the ReLU. This prevents any training, since all derivatives downstream of the ReLU are then zero (the "dying ReLU" problem).
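A minimal sketch of the failure mode, assuming a single `nn.Linear` layer followed by a ReLU; the bias value of `-10.0` is a hypothetical stand-in for an unluckily large negative initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)
with torch.no_grad():
    # Simulate an unlucky initialization: a large negative bias
    # (illustrative value, not from the original report).
    layer.bias.fill_(-10.0)

x = torch.randn(8, 4)  # typical inputs, far too small to offset the bias
out = torch.relu(layer(x))
out.sum().backward()

# The pre-activation is negative everywhere, so the ReLU outputs all zeros
# and passes back a zero gradient: the weights can never recover.
print(out.abs().max().item())                 # 0.0
print(layer.weight.grad.abs().max().item())   # 0.0
```

Because ReLU's derivative is zero for negative inputs, no gradient ever reaches the weights or the bias, so training is stuck regardless of the learning rate.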
See downstream issue here: https://github.com/pytorch/benchmark/pull/1927