[x] Reference [13] is not really an ML paper, so it is somewhat odd that you cite it so often.
One possible workaround is to create your own categorisation and cite [13] only as the source of inspiration.
[x] A reference should allow readers to look up the cited work (in most cases).
Reference [15] does not meet this requirement.
[x] Missing a reference to (Blumenfeld et al., 2020).
[x] How did you go from equations (2.8)/(2.12) to (2.9)/(2.13)? There seem to be a few steps (or references) missing
[x] How did you solve the system of equations given by (2.15) and (2.16)?
PS: equation (2.16) should have the square outside of the ELU, and you can use \begin{cases}\end{cases} to typeset systems of equations.
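For illustration, a system of equations can be typeset with the amsmath cases environment like so (the equations below are generic placeholders, not the actual (2.15)/(2.16) from the thesis):

```latex
% requires \usepackage{amsmath} in the preamble
\begin{equation}
  \begin{cases}
    x + y = 1, \\
    x - y = 0.
  \end{cases}
\end{equation}
```

The cases environment left-aligns the equations behind a single brace, which reads better than stacking two separately numbered equations.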
[x] Sections 2 and 3 both seem to list various initialization methods, and it is not entirely clear (to me) how you divided the material between them.
Try to make a clear distinction between trivial/established methods and your own contributions.
PS: the title "background" implies that a section presents well-established concepts (no new contributions)
[x] There is a paper introducing PyTorch that you can/should cite: (Paszke et al., 2019).
[x] Note that MNIST is much older than reference [11] would suggest: (Bottou et al., 1994).
[x] In section 4.2: why are smaller variations of the signals advantageous?
[x] What is the deterministic initialization approach in Figure 5.5 exactly?
[x] Some of the loss curve plots (e.g. in figures 5.7 to 5.9) are hard to interpret.
Visibility often improves if you plot the loss on a logarithmic scale.
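As a sketch of what I mean, here is a minimal matplotlib example with a synthetic loss curve (the data is made up, since I obviously don't have yours); the key call is `set_yscale("log")`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical loss curve: exponential decay towards a small noise floor.
steps = np.arange(1, 1001)
loss = 2.0 * np.exp(-steps / 150.0) + 1e-3

fig, ax = plt.subplots()
ax.plot(steps, loss, label="training loss")
ax.set_yscale("log")  # log scale separates curves that crowd near zero
ax.set_xlabel("step")
ax.set_ylabel("loss (log scale)")
ax.legend()
fig.savefig("loss_log.png")
```

On a linear axis, the last 80% of such a curve is an indistinguishable flat line; the log axis spreads exactly that region out.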
[x] typo in section 5.2: gab -> gap
[x] do not force page breaks in scientific documents. Let LaTeX do its thing.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Muller, U. A., Sackinger, E., Simard, P., & Vapnik, V. (1994).
Comparison of classifier methods: a case study in handwritten digit recognition.
Proceedings of the 12th IAPR International Conference on Pattern Recognition, 2, 77–82.
https://doi.org/10.1109/ICPR.1994.576879
Blumenfeld, Y., Gilboa, D., & Soudry, D. (2020).
Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization?
Proceedings of the 37th International Conference on Machine Learning, 119, 960–969.
http://proceedings.mlr.press/v119/blumenfeld20a.html