Closed: hezw2016 closed this issue 2 years ago
Refer to a new paper: DeepNet: Scaling Transformers to 1,000 Layers
In fact, most of the models were initialized with a standard deviation of 0.02, since those experiments were done back in January. The experimental results show little difference between the two schemes, but I think the depth-dependent initialization makes more sense.
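For anyone reading along, here is a minimal sketch of how such a depth-dependent, DeepNet-style gain could enter a weight initializer. The helper name init_with_depth_gain, the choice of layer types, and the use of Xavier initialization are illustrative assumptions; only the gain formula (8 * network_depth) ** (-1/4) comes from dehazeformer.py.

```python
import torch.nn as nn


def init_with_depth_gain(m, network_depth):
    # DeepNet-style gain: (8 * N) ** (-1/4), where N is the network depth.
    # Deeper networks get a smaller gain, so residual branches start with
    # proportionally smaller weights at initialization.
    gain = (8 * network_depth) ** (-1 / 4)

    if isinstance(m, (nn.Linear, nn.Conv2d)):
        # Xavier init scaled by the depth-dependent gain, instead of a
        # fixed std of 0.02 that ignores the depth.
        nn.init.xavier_normal_(m.weight, gain=gain)
        if m.bias is not None:
            nn.init.zeros_(m.bias)


# Example usage (network_depth=24 is an arbitrary illustrative value):
# model.apply(lambda m: init_with_depth_gain(m, network_depth=24))
```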
Thank you very much for providing this interesting paper
Happy to help you
Hi, thank you for sharing this amazing work and code with us. However, I was confused by the _init_weights function in the dehazeformer.py file: "gain = (8 * self.network_depth) ** (-1/4)". Why are the initial weights related to network_depth and the constant 8?
Thank you again.
Best, Zewei