Stepping through this repo, I found an interesting (possible) typo: the hidden states are dropped out twice. Why is it implemented this way instead of using a single, higher dropout rate? Is it a typo? And if it is, was this typoed code used for pre-training?
https://github.com/THUDM/Chinese-Transformer-XL/blob/0451869ee1c435929fcf5851e4a86a8b228a5e8f/mpu/transformer.py#L534
https://github.com/THUDM/Chinese-Transformer-XL/blob/0451869ee1c435929fcf5851e4a86a8b228a5e8f/mpu/transformer.py#L540
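For reference, here is a small stdlib-only sketch (not the repo's actual code) of why the two are related: applying dropout with probability p twice keeps each unit with probability (1-p)^2, so it behaves in expectation like a single dropout with effective rate 1-(1-p)^2 (e.g. p = 0.1 applied twice ≈ a single rate of 0.19):

```python
import random

random.seed(0)
p = 0.1
n = 1_000_000

def dropout(x, p):
    # Standard inverted dropout: zero each unit with probability p,
    # scale the survivors by 1 / (1 - p) to preserve the expectation.
    return [0.0 if random.random() < p else v / (1 - p) for v in x]

x = [1.0] * n
# Hidden states dropped out twice, as in the lines linked above.
y = dropout(dropout(x, p), p)

rate = sum(1 for v in y if v == 0.0) / n
print(f"empirical drop rate: {rate:.3f}")  # ~ 1 - (1 - 0.1)**2 = 0.19
```

So if the double application is unintentional, the model was effectively trained with a higher dropout rate than the configured p.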