Open wwt17 opened 6 years ago
I found your code measuring norms and clipping gradients for each parameter separately. But in the paper, the authors said "the l2 norm of the whole gradient of all parameters..." and your method was used in QA tasks.
I found your code measuring norms and clipping gradients for each parameter separately. But in the paper, the authors said "the l2 norm of the whole gradient of all parameters..." and your method was used in QA tasks.