Closed Johnson-yue closed 4 years ago
Yes, it usually takes 4x-6x longer than Adam, because ACGD runs a conjugate gradient solver and Hessian-vector products internally.
Part of the reason may be that the code is not optimal; improving the efficiency of the algorithm is an important piece of future work. Some parts are also limited by the PyTorch framework. For example, PyTorch only supports backprop, but the most efficient way to compute a Hessian-vector product is to combine backprop with forward-mode autodiff.
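For illustration, here is a minimal sketch of how a Hessian-vector product can be computed in plain PyTorch using double backprop (reverse-over-reverse), the pattern available without forward-mode autodiff. The quadratic loss and the `hvp` helper below are hypothetical, not part of the ACGD code:

```python
import torch

def hvp(loss, params, vec):
    # First backward pass: gradient with the graph retained
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Contract the gradient with the vector, then differentiate again
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (x ** 2).sum()            # Hessian of this loss is 2*I
v = (torch.tensor([1.0, 0.0]),)  # direction for the product
(hv,) = hvp(loss, (x,), v)
print(hv)                        # tensor([2., 0.])
```

A forward-over-reverse implementation would avoid building the second backward graph, which is part of why the pure-backprop approach above is slower.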
As a reminder, it's recommended to set `torch.backends.cudnn.benchmark = True`. Sometimes the cuDNN heuristics pick a truly atrocious algorithm for a couple of layers, which can make ACGD training extremely slow.
Can it be faster?
I think ACGD takes a very long time to train. Why is that?