aiqm / torchani

Accurate Neural Network Potential on PyTorch
https://aiqm.github.io/torchani/

CUAEV backward #554

Closed · yueyericardo closed 3 years ago

yueyericardo commented 3 years ago

Trying to implement the backward pass for cuaev.

zasdfgbnm commented 3 years ago

Reopening to trigger some code review tools.

yueyericardo commented 3 years ago

TorchANI backward (baseline):

2/2 [========] - 0s 236ms/step - rmse: 20.8458
   GPU Memory Cached (pytorch) :  3432.0MB / 8119.6MB (GeForce GTX 1080)
   GPU Memory Used (nvidia-smi):  4129.5MB / 8119.6MB (GeForce GTX 1080)
=> More detail about benchmark PER EPOCH
   Total AEV - 84.3 ms
   Forward - 13.2 ms
   Backward - 182.0 ms
   Force - 181.2 ms
   Optimizer - 9.1 ms
   Others - 3.7 ms
   Epoch time - 473.5 ms

Initial cuaev backward:

2/2 [========] - 1s 663ms/step - rmse: 1323.6314
   GPU Memory Cached (pytorch) :  1634.0MB / 8119.6MB (GeForce GTX 1080)
   GPU Memory Used (nvidia-smi):  2331.5MB / 8119.6MB (GeForce GTX 1080)
=> More detail about benchmark PER EPOCH
   Total AEV - 22.7 ms
   Forward - 13.9 ms
   Backward - 637.2 ms
   Force - 639.9 ms
   Optimizer - 8.8 ms
   Others - 3.8 ms
   Epoch time - 1.326 sec
yueyericardo commented 3 years ago

After using shared memory to avoid global atomicAdd (a sketch of the idea follows the numbers below)

share_mem needed: 23840
2/2 [========] - 1s 310ms/step - rmse: 1323.6313
   GPU Memory Cached (pytorch) :  1634.0MB / 8119.6MB (GeForce GTX 1080)
   GPU Memory Used (nvidia-smi):  2331.5MB / 8119.6MB (GeForce GTX 1080)
=> More detail about benchmark PER EPOCH
   Total AEV - 22.7 ms
   Forward - 13.9 ms
   Backward - 285.4 ms
   Force - 286.3 ms
   Optimizer - 9.1 ms
   Others - 3.8 ms
   Epoch time - 621.2 ms
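
For context, a minimal sketch of the idea (illustrative only; the names and data layout are made up and this is not the actual cuaev kernel): pair contributions are first accumulated with cheap shared-memory atomics inside a block, and only one global atomicAdd per output component is issued at the end.

```cuda
#include <cuda_runtime.h>

// Illustrative only: each block owns one center atom, accumulates the gradient
// contributions of all its pairs in shared memory, and issues a single global
// atomicAdd per component at the end instead of one per pair.
__global__ void accumulate_atom_grad(const float3* __restrict__ pair_grad,
                                     const int* __restrict__ pair_start,
                                     const int* __restrict__ pair_count,
                                     float3* __restrict__ atom_grad) {
  __shared__ float sx, sy, sz;
  if (threadIdx.x == 0) { sx = sy = sz = 0.0f; }
  __syncthreads();

  const int atom  = blockIdx.x;
  const int start = pair_start[atom];
  const int count = pair_count[atom];

  // Shared-memory atomics are much cheaper than global ones.
  for (int j = threadIdx.x; j < count; j += blockDim.x) {
    float3 g = pair_grad[start + j];
    atomicAdd(&sx, g.x);
    atomicAdd(&sy, g.y);
    atomicAdd(&sz, g.z);
  }
  __syncthreads();

  // One global atomicAdd per component per block.
  if (threadIdx.x == 0) {
    atomicAdd(&atom_grad[atom].x, sx);
    atomicAdd(&atom_grad[atom].y, sy);
    atomicAdd(&atom_grad[atom].z, sz);
  }
}
```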
yueyericardo commented 3 years ago

After optimizing some pow() calls (see the sketch after the numbers below)

2/2 [========] - 0s 133ms/step - rmse: 1323.6313
   GPU Memory Cached (pytorch) :  1434.0MB / 8119.6MB (GeForce GTX 1080)
   GPU Memory Used (nvidia-smi):  2131.5MB / 8119.6MB (GeForce GTX 1080)
=> More detail about benchmark PER EPOCH
   Total AEV - 23.4 ms
   Forward - 13.9 ms
   Backward - 105.6 ms
   Force - 106.4 ms
   Optimizer - 13.2 ms
   Others - 4.0 ms
   Epoch time - 266.6 ms
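
For illustration, the kind of rewrite this refers to (a generic sketch, not the actual cuaev source; the formula and parameter names are placeholders): small integer powers are computed with explicit multiplications instead of going through the much slower generic powf() path.

```cuda
// Illustrative only: replace powf(x, 2.0f)-style calls with multiplications.
__device__ __forceinline__ float square(float x) { return x * x; }

// before (generic pow path):
//   float term = powf(fc, 2.0f) * expf(-eta * powf(r - Rs, 2.0f));
// after:
__device__ float radial_term(float fc, float r, float Rs, float eta) {
  return square(fc) * expf(-eta * square(r - Rs));
}
```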
yueyericardo commented 3 years ago

Removed (warpsize * nbr) of unnecessary shared memory and used warp-level aggregation to accumulate the gradient (sketch after the numbers below).

2/2 [========] - 0s 102ms/step - rmse: 1323.6313
   GPU Memory Cached (pytorch) :  1434.0MB / 8119.6MB (GeForce GTX 1080)
   GPU Memory Used (nvidia-smi):  2131.5MB / 8119.6MB (GeForce GTX 1080)
=> More detail about benchmark PER EPOCH
   Total AEV - 23.2 ms
   Forward - 14.0 ms
   Backward - 76.7 ms
   Force - 76.5 ms
   Optimizer - 9.6 ms
   Others - 3.8 ms
   Epoch time - 203.8 ms
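
A minimal sketch of the warp-level aggregation idea (illustrative only, not the actual cuaev kernel): each lane reduces its contribution across the warp with shuffles, so no shared-memory staging buffer is needed and only one atomic per warp reaches memory.

```cuda
#include <cuda_runtime.h>

// Illustrative only: sum a per-lane value across the 32 lanes of a full warp.
__device__ __forceinline__ float warp_reduce_sum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffff, v, offset);
  return v;
}

// Lane 0 writes the warp's total: one atomicAdd per warp instead of 32,
// and no (warpsize * nbr) shared-memory buffer to stage the partial sums.
__device__ void accumulate_grad(float* grad_slot, float my_contrib) {
  float total = warp_reduce_sum(my_contrib);
  if ((threadIdx.x & 31) == 0)   // lane 0 of the warp
    atomicAdd(grad_slot, total);
}
```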

Note that the backward timing also includes the network backward. Some info from Nsight: cuAngularAEVs takes 4.805 ms and cuAngularAEVs_backward takes 7.564 ms.

cuAngularAEVs
Begins: 1.57992s
Ends: 1.58472s (+4.805 ms)
grid:  <<<19585, 1, 1>>>
block: <<<64, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 4,128 bytes
Registers Per Thread: 60
Local Memory Per Thread: 0 bytes
Local Memory Total: 54,394,880 bytes
Shared Memory executed: 69,632 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 16615
Latency: ←9.781 μs
Correlation ID: 8404
Stream: Default stream (7)
cuAngularAEVs_backward
Begins: 1.61448s
Ends: 1.62205s (+7.564 ms)
grid:  <<<19585, 1, 1>>>
block: <<<64, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 2,560 bytes
Registers Per Thread: 80
Local Memory Per Thread: 0 bytes
Local Memory Total: 55,705,600 bytes
Shared Memory executed: 30,720 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 37.5 %
Launched from thread: 16752
Latency: ←7.508 ms
Correlation ID: 11068
Stream: Default stream (7)

The issue now is that Registers Per Thread: 80 exceeds the 64-register budget, which is what drops the theoretical occupancy of the backward kernel to 37.5 % (versus 50 % for the forward kernel above).
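
For reference, two standard CUDA ways to pull register usage back under a budget (generic techniques, not necessarily what was adopted here): compiling the file with nvcc's -maxrregcount=64 flag (whole translation unit), or annotating the kernel with __launch_bounds__ (per kernel). A hypothetical sketch for the 64-thread block used above:

```cuda
constexpr int kBlock          = 64;  // matches the block: <<<64, 1, 1>>> launch above
constexpr int kMinBlocksPerSM = 16;  // hypothetical target: 64 * 16 = 1024 resident threads/SM

// With a 64K-register file per SM, asking for 16 resident blocks of 64 threads
// caps the budget at 65536 / (64 * 16) = 64 registers per thread; the compiler
// spills anything above that to local memory instead of lowering occupancy.
__global__ void __launch_bounds__(kBlock, kMinBlocksPerSM)
angular_backward_sketch(float* grad) {
  // real arguments and body elided; placeholder write only
  grad[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}
```

The trade-off is extra local-memory traffic from the spills, so whether this is a net win has to be measured against the occupancy gain.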