ccsb-scripps / AutoDock-GPU

AutoDock for GPUs and other accelerators
https://ccsb.scripps.edu/autodock
GNU General Public License v2.0

Speeding up sum reductions in ADADELTA by using Tensor Cores #252

Open L30nardoSV opened 8 months ago

L30nardoSV commented 8 months ago

Hi,

This PR aims to increase the performance of the CUDA version by leveraging the Tensor Core Units (TCUs) present in recent NVIDIA GPUs.

The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.
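
To make the mapping concrete, here is a minimal sketch of the idea (an illustration only, not the PR's actual kernel: the function name wmma_sum256, the 16x16 tile layout, and the shared-memory scratch buffers are assumptions). A 256-element warp-level sum reduction is phrased as two 16x16x16 WMMA products so the arithmetic runs on the tensor cores; the half-precision round trip in the middle is also where accuracy can be lost, which becomes relevant later in this thread.

```cuda
// Illustrative sketch only (not the PR's kernel): a 256-element sum reduction
// expressed as two 16x16x16 WMMA products on the tensor cores.
//   Step 1: ones(16x16) * data(16x16)    -> every row holds the 16 column sums
//   Step 2: partial(16x16) * ones(16x16) -> every element holds the total sum
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// data:  256 half values (one 16x16 row-major tile) in shared memory
// ones:  256 half values, all equal to 1.0, in shared memory
// f_tmp: 256-float shared scratch; h_tmp: 256-half shared scratch
__device__ float wmma_sum256(const half* data, const half* ones,
                             float* f_tmp, half* h_tmp)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    // Step 1: column sums of the data tile, accumulated in FP32.
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, ones, 16);
    wmma::load_matrix_sync(b, data, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(f_tmp, acc, 16, wmma::mem_row_major);

    // Down-convert the partial sums so they can feed the second product.
    // (This half-precision round trip is the main source of accuracy loss.)
    const int lane = threadIdx.x & 31;
    for (int i = lane; i < 256; i += 32)
        h_tmp[i] = __float2half(f_tmp[i]);
    __syncwarp();

    // Step 2: sum the column sums -> the total of all 256 inputs.
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, h_tmp, 16);
    wmma::load_matrix_sync(b, ones, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(f_tmp, acc, 16, wmma::mem_row_major);
    __syncwarp();

    return f_tmp[0];
}
```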

Experiments on an A100 GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test):

| Docking time | Original | Tensor |
| --- | --- | --- |
| In seconds | 0.8 | 0.6 |

Experiments on an RTX3050Ti GPU (make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test):

| Docking time | Original | Tensor |
| --- | --- | --- |
| In seconds | 2.4 | 1.7 |

The baseline implementation for this PR was taken from the paper "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores". The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well:

Schieffer, Gabin, and Peng, Ivy. "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores."
In European Conference on Parallel Processing, pp. 608-622. Cham: Springer Nature Switzerland, 2023.
atillack commented 8 months ago

@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

L30nardoSV commented 8 months ago

@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

OK, let me know if the TENSOR directive in commit 10b07fa6a suffices

atillack commented 8 months ago

@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input:

Docking time of the PR with make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test is 0.70 seconds vs 0.90 seconds (this does use the heuristics and autostop by default).

To evaluate a bit further, I used Diogo's test set of 42 ligands; here are the results:

Reference:

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenCL | 128 | no, 2.5M evals | 105382018 | 36 / 42 good | 36 / 42 good | 91.69 s | 0.17 s |
| Cuda | 128 | no, 2.5M evals | 105331182 | 36 / 42 good | 36 / 42 good | 90.75 s | 0.32 s |
| Cuda | 64 | no, 2.5M evals | 105404961 | 36 / 42 good | 35 / 42 good | 106.24 s | 7.93 s |
| OpenCL | 128 | yes | 84026192 | 37 / 42 good | 37 / 42 good | 187.82 s | 0.21 s |
| Cuda | 128 | yes | 80037847 | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda | 64 | yes | 84684628 | 36 / 42 good | 38 / 42 good | 233.39 s | 8.13 s |

This PR:

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenCL | 128 | no, 2.5M evals | 105362595 | 38 / 42 good | 36 / 42 good | 92.33 s | 0.21 s |
| Cuda | 128 | no, 2.5M evals | 105177642 | 35 / 42 good | 36 / 42 good | 100.71 s | 0.20 s |
| Cuda | 64 | no, 2.5M evals | 105197433 | 35 / 42 good | 38 / 42 good | 112.48 s | 0.19 s |
| OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda | 128 | yes | 71419809 | 33 / 42 good | 37 / 42 good | 182.30 s | 0.21 s |
| Cuda | 64 | yes | 65754981 | 34 / 42 good | 37 / 42 good | 214.60 s | 0.22 s |

For multiple differently sized ligands with the typical settings, it turns out that for larger systems the speedup can turn into a slowdown.

It looks like the average number of evals with AutoStop changed in the PR, which could potentially point to a minute difference in the calculation (I did test multiple times to make sure this wasn't just an unlucky run).

@diogomart Please run your E50 tests for the Cuda version.

atillack commented 8 months ago

@L30nardoSV Thank you for the encapsulation :-)

diogomart commented 8 months ago

Unfortunately, algorithmic performance is worse.

[Figure: 79f13c7-ocl-128wi_vs_PR252-10b07fa-cuda-tensor-128wi-overlap]

L30nardoSV commented 7 months ago

@atillack

Can you please check commit b2ab3fe, which incorporates a WMMA Extension for single-precision matmul on Tensor Cores plus error correction (TCEC)?

make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 TENSOR=ON TCEC=ON test

Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md
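
For context, here is a rough sketch of the error-correction idea behind such TCEC schemes (an illustration only; it is not the wmma_extension API nor this PR's code, and the helper names are made up). Each FP32 value is split into a leading half-precision part plus a half-precision residual, and the extra cross terms recover most of the mantissa bits a plain half-precision MMA would drop; on the tensor cores those terms become additional matrix products summed in the FP32 accumulator.

```cuda
// Conceptual sketch of FP32 "split" error correction for half-precision
// tensor-core math (illustration only; wmma_extension hides this behind its API).
#include <cuda_fp16.h>

// Split a float into a leading half and a half-precision residual.
__device__ void split_f32(float x, half& hi, half& lo)
{
    hi = __float2half(x);                      // keeps roughly 11 mantissa bits
    lo = __float2half(x - __half2float(hi));   // residual captures the lost bits
}

// Error-corrected product of two scalars: (a_hi + a_lo) * (b_hi + b_lo),
// dropping only the tiny a_lo * b_lo term. In the tensor-core version the three
// remaining terms become extra matrix products accumulated in FP32.
__device__ float corrected_mul(float a, float b)
{
    half a_hi, a_lo, b_hi, b_lo;
    split_f32(a, a_hi, a_lo);
    split_f32(b, b_hi, b_lo);
    return __half2float(a_hi) * __half2float(b_hi)
         + __half2float(a_hi) * __half2float(b_lo)
         + __half2float(a_lo) * __half2float(b_hi);
}
```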

atillack commented 7 months ago

@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP, so the last column takes a bit longer, but compute times are unaffected):

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda | 128 | yes | 88164353 | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s |
| Cuda | 64 | yes | 77884078 | 37 / 42 good | 37 / 42 good | 214.27 s | 21.54 s |

atillack commented 7 months ago

While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the total number of evaluations).

L30nardoSV commented 7 months ago

While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup (if you normalize by the total number of evaluations).

Thanks, I look forward to seeing whether the search efficiency is fine at least

diogomart commented 7 months ago

I'll get to this soon

diogomart commented 2 months ago

@L30nardoSV sorry for the long delay. I don't see an improvement, unfortunately. Very similar results to the previous commit.

[Figure: 79f13c7-ocl-128wi_vs_PR252-b2ab3fe-cuda-tensor-128wi-overlap]

L30nardoSV commented 2 months ago

@diogomart thanks! I just want to make sure: did you compile with both TENSOR=ON and TCEC=ON?

diogomart commented 2 months ago

With TENSOR=ON yes, but not with TCEC=ON

L30nardoSV commented 2 months ago

With TENSOR=ON yes, but not with TCEC=ON

Can you please try again with TENSOR=ON and TCEC=ON?

diogomart commented 2 months ago

TCEC fixed it :+1:

[Figure: 79f13c7-ocl-128wi_vs_PR252-b2ab3fe-cuda-tensor-tcec-128wi-overlap]

atillack commented 2 months ago

Thank you @diogomart! Glad the search performance is back to normal. There is no measurable performance benefit though, and it adds a wrinkle between the Cuda and OpenCL code paths. So I am not sure what to do with this PR ...

diogomart commented 2 months ago

I think there's a small improvement over plain CUDA, but OpenCL is still the fastest. (This was run on a mix of different GPUs, so one should mentally blur the plot to interpret it meaningfully.)

[Figure: PR255-7007db8-cuda-128wi-overlap--PR252-b2ab3fe-cuda-tensor-tcec-128wi-overlap runtime]
[Figure: PR255-7007db8-cuda-128wi-overlap--PR255-7007db8-ocl-128wi-overlap runtime]

atillack commented 2 months ago

What is the overall runtime divided by the overall number of evals? (and what is the std.error?)

atillack commented 2 months ago

(and by runtime, I really mean docking time - although with overlap, runtime is probably good enough)

atillack commented 2 months ago

Also, did you run on the same type of GPU?

diogomart commented 2 months ago

No, this was a mix of RTX 5000 and RTX 6000 cards.

atillack commented 2 months ago

Here is the data for the reduced set of 42:

| Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cuda (reference) | 128 | yes | 80037847 | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda (this PR) | 128 | yes | 88164353 | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s |

In other words, I've seen 184,850,000 microseconds / 80,037,847 evals = 2.31 microseconds per eval for the reference vs 2.20 microseconds per eval for this PR, which would be about 5% faster ... except that that's right around what I would estimate AutoStop to fluctuate by ...

I'll run with a fixed number of evals later to get one more data point and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.

L30nardoSV commented 2 months ago

Thanks @diogomart and @atillack

Glad to see that by enabling the error correction code (via TCEC) the docking quality is back to normal.

I'll run with a fixed number of evals later to get one more data point and try to get on one of our newest cards as well to see if newer tensor cores might be beneficial.

I look forward to seeing those numbers :)

diogomart commented 2 months ago

Each marker is a system and aggregates data from 32000 to 8M evals.

[Figure: time_per_eval_tensor_vs_opencl]

atillack commented 2 months ago

@L30nardoSV There is a speedup with increasingly more tensor core-heavy cards and with smaller ligands \o/

Please go ahead and remove the non-TCEC code.

Here is the data using varying combinations of AutoStop and Heuristics (NUMWI=128, TARGETS=86) for an RTX A5000:

| A5000 | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | no & no | 105356778 | 37 / 42 good | 37 / 42 good | 110.19 s | 1.046 | |
| Tensor + TCEC | no & no | 105336788 | 36 / 42 good | 38 / 42 good | 97.01 s | 0.921 | 1.14x |
| Reference | no & yes | 110331676 | 37 / 42 good | 36 / 42 good | 237.74 s | 2.155 | |
| Tensor + TCEC | no & yes | 110389153 | 39 / 42 good | 38 / 42 good | 213.16 s | 1.931 | 1.12x |
| Reference | yes & yes | 81545345 | 36 / 42 good | 37 / 42 good | 203.22 s | 2.492 | |
| Tensor + TCEC | yes & yes | 76813659 | 38 / 42 good | 38 / 42 good | 184.35 s | 2.400 | 1.04x |

From this data it seems that small ligands benefit more than larger ones (the fraction of evals spent on larger ligands increases with Heuristics and AutoStop, compared to every ligand getting 2.5M evals).

Data for an RTX A6000 Ada showing overall similar relative speedup (compiled same as above):

| A6000 Ada | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | no & no | 105393255 | 35 / 42 good | 38 / 42 good | 49.70 s | 0.472 | |
| Tensor + TCEC | no & no | 105405382 | 37 / 42 good | 37 / 42 good | 43.53 s | 0.413 | 1.14x |
| Reference | no & yes | 110421843 | 37 / 42 good | 38 / 42 good | 102.46 s | 0.928 | |
| Tensor + TCEC | no & yes | 110388427 | 39 / 42 good | 36 / 42 good | 94.18 s | 0.853 | 1.09x |
| Reference | yes & yes | 83999741 | 39 / 42 good | 37 / 42 good | 91.50 s | 1.089 | |
| Tensor + TCEC | yes & yes | 78516735 | 39 / 42 good | 37 / 42 good | 81.78 s | 1.042 | 1.05x |

Data for an H100 (compiled same as above, except TARGETS=90):

| H100 | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | no & no | 105332904 | 37 / 42 good | 36 / 42 good | 93.13 s | 0.884 | |
| Tensor + TCEC | no & no | 105406910 | 37 / 42 good | 37 / 42 good | 68.27 s | 0.648 | 1.36x |
| Reference | no & yes | 110399211 | 38 / 42 good | 38 / 42 good | 183.82 s | 1.665 | |
| Tensor + TCEC | no & yes | 110451270 | 39 / 42 good | 38 / 42 good | 152.55 s | 1.381 | 1.21x |
| Reference | yes & yes | 82377612 | 39 / 42 good | 38 / 42 good | 160.04 s | 1.943 | |
| Tensor + TCEC | yes & yes | 81215241 | 37 / 42 good | 37 / 42 good | 133.76 s | 1.647 | 1.18x |

So with a card with very strong tensor cores there is a bit more of a benefit ... Interestingly, for docking the newer Ada-generation A6000 cards are still much better overall.

L30nardoSV commented 2 months ago

@atillack Many thanks for the detailed evaluation!

I will get the non-TCEC code removed in the next few days

L30nardoSV commented 2 months ago

@atillack @diogomart Please test commit https://github.com/ccsb-scripps/AutoDock-GPU/pull/252/commits/162a850952c29260c5ecc2ad556aefe774f48f87

Only TENSOR=ON is required, because now TCEC is enabled by default (non-TCEC code was removed).

atillack commented 2 months ago

@L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks.

There is no free lunch though: theoretical FP32 flops with vs. without tensor cores and the memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)

OpenCL runs of the most recent code (261c91f) run on the same nodes as above:

| GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs PR TCEC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A5000 | no & no | 105421266 | 36 / 42 good | 38 / 42 good | 91.58 s | 0.869 | 1.06x |
| A5000 | no & yes | 110360358 | 39 / 42 good | 38 / 42 good | 212.89 s | 1.929 | 1.00x |
| A5000 | yes & yes | 80855441 | 36 / 42 good | 38 / 42 good | 184.47 s | 2.281 | 1.05x |
| A6000 Ada | no & no | 105393969 | 37 / 42 good | 36 / 42 good | 40.29 s | 0.382 | 1.08x |
| A6000 Ada | no & yes | 110372962 | 38 / 42 good | 39 / 42 good | 91.14 s | 0.826 | 1.03x |
| A6000 Ada | yes & yes | 78186199 | 37 / 42 good | 36 / 42 good | 79.45 s | 1.016 | 1.03x |
| H100 | no & no | 105462969 | 36 / 42 good | 37 / 42 good | 64.67 s | 0.613 | 1.06x |
| H100 | no & yes | 110356319 | 39 / 42 good | 37 / 42 good | 145.83 s | 1.321 | 1.05x |
| H100 | yes & yes | 88132921 | 37 / 42 good | 37 / 42 good | 136.91 s | 1.553 | 1.06x |

atillack commented 2 months ago

@L30nardoSV @diogomart Code is now updated to the current develop branch; here is a quick benchmark on the A6000 Ada showing a speedup (!) of TENSOR=ON over OpenCL of 1-4%.

| GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs OpenCL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A6000 Ada | no & no | 105308281 | 36 / 42 good | 36 / 42 good | 38.77 s | 0.368 | 1.04x |
| A6000 Ada | no & yes | 110376381 | 38 / 42 good | 37 / 42 good | 89.92 s | 0.815 | 1.01x |
| A6000 Ada | yes & yes | 79465317 | 38 / 42 good | 37 / 42 good | 78.25 s | 0.985 | 1.03x |

atillack commented 2 months ago

@diogomart Please rerun verification :-)

atillack commented 2 months ago

@L30nardoSV @diogomart I optimized and cleaned up the WMMA code a bit, added checks to make sure the device we're running on is able to run the tensor-core sum reductions, and now also automatically set the minimum compute capability to 8.0, so make TENSOR=ON is all that's needed now.
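
For reference, a minimal host-side sketch of the kind of capability check described here (an illustration with an assumed helper name, not the PR's actual host code): query the device's compute capability and only take the tensor-core reduction path on compute capability 8.0 or newer.

```cuda
// Illustration only: gate the tensor-core reduction path on compute capability.
#include <cuda_runtime.h>

bool device_supports_tensor_reduction(int device_id)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;                 // be conservative if the query fails
    // TENSOR=ON builds target compute capability 8.0+ (Ampere and newer).
    return prop.major >= 8;
}
```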

From my end that's all the code changes and I'll approve/merge when Diogo's regression check is successful.

L30nardoSV commented 2 months ago

Thank you @atillack. I look forward to seeing the results of @diogomart's check :)

hwcopeland commented 20 hours ago

@L30nardoSV I am in favor of merging this PR as an option to "future-proof" the Cuda branch a bit. There are still a couple of modifications needed - those should be minor, but I have since unified the host code, for example, and made some other small performance tweaks.

There is no free lunch though: theoretical FP32 flops with vs. without tensor cores and the memory bandwidth stay the same; it's just a question of which implementation is more efficient (unsurprisingly, Nvidia's WMMA does a good job). So even with this PR and the tensor cores in use, the Cuda path is currently at best as fast as the OpenCL path (which typically is about 5% faster). There is a chance, though, that some of the changes I made will also speed up this PR ;-)

OpenCL runs of the most recent code (261c91f) run on the same nodes as above:

| GPU | AutoStop & Heuristics | overall evals | energy | rmsd | docking | microseconds per eval | speedup vs PR TCEC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A5000 | no & no | 105421266 | 36 / 42 good | 38 / 42 good | 91.58 s | 0.869 | 1.06x |
| A5000 | no & yes | 110360358 | 39 / 42 good | 38 / 42 good | 212.89 s | 1.929 | 1.00x |
| A5000 | yes & yes | 80855441 | 36 / 42 good | 38 / 42 good | 184.47 s | 2.281 | 1.05x |
| A6000 Ada | no & no | 105393969 | 37 / 42 good | 36 / 42 good | 40.29 s | 0.382 | 1.08x |
| A6000 Ada | no & yes | 110372962 | 38 / 42 good | 39 / 42 good | 91.14 s | 0.826 | 1.03x |
| A6000 Ada | yes & yes | 78186199 | 37 / 42 good | 36 / 42 good | 79.45 s | 1.016 | 1.03x |
| H100 | no & no | 105462969 | 36 / 42 good | 37 / 42 good | 64.67 s | 0.613 | 1.06x |
| H100 | no & yes | 110356319 | 39 / 42 good | 37 / 42 good | 145.83 s | 1.321 | 1.05x |
| H100 | yes & yes | 88132921 | 37 / 42 good | 37 / 42 good | 136.91 s | 1.553 | 1.06x |

Thank you! I've been looking everywhere for numbers like this.