Closed RMeli closed 3 weeks ago
LGTM aka best effort.
(Let's pretend B100/200 has a limited audience ;-)
Sorry to repeat myself (see https://github.com/cp2k/dbcsr/pull/656), with these changes we are ruining the entire idea of autotuning... The last "extensive" autotuning was done for P100 (Daint) and V100. These are about 74100 kernels.
Then @dev-zero and @mkrack did it for A100, with about the same number of kernels.
@mtaillefumier did it for Mi100, but with far fewer kernels (412), and I repeated the same kernel optimization for Mi250.
Now, unless there is something which prevents us from running the autotuning on H100 (or any new generation), I would not consider the option to re-use previous kernels. Fine if you are doing it in CP2K (just like @hfp did, see https://github.com/cp2k/cp2k/pull/3368), but this is not an option for DBCSR.
My current idea is to introduce a "General" kernel and mark with a (*) in the output the kernels that are not autotuned. Still, people should use autotuning and contribute kernels to get the best performance.
Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?
The main issue with tuning for H100 is the following. Running the auto-tuning framework based on the A100 is trivial: it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML predicting framework, where I was not able to finish the procedure.
As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.
The prediction code is rapidly aging, and we have had this issue hanging for quite some time.
> The main issue with tuning for H100 is the following. Running the auto-tuning framework based on the A100 is trivial: it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML predicting framework, where I was not able to finish the procedure.
> As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.
You don't need the ML prediction, that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for A100 kernels.
> Sorry to repeat myself (see https://github.com/cp2k/dbcsr/pull/656), with these changes we are ruining the entire idea of autotuning...
I totally agree with this, and that's why I marked it as a TODO. However, I must have misunderstood our discussion at PASC24.
> Now, unless there is something which prevents us from running the autotuning on H100 (or any new generation), I would not consider the option to re-use previous kernels.
As mentioned at PASC24, we tried to run auto-tuning for GH200 at CSCS, but it is not clear to us who is actually responsible for contributing the kernels and checking that everything is in order. I was under the impression that you were about to get access to H100.
If I recall correctly our discussion at PASC24, you mentioned the following:
Therefore, I was under the impression that the whole auto-tuning pipeline needs attention, and that's why I opened this PR as a temporary workaround (it might still be beneficial to target the proper architecture in the meantime).
> Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?
This is more in line with what we discussed, if I recall correctly, which is why I opened this PR. However, at the moment we don't have numbers directly comparing the two parameter sets.
BTW, src/acc/cuda/Makefile contains the following:
Let's turn this PR into an issue; I'm open to discussion.
For the record, what I said at PASC is that autotuning is old and needs a refresh (2018 was the last major refresh by Shoshana, plus some minor updates by me and @dev-zero), but this is what we have and we are supposed to use it (or drop it); I didn't propose any workaround. The file you are mentioning (https://github.com/cp2k/dbcsr/blob/0f47720dd4d9b2f01eb4e5fb8bed3dc2f7bca928/src/acc/cuda/Makefile#L85-L88) is used for internal testing by @hfp, nothing related to the library itself.
The entire machinery relies on users to provide optimized kernels, as described in the documentation. I likely need to add "user" to the documentation to make this clear, good point.
> You don't need the ML prediction, that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for A100 kernels.
That's not correct: I used the ML prediction to create the 71048 predicted A100 kernels in addition to the 3043 autotuned ones. The scripts required some fixes at the time (summer 2023), but worked on JURECA as they did for the P100. From my experience, I can comment the following:
Thanks @mkrack, this is very nice feedback, and thanks for the clarification (it turns out I did a grep for "predicted" in the wrong file! You are definitely right).
OK, then I think we are coming to the conclusion that we can drop the ML prediction part and likely autotuning altogether (we will keep it for adding new kernels). I think @RMeli and @abussy reached the same conclusion.
Then, the strategy will be to rename the files/parameters to "AMD" and "NVIDIA" and drop the specific GPU version. As I said, I will add a generic kernel which should be good enough for all cases not covered by autotuning.
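To make the proposal concrete, here is a minimal sketch of the fallback logic being discussed: look up autotuned parameters for a block triple, fall back to a generic kernel when none exist, and flag the fallback with a (*) in the output. All names and parameter values below are purely illustrative assumptions, not actual DBCSR identifiers or tuned values.

```python
# Hypothetical sketch of the proposed "generic kernel" fallback.
# Names and values are illustrative only, not real DBCSR parameters.

# Placeholder parameters for the generic (non-autotuned) kernel.
GENERIC_PARAMS = {"tile_m": 2, "tile_n": 2, "threads": 128}

# Autotuned entries keyed by the (m, n, k) block-size triple.
AUTOTUNED = {
    (4, 4, 4): {"tile_m": 1, "tile_n": 1, "threads": 64},  # example entry
}

def select_kernel(m, n, k):
    """Return (params, tuned): tuned=False means the generic fallback."""
    params = AUTOTUNED.get((m, n, k))
    if params is not None:
        return params, True
    return GENERIC_PARAMS, False

def report(m, n, k):
    """Format a log line, marking non-autotuned kernels with (*)."""
    params, tuned = select_kernel(m, n, k)
    marker = "" if tuned else " (*)"
    return f"kernel {m}x{n}x{k}{marker}: {params}"
```

With this scheme, any triple missing from the tuned set still runs (via the generic kernel) while the (*) marker nudges users to autotune and contribute parameters for their block sizes.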
> I didn't propose any workaround.
Yes, apologies for the confusion. The workaround was my interpretation, based also on what is done for CP2K and what I saw in the repository here (out of context).
Thank you everyone for the input. Let's move the discussion to #805.
Tested with Spack.