Closed RMeli closed 3 weeks ago
LGTM aka best effort.
(Let's pretend B100/200 has a limited audience ;-)
Sorry to repeat myself (see https://github.com/cp2k/dbcsr/pull/656), with these changes we are ruining the entire idea of autotuning... The last "extensive" autotuning was done for P100 (Daint) and V100. These are about 74100 kernels.
Then @dev-zero and @mkrack did it for A100, with about the same number of kernels.
@mtaillefumier did it for Mi100, but with far fewer kernels (412), and I repeated the same kernel optimization for Mi250.
Now, unless there is something which prevents us from running the autotuning on H100 (or any new generation), I would not consider the option to re-use previous kernels. Fine if you are doing it in CP2K (just like @hfp did, see https://github.com/cp2k/cp2k/pull/3368), but this is not an option for DBCSR.
My current idea is to introduce a "General" kernel and mark with a (*) in the output the kernels that are not autotuned. Still, people should use autotuning and contribute kernels to get the best performance.
Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?
The main issue with tuning for H100 is the following. Running the auto-tuning framework based on the A100 is trivial: it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML predicting framework, where I was not able to finish the procedure.
As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.
The prediction code is rapidly aging, and we have had this issue hanging for quite some time.
> The main issue with tuning for H100 is the following. Running the auto-tuning framework based on the A100 is trivial: it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML predicting framework, where I was not able to finish the procedure.
> As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.
You don't need the ML prediction, that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for A100 kernels.
> Sorry to repeat myself (see https://github.com/cp2k/dbcsr/pull/656), with these changes we are ruining the entire idea of autotuning...
I totally agree with this, and that's why I marked it as a TODO. However, I must have misunderstood our discussion at PASC24.
> Now, unless there is something which prevents us from running the autotuning on H100 (or any new generation), I would not consider the option to re-use previous kernels.
As mentioned at PASC24, we tried to run auto-tuning for GH200 at CSCS, but it is not clear to us who is actually responsible for contributing the kernels and checking that everything is in order. I was under the impression that you were about to get access to H100.
If I recall correctly our discussion at PASC24, you mentioned the following:
Therefore, I was under the impression that the whole auto-tuning pipeline needs attention, and that's why I opened this PR as a temporary workaround (it might still be beneficial to target the proper architecture in the meantime).
> Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?
This is more in line with what we discussed, if I recall correctly, which is why I opened this PR. However, at the moment we don't have numbers directly comparing the two parameter sets.
BTW, src/acc/cuda/Makefile contains the following:
Let's turn this PR into an issue; I'm open to discussion.
For the record, what I said at PASC is that autotuning is old and needs a refresh (2018 was the last major refresh by Shoshana, plus some minor updates by me and @dev-zero), but this is what we have and we are supposed to use it (or drop it); I didn't propose any workaround. The file you are mentioning (https://github.com/cp2k/dbcsr/blob/0f47720dd4d9b2f01eb4e5fb8bed3dc2f7bca928/src/acc/cuda/Makefile#L85-L88) is used for internal testing by @hfp, nothing related to the library itself.
The entire machinery relies on users to provide optimized kernels, as described in the documentation. I likely need to add "user" to the documentation to make this clear, good point.
> You don't need the ML prediction, that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for A100 kernels.
That's not correct: I used the ML prediction to create the 71048 predicted A100 kernels in addition to the 3043 autotuned ones. The scripts required some fixes at the time (summer 2023), but worked on JURECA as they did for the P100. From my experience, I can comment the following:
Thanks @mkrack, this is very nice feedback, and thanks for the clarification (it turns out I did a grep for "predicted" in the wrong file! You are definitely right).
OK, then I think we are coming to the conclusion that we can drop the ML prediction part and likely autotuning altogether (we will keep it for adding new kernels). I think @RMeli and @abussy reached the same conclusion.
Then, the strategy will be to rename the files/parameters to "AMD" and "NVIDIA" and drop the specific GPU version. As I said, I will add a generic kernel which should be good enough for all cases not covered by autotuning.
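To make the proposal concrete, here is a minimal sketch of the fallback logic being discussed: look up autotuned parameters for a block triple, fall back to a generic kernel when none exist, and flag the fallback with a (*) in the output. All names and parameter values below are purely illustrative assumptions, not actual DBCSR identifiers or tuned values.

```python
# Hypothetical sketch of the proposed "generic kernel" fallback.
# Names and values are illustrative only, not real DBCSR parameters.

# Placeholder parameters for the generic (non-autotuned) kernel.
GENERIC_PARAMS = {"tile_m": 2, "tile_n": 2, "threads": 128}

# Autotuned entries keyed by the (m, n, k) block-size triple.
AUTOTUNED = {
    (4, 4, 4): {"tile_m": 1, "tile_n": 1, "threads": 64},  # example entry
}

def select_kernel(m, n, k):
    """Return (params, tuned): tuned=False means the generic fallback."""
    params = AUTOTUNED.get((m, n, k))
    if params is not None:
        return params, True
    return GENERIC_PARAMS, False

def report(m, n, k):
    """Format a log line, marking non-autotuned kernels with (*)."""
    params, tuned = select_kernel(m, n, k)
    marker = "" if tuned else " (*)"
    return f"kernel {m}x{n}x{k}{marker}: {params}"
```

With this scheme, any triple missing from the tuned set still runs (via the generic kernel) while the (*) marker nudges users to autotune and contribute parameters for their block sizes.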
> I didn't propose any workaround.
Yes, apologies for the confusion. The workaround was my interpretation, based also on what is done for CP2K and what I saw in the repository here (out of context).
Thank you everyone for the input. Let's move the discussion to #805.
Tested with Spack.