NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] How to make Main loop fusion for floats #1602

Open satyabhagavan opened 1 month ago

satyabhagavan commented 1 month ago

What is your question?

The mainloop fusion examples provided in 25_ampere_fprop_mainloop_fusion and 26_ampere_wgrad_mainloop_fusion use half-precision (float16). I want to adapt these examples to work with single-precision (float32). I changed the element types and tile shapes to support floats, but the examples are failing. To understand why, I examined the scale_bias_relu_transform.h file. I believe changes are needed there. Could anyone guide me on how to achieve correctness with floats? Additionally, the activation function used is ReLU. Is it possible to implement the LeakyReLU activation function in the mainloop fusion? If so, how can this be done?

thakkarV commented 3 weeks ago

@hwu36

hwu36 commented 3 weeks ago

True fp32 is not supported by tensor cores; only tf32 can use them. Do you want to convert fp32 to tf32 before the computation?
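If the answer is yes, here is a minimal sketch of that rounding step; round_to_tf32 is an illustrative helper name, while NumericConverter and tfloat32_t are the CUTLASS types that do the work:

#include <cutlass/tfloat32.h>
#include <cutlass/numeric_conversion.h>

// Round an fp32 value to tf32 (8 exponent bits, 10 mantissa bits) so the
// tensor-core path only ever sees values it can represent exactly.
__host__ __device__ inline float round_to_tf32(float x) {
  cutlass::NumericConverter<cutlass::tfloat32_t, float> to_tf32;
  cutlass::tfloat32_t t = to_tf32(x);
  return static_cast<float>(t);
}

Inside the kernel templates themselves, the usual route is to declare the A/B element type as cutlass::tfloat32_t rather than converting by hand.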

Do you want to support fprop or wgrad or anything else?

The inline PTX in scale_bias_relu_transform.h is hard-coded for fp16x2, not for fp32. You don't have to write inline PTX; you can just write plain CUDA. Something like:

// we use a special nan to mark out-of-bound data (0x7eff is the fp16 special nan);
// compare bit patterns, since a NaN never compares equal with == or !=
if (__float_as_uint(input) != __float_as_uint(special_nan)) {
  float res = input > float(0) ? input : input * leaky_alpha;  // LeakyReLU
}
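Expanded into a self-contained helper, and hedged: this is not the actual CUTLASS transform, which operates on whole MMA fragments, and the mapping from fragment element to scale/bias entry (the MmaElements/MmaCols part) still has to be reworked for the tf32 fragment layout. The names scale_bias_leaky_relu, special_nan_bits, and leaky_alpha are illustrative.

#include <cstdint>

// Per-element scale + bias + LeakyReLU for float data. Out-of-bound elements
// are assumed to carry a special NaN bit pattern (an fp32 analogue of the
// 0x7eff fp16 marker) and are zeroed so they contribute nothing downstream.
__device__ inline float scale_bias_leaky_relu(float x, float scale, float bias,
                                              float leaky_alpha,
                                              uint32_t special_nan_bits) {
  if (__float_as_uint(x) == special_nan_bits) {
    return 0.0f;                              // out-of-bound element
  }
  float y = scale * x + bias;                 // per-channel scale and bias
  return (y > 0.0f) ? y : y * leaky_alpha;    // LeakyReLU
}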
satyabhagavan commented 3 weeks ago

@hwu36 I implemented it the same way, defining leaky_alpha as float(0.1), but I am getting NaNs in the output. Do I need to change MmaElements and MmaCols in scale_bias_relu_transform.h when converting to floats?

hwu36 commented 3 weeks ago

Yes. You'd better dump the values of the matrix, bias, and scale first to see whether every thread owns the right data. You can initialize a small matrix with 1, 2, 3, 4, ... to do that.
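A small sketch of that kind of check, assuming the cutlass/util HostTensor helpers; the function name and the printf suggestion are only illustrative:

#include <cutlass/util/host_tensor.h>
#include <cutlass/layout/matrix.h>

// Host-side setup: fill a small activation tensor with 1, 2, 3, ... so every
// value identifies its own position. Inside the device-side transform, a plain
// printf("t%d owns %f\n", threadIdx.x, x); then shows which thread sees what.
void init_debug_activations(
    cutlass::HostTensor<float, cutlass::layout::RowMajor> &activations) {
  int rows = activations.extent().row();
  int cols = activations.extent().column();
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      activations.host_view().at({r, c}) = float(r * cols + c + 1);
    }
  }
  activations.sync_device();  // copy the debug pattern to device memory
}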

Mainloop fusion is the most difficult kind of fusion. If possible, you'd better do the fusion in the epilogue of the previous kernel, which is easier and has better performance.
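For the epilogue route, a hedged sketch of the producing GEMM: CUTLASS ships a LinearCombinationLeakyRelu epilogue functor (cutlass/epilogue/thread/linear_combination_leaky_relu.h) that applies the leaky slope after the usual alpha/beta linear combination, so only the activation part is shown here; the per-channel scale and bias of examples 25/26 would additionally need an epilogue that reads a per-channel source. The tile shapes, layouts, and the tf32/SM80 choices below are placeholders.

#include <cutlass/gemm/device/gemm.h>
#include <cutlass/epilogue/thread/linear_combination_leaky_relu.h>

// LeakyReLU fused into the epilogue of the kernel that produces the tensor,
// instead of into the consumer's mainloop.
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationLeakyRelu<
    float,                                     // output element
    128 / cutlass::sizeof_bits<float>::value,  // elements per vectorized access
    float,                                     // accumulator element
    float>;                                    // compute element

using Gemm = cutlass::gemm::device::Gemm<
    cutlass::tfloat32_t, cutlass::layout::RowMajor,     // A (tf32 -> tensor cores)
    cutlass::tfloat32_t, cutlass::layout::ColumnMajor,  // B
    float, cutlass::layout::RowMajor,                   // C / D
    float,                                               // accumulator
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,              // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,                // warp tile
    cutlass::gemm::GemmShape<16, 8, 8>,                  // tf32 tensor-core instruction
    EpilogueOp>;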