Open satyabhagavan opened 1 month ago
@hwu36
Tensor cores do not support true fp32; only tf32 can use them. Do you want to convert fp32 to tf32 before the computation?
Do you want to support fprop or wgrad or anything else?
The inline PTX in scale_bias_relu_transform.h is hard-coded for fp16x2, not for fp32. You don't have to write inline PTX; plain CUDA is enough. Something like:
float res = float(0); // out-of-bound elements become zero
if (!is_special_nan(input)) { // we use a special NaN to mark out-of-bound data; compare bit patterns, since NaN compares unequal to everything. We use 0x7eff for the fp16 special NaN.
  res = input > float(0) ? input : input * leaky_alpha;
}
@hwu36 I implemented it the same way, defining leaky_alpha as float(0.1), but I am getting NaNs at the output. Do I need to change 'MmaElements' and 'MmaCols' in the scale_bias_relu_transform.h file when converting to floats?
Yes. You'd better dump the values of the matrix, bias, and scale first to see whether every thread owns the right data. You can initialize a small matrix with 1, 2, 3, 4, ... to do that.
Mainloop fusion is the most difficult one. If possible, you'd better do the fusion in the epilogue of the previous kernel, which is easier and performs better.
What is your question?
The mainloop fusion examples provided in 25_ampere_fprop_mainloop_fusion and 26_ampere_wgrad_mainloop_fusion use half-precision (float16). I want to adapt these examples to work with single-precision (float32). I changed the element types and tile shapes to support floats, but the examples are failing. To understand why, I examined the scale_bias_relu_transform.h file. I believe changes are needed there. Could anyone guide me on how to achieve correctness with floats? Additionally, the activation function used is ReLU. Is it possible to implement the LeakyReLU activation function in the mainloop fusion? If so, how can this be done?