Closed stratika closed 1 month ago
It is not clear to me why we need to narrow the type from FP64 to FP32 if the operation selected is `double`. The PTX instruction set supports FP64. Can you expand on this?
Yes of course, the operations `cos` and `sin` in PTX support only FP32. This is something that we already have in develop when we run `cos` and `sin` operations with `double` types. However, the `cospi` and `sinpi` operations were escaping and producing compilation problems regarding mismatched types of the operands.
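To make the effect of the narrowing concrete, here is a minimal sketch in plain Java that emulates it on the host and compares the result with a full double-precision evaluation. The class and method names are hypothetical, and the point at which the cast is applied (after scaling by pi) is an assumption for illustration; it is not taken from the generated PTX code.

```java
public class NarrowingSketch {

    // Reference path: stays in double precision end to end.
    public static double sinPiDouble(double x) {
        return Math.sin(Math.PI * x);
    }

    // Emulated narrowed path (assumption): the operand is cast from double to
    // float (f64 -> f32), the sine is evaluated, and the result is rounded to
    // float before being widened back to double.
    public static double sinPiNarrowed(double x) {
        float narrowedOperand = (float) (Math.PI * x);
        float narrowedResult = (float) Math.sin(narrowedOperand);
        return narrowedResult;
    }

    public static void main(String[] args) {
        double x = 0.3;
        double reference = sinPiDouble(x);
        double narrowed = sinPiNarrowed(x);
        System.out.println("double:    " + reference);
        System.out.println("narrowed:  " + narrowed);
        System.out.println("abs error: " + Math.abs(reference - narrowed));
    }
}
```

For inputs of this magnitude the difference should stay around single-precision rounding error, which is the accuracy a caller can expect from the narrowed operations.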
Thanks @stratika for the clarification. I saw how the NVIDIA compiler handles this when using OpenCL. It seems they use an intermediate table of precomputed values called `__cudart_sin_cos_coeffs`. Let's continue with the narrow option and investigate this type of optimization in the future.
As a side note, can we document somewhere that we narrow these operations in the context of NVIDIA PTX?
Do we have such documentation for the other narrowed operations? Any suggestions? Adding a Javadoc?
I think the best way is to add Javadoc in the TornadoMath API for those operations.
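As a rough idea of what such a note could look like, here is a sketch of a Javadoc comment on a stand-in method. The class `MathApiDocSketch` and its `cospi` method are hypothetical and only mirror the shape of a math API; the wording of the note and the host-side body are assumptions, not the actual TornadoMath source.

```java
/**
 * Hypothetical stand-in for a math API class; not the actual TornadoMath source.
 */
public final class MathApiDocSketch {

    private MathApiDocSketch() {
    }

    /**
     * Returns the cosine of {@code Math.PI * x}.
     *
     * <p><b>Precision note (NVIDIA PTX):</b> the PTX {@code cos} instruction
     * supports only {@code f32}, so on the PTX backend the operand is narrowed
     * from {@code f64} to {@code f32} before the instruction is issued. Results
     * may therefore carry only single-precision accuracy even though the
     * signature uses {@code double}.
     *
     * @param x the input value, scaled by pi before the cosine is taken
     * @return the cosine of {@code Math.PI * x}
     */
    public static double cospi(double x) {
        // Host-side reference behaviour in full double precision.
        return Math.cos(Math.PI * x);
    }
}
```

Keeping the note next to the method signature means it shows up in IDE tooltips and generated API docs, where users of the `double` overloads are most likely to see it.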
Description
This PR fixes some failing unit tests for `PTX`.
Problem description
Here is the reason each fixed test was failing:
- `testTornadoMathCosPIDouble` and `testTornadoMathSinPIDouble`: the `cos` and `sin` instructions support only `f32` precision. So, we had to apply casting from `f64` to `f32` for those operators. For reference, see here.
- `testCopyInWithDevice`: This test asserts whether the copy-in data transfer timers differ depending on whether the task graph is executed with the `withDevice` function or not. This stems from a problem observed when the `withDevice` function was used: an array defined with the data transfer mode `FIRST_EXECUTION` behaved as if it had been defined with `EVERY_EXECUTION`. Thus, we assess whether the aggregated time is significantly higher. I increased the data size and the number of iterations to make the case easier to track, and increased the delta for both metrics to be within a range of 25%.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?