Fix failing unit-tests for PTX in Jenkins

stratika commented 1 month ago

Description

This PR fixes some failing unit-tests for PTX.

Problem description

The reason each fixed test was failing is described below:

testTornadoMathCosPIDouble and testTornadoMathSinPIDouble: the PTX cos and sin instructions support only f32 precision, so we had to cast the operands from f64 to f32 for those operators. For reference, see here.
testCopyInWithDevice: This test asserts whether the copy-in data transfer timers differ depending on whether the task graph is executed with the withDevice function. It targets a problem observed when withDevice was used: an array defined with DataTransfer mode FIRST_EXECUTION behaved as if it had been defined with EVERY_EXECUTION. The test therefore checks whether the aggregated time is significantly higher. I increased the data size and the number of iterations to make the case easier to track, and increased the delta for both metrics to allow a range of 25%.
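The 25% tolerance described for testCopyInWithDevice can be pictured with a small relative-delta check. This is only a sketch in plain Java: the class name, the helper `withinRelativeDelta`, and the timer values are hypothetical and are not taken from the TornadoVM test suite.

```java
public class RelativeDeltaCheck {

    // Returns true when 'measured' deviates from 'reference' by at most
    // 'fraction' of the reference (for example 0.25 for a 25% range).
    public static boolean withinRelativeDelta(long reference, long measured, double fraction) {
        double allowed = Math.abs(reference) * fraction;
        return Math.abs(measured - reference) <= allowed;
    }

    public static void main(String[] args) {
        // Hypothetical copy-in timers in nanoseconds (illustrative, not measured).
        long copyInWithoutDevice = 1_000_000L;
        long copyInWithDevice = 1_180_000L;      // similar cost: FIRST_EXECUTION respected
        long copyInEveryExecution = 10_000_000L; // much higher: data copied on every run

        System.out.println(withinRelativeDelta(copyInWithoutDevice, copyInWithDevice, 0.25));
        System.out.println(withinRelativeDelta(copyInWithoutDevice, copyInEveryExecution, 0.25));
    }
}
```

With these made-up numbers the first check passes and the second fails, which is the kind of gap the test looks for when FIRST_EXECUTION silently behaves like EVERY_EXECUTION.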

Backend/s tested

Mark the backends affected by this PR.

[ ] OpenCL
[x] PTX
[ ] SPIRV

OS tested

Mark the OS where this PR is tested.

[x] Linux
[ ] OSx
[ ] Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

[ ] Yes
[x] No

How to test the new patch?

make BACKEND=ptx
make tests

jjfumero commented 1 month ago

It is not clear to me why we need to narrow the type from FP64 to FP32 if the operation selected is double. The PTX instruction set supports FP64. Can you expand on this?

stratika commented 1 month ago

> It is not clear to me why we need to narrow the type from FP64 to FP32 if the operation selected is double. The PTX instruction set supports FP64. Can you expand on this?

Yes, of course. The cos and sin operations in PTX support only FP32.

We already do this in develop when cos and sin operations run with double types. However, the cospi and sinpi operations were escaping this handling and producing compilation problems due to mismatched operand types.
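To give a feel for what the narrowing costs in precision, here is a minimal plain-Java sketch that emulates it on the host. It is not the code the PTX backend emits: the class and method names are hypothetical, and `Math.sin` stands in for the device instruction.

```java
public class NarrowedSinPi {

    // Reference: sin(pi * x) evaluated entirely in double precision.
    public static double sinPiDouble(double x) {
        return Math.sin(Math.PI * x);
    }

    // Emulated narrowing: the operand is cast to float, the sine result is
    // rounded to float, and the value is widened back to double.
    public static double sinPiNarrowed(double x) {
        float operand = (float) (Math.PI * x);
        float result = (float) Math.sin(operand);
        return (double) result;
    }

    public static void main(String[] args) {
        double x = 0.25;
        double reference = sinPiDouble(x);
        double narrowed = sinPiNarrowed(x);
        System.out.println("double:   " + reference);
        System.out.println("narrowed: " + narrowed);
        System.out.println("abs diff: " + Math.abs(reference - narrowed));
    }
}
```

The difference stays around single-precision rounding error for small operands, so tests that compare against a double-precision reference need a tolerance sized for FP32.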

jjfumero commented 1 month ago

Thanks @stratika for the clarification. I looked at how the NVIDIA compiler handles this when using OpenCL. It seems to use an intermediate table of precomputed values called __cudart_sin_cos_coeffs. Let's continue with the narrowing option and investigate this type of optimization in the future.

jjfumero commented 1 month ago

As a side note, can we document somewhere that we narrow these operations in the context of NVIDIA PTX?

stratika commented 1 month ago

> As a side note, can we document somewhere that we narrow these operations in the context of NVIDIA PTX?

Do we have such documentation for the other narrowed operations? Any suggestions? Adding a Javadoc?

jjfumero commented 1 month ago

I think the best way is to add Javadoc in the TornadoMath API for those operations.
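A Javadoc note along these lines might look like the sketch below. The class `TornadoMathDocSketch` is a hypothetical stand-in, not the real TornadoMath source, and the method bodies are placeholder host implementations; only the comment wording is the point.

```java
public class TornadoMathDocSketch {

    /**
     * Computes {@code sin(pi * x)}.
     *
     * <p>Note for the NVIDIA PTX backend: the PTX {@code sin} instruction
     * supports only FP32, so the double operand is narrowed to float before
     * the instruction is issued. Results may therefore carry only
     * single-precision accuracy on that backend.
     *
     * @param x input value
     * @return the sine of {@code pi * x}
     */
    public static double sinpi(double x) {
        // Placeholder host implementation (assumption, for illustration only).
        return Math.sin(Math.PI * x);
    }

    /**
     * Computes {@code cos(pi * x)}.
     *
     * <p>Note for the NVIDIA PTX backend: the PTX {@code cos} instruction
     * supports only FP32, so the double operand is narrowed to float before
     * the instruction is issued.
     *
     * @param x input value
     * @return the cosine of {@code pi * x}
     */
    public static double cospi(double x) {
        // Placeholder host implementation (assumption, for illustration only).
        return Math.cos(Math.PI * x);
    }
}
```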

beehive-lab / TornadoVM