This PR fixes most of the tests that fail on AMD and NVIDIA GPUs when using the DEFAULT configuration.
It fixes all of them on AMD; on NVIDIA, only the trsm operator remains to be fixed.
In particular, it fixes:
iamax
iamin
trsv
tbsv
tpsv
iamax/iamin:
The sycl::shift_group_left API requires every work-item in the group (sub_group) to take part in the operation, so removing the if-condition that guarded the call solves the problem.
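As an illustration of the iamax/iamin point, here is a minimal host-side model in plain C++ (not SYCL code, and not the code in this PR). `shift_left_model` and `iamax_model` are hypothetical names: the first mimics a sub-group shift-left by handing lane i the item held by lane i + delta, and the second runs an iamax-style reduction in which every lane takes part in the shift at every step. Only the use of the shifted value is conditional.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One lane's candidate: the index of an element and its absolute value.
struct IndexValue {
  std::size_t index;
  double value;
};

// Hypothetical host model of a sub-group shift-left: lane i receives the item
// held by lane i + delta. Lanes whose source falls outside the sub-group keep
// their own item in this model; in SYCL that result is unspecified.
std::vector<IndexValue> shift_left_model(const std::vector<IndexValue>& lanes,
                                         std::size_t delta) {
  std::vector<IndexValue> out(lanes);
  for (std::size_t i = 0; i + delta < lanes.size(); ++i) {
    out[i] = lanes[i + delta];
  }
  return out;
}

// iamax-style reduction over one modelled sub-group, one element per lane.
// Returns the index of the first element with the largest absolute value.
std::size_t iamax_model(const std::vector<double>& x) {
  assert(!x.empty());
  std::vector<IndexValue> lanes;
  for (std::size_t i = 0; i < x.size(); ++i) {
    lanes.push_back({i, std::fabs(x[i])});
  }
  for (std::size_t delta = 1; delta < lanes.size(); delta *= 2) {
    // The shift is performed for all lanes, with no guard around the call.
    const std::vector<IndexValue> shifted = shift_left_model(lanes, delta);
    for (std::size_t i = 0; i < lanes.size(); ++i) {
      // Only the use of the shifted value is conditional. The strict '>'
      // keeps the lower index on ties, because shifted items always come
      // from higher lanes.
      if (i + delta < lanes.size() && shifted[i].value > lanes[i].value) {
        lanes[i] = shifted[i];
      }
    }
  }
  return lanes[0].index;
}
```

In a real kernel the same shape applies: call the shift unconditionally, then decide per work-item whether to keep the received value.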
txsv operators (trsv, tbsv, tpsv):
The broadcast operations inside the kernel require specific group and sub-group sizes, so calling the kernel implementation from the default configuration with a single fixed set of sizes is not enough, because these sizes differ between hardware. This solution uses runtime checks to select the correct template parameters. As a result, more kernels are compiled than before, but in my tests this does not significantly affect compilation time.
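The runtime selection described above can be sketched as follows. This is a simplified illustration in plain C++, not the code in this PR: `launch_txsv_sketch` and `dispatch_txsv_sketch` are hypothetical names, and the supported sizes and the 4x work-group factor are assumptions chosen for the example. In SYCL, the supported sub-group sizes could be obtained from the device with `sycl::info::device::sub_group_sizes`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for a txsv kernel launcher. The template parameters
// mirror the compile-time sizes that the in-kernel broadcast relies on.
template <std::size_t SubgroupSize, std::size_t WorkgroupSize>
std::size_t launch_txsv_sketch() {
  static_assert(WorkgroupSize % SubgroupSize == 0,
                "a work-group must hold a whole number of sub-groups");
  // A real launcher would submit the kernel here. This sketch only reports
  // which sub-group size the selected instantiation was compiled for.
  return SubgroupSize;
}

// Runtime check: choose the instantiation that matches the sub-group size
// reported by the device. Each case is a separately compiled kernel, which
// is why this approach compiles more kernels than a single fixed choice.
inline std::size_t dispatch_txsv_sketch(std::size_t device_subgroup_size) {
  switch (device_subgroup_size) {
    case 16:
      return launch_txsv_sketch<16, 64>();
    case 32:
      return launch_txsv_sketch<32, 128>();
    case 64:
      return launch_txsv_sketch<64, 256>();
    default:
      // Sizes outside the assumed set are rejected in this sketch.
      throw std::invalid_argument("unsupported sub-group size");
  }
}
```

For example, `dispatch_txsv_sketch(32)` selects the `<32, 128>` instantiation and `dispatch_txsv_sketch(64)` selects `<64, 256>`. The cost of this design is one extra compiled kernel per supported size.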