IntelPython / dpctl

Python SYCL bindings and SYCL-based Python Array API library
https://intelpython.github.io/dpctl/
Apache License 2.0

Tweak scheduling parameter of elementwise operations #1651

Closed: oleksandr-pavlyk closed this 2 months ago

oleksandr-pavlyk commented 2 months ago

For contiguous inputs, bump local-work-group size from 64 to 128 work-items.
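
To illustrate what changing the local work-group size means for the launch shape, here is a minimal sketch of the usual ceiling-division arithmetic for a 1-D contiguous launch. It is not copied from the dpctl sources: the function name `contig_launch_config`, the array size, and the reading of the `4u, 2u` template arguments in the trace as `vec_sz` and `n_vecs` (elements per work-item = `vec_sz * n_vecs`) are assumptions made for illustration.

```python
def contig_launch_config(nelems, lws=128, vec_sz=4, n_vecs=2):
    """Return (n_groups, global_size) for a 1-D contiguous elementwise launch.

    Assumption: each work-item processes vec_sz * n_vecs elements, so one
    work-group of `lws` work-items covers lws * vec_sz * n_vecs elements.
    """
    elems_per_group = lws * vec_sz * n_vecs
    # Ceiling division: enough work-groups to cover every element.
    n_groups = (nelems + elems_per_group - 1) // elems_per_group
    return n_groups, n_groups * lws


nelems = 16_000_000  # hypothetical array size, not stated in the PR
for lws in (64, 128):
    n_groups, gws = contig_launch_config(nelems, lws=lws)
    print(f"lws={lws}: {n_groups} work-groups, {gws} work-items in total")
```

Under these assumptions, doubling the work-group size halves the number of work-groups while the total number of work-items stays the same. For the hypothetical 16,000,000-element input, 128 work-items per group gives 15625 work-groups, which would agree with the `{15625; 1; 1} {128; 1; 1}` shape in the trace if the first tuple is read as a work-group count.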

This change is guided by a performance study of a Newton root-finding example that is rich in elementwise operations.

With this change, unitrace reports that 311 invocations of the kernel took 2805824666 ns in total. Before the change, with 64 work-items, the total was 3475091844 ns, so the new setting takes about 19% less time.
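
The relative gain can be checked directly from the two unitrace totals quoted above. This is a small sketch, assuming the 64-work-item run also made 311 invocations (the PR states the count only for the new configuration); the variable names are illustrative.

```python
calls = 311  # invocation count reported for the 128-work-item run
total_ns_lws128 = 2_805_824_666  # total kernel time with 128 work-items
total_ns_lws64 = 3_475_091_844   # total kernel time with 64 work-items

# Ratio of old to new total time, and the relative reduction in percent.
speedup = total_ns_lws64 / total_ns_lws128
reduction_pct = 100.0 * (total_ns_lws64 - total_ns_lws128) / total_ns_lws64

# Average time per invocation in milliseconds.
avg_ms_lws128 = total_ns_lws128 / calls / 1e6
# Assumption: the 64-work-item run had the same number of invocations.
avg_ms_lws64 = total_ns_lws64 / calls / 1e6

print(f"speedup: {speedup:.3f}x, time reduction: {reduction_pct:.1f}%")
print(f"average per call: {avg_ms_lws64:.2f} ms -> {avg_ms_lws128:.2f} ms")
```

The average for the 128-work-item run computed this way agrees with the per-call average in the trace line (9021944 ns).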

"dpctl::tensor::kernels::multiply::multiply_inplace_contig_kernel<std::complex<float>, std::complex<float>, 4u, 2u>[SIMD16 {15625; 1; 1} {128; 1; 1}]", 311, 2805824666, 25.380501, 9021944, 8915416, 9143437

github-actions[bot] commented 2 months ago

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. :crossed_fingers:

coveralls commented 2 months ago

Coverage remained the same at 88.21% when pulling a9062030965dd881190a2f83a2307b911e870917 on increase-lws-for-elementwise-operations into 7757857466c2fcfb92e8f8e1ed38e90b35c42327 on master.

github-actions[bot] commented 2 months ago

Array API standard conformance tests for dpctl=0.17.0dev0=py310h15de555_302 ran successfully. Passed: 870 Failed: 8 Skipped: 92