acts-project / vecmem

Vectorised data model base and helper classes.
https://acts-project.github.io/vecmem/
Mozilla Public License 2.0
19 stars 13 forks source link

Copy Test(s) Rewrite, main branch (2024.02.21.) #268

Closed krasznaa closed 9 months ago

krasznaa commented 9 months ago

While debugging the issue that lead to #264, it became clear that the unit testing of the different vecmem::copy implementations was far from perfect in the project. So I set out to re-write the whole thing.

Previously only vecmem::copy was tested to a reasonable level, as part of vecmem_test_core. Now I turned all of those same tests, with some modifications, into "parametized" tests. Such that they could be executed conveniently on the rest of the "copy implementations" (vecmem::cuda::copy, vecmem::sycl::async_copy, etc.) as well, with various "memory type" combinations.

This turned out to be fairly useful. It highlighted yet another issue in our code, which I now fixed in copy.ipp.

Note that test_cuda_copy.cpp came out a bit more complicated than the rest of them. :frowning: Unfortunately one cannot declare a "global" variable like:

static vecmem::cuda::stream_wrapper foo;

At the end of the application this foo object would be destructed after the CUDA runtime library would've already "shut itself off". Leading to a crash/exception at the very end of the test. To avoid this, I went with:

https://google.github.io/googletest/advanced.html#global-set-up-and-tear-down

Since for parametrized tests one can't easily put the used objects into a scope. :thinking:

Note that in future PRs this technique should allow us to extend for instance tests/cuda/test_cuda_edm_copy.cpp to exercise vecmem::cuda::async_copy as well. :thinking: (I didn't want to make this PR even larger with that change...)

krasznaa commented 9 months ago

Unfortunately additional testing on a discreet Intel GPU with both its Level-0 and OpenCL drivers is producing a whole bunch of failures from the tests. 😦 For some reason vecmem::sycl::async_copy is just not working correctly with that backend. 😕

Continuing with the debugging for now...

krasznaa commented 9 months ago

Turns out, it was our code that was working incorrectly all along. 😦 Had to introduce a bunch more explicit synchronization points to make vecmem::sycl::async_copy work reliably. For things that SYCL can do asynchronously on an Intel backend, but not on NVIDIA and AMD ones. 🤔

In the process I also found one more type of issue with our current tests on Intel GPUs. But that will be fixed in a separate PR.