cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.29k forks source link

alpaka-related updates #41340

Open fwyzard opened 1 year ago

fwyzard commented 1 year ago

I'm collecting some "to do" items to improve the alpaka-related developments:

fwyzard commented 1 year ago

assign heterogeneous

cmsbuild commented 1 year ago

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 1 year ago

https://github.com/cms-sw/cmssw/pull/41341 should address the second bullet, at least in part. To be checked if there are any other leftovers.

makortel commented 1 year ago

On the first point I wonder if we could (at least eventually) utilize scram itself to avoid running the tests in the first place that would fail because of missing hardware? @smuzaffar

fwyzard commented 1 year ago

Interesting idea.

For a CUDA-only test we could do something like this:

<bin name="cudaTest" file="cudaTest.cu">
  <use name="cuda"/>
  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./cudaTest || echo 'Failed to initialise the CUDA runtime, the test will be skipped.'"/>
</bin>

For an Alpaka-based test there are two problems:

The first problem could be solved if SCRAM could provide an environment variable that expands to the name of the binary (or to its full path) that can be used inside TEST_RUNNER_CMD, something like

  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the CUDA runtime, the test $TEST_BINARY_NAME will be skipped.'"/>

The second problem could be solved is SCRAM could provide an environment variable with the name of the Alpaka backend being tested, like SerialSync or ROCmAsync, something like

  <flags TEST_RUNNER_CMD="alpakaIsEnabled$TEST_ALPAKA_BACKEND &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the $TEST_ALPAKA_BACKEND backend, the test $TEST_BINARY_NAME will be skipped.'"/>

Of course, it would be even nicer if SCRAM could automate this with a single flag, something like

  <flags TEST_REQUIRE_ALPAKA_BACKEND/>

and report these tests as Skip in the console, instead of Pass/Fail.

fwyzard commented 1 year ago

And to clarify: alpakaIsEnabled$TEST_ALPAKA_BACKEND does not exist yet, but it's relatively trivial to write; I'll see if I can make a PR for it.

smuzaffar commented 1 year ago

@fwyzard, I was also thinking along the lines of using cudaIsEnabled and rocmIsEnabled or better alpakaIsEnabled$TEST_ALPAKA_BACKEND or alpakaIsEnabled rocm|cuda|serial|tbb|.... We can change build rules so that unit test is run only if it explicitly depend on

SCRAM should run alpakaIsEnabled$TEST_ALPAKA_BACKEND for each supported backend once ( before the start of tests) and reuse the results.

@aandvalenzuela is interested in improving the GPU/Alpaka tests. if the above sounds good then she can already look in to implement it.

fwyzard commented 1 year ago

Looks good to me.

I would suggest alpakaIsEnabled$TEST_ALPAKA_BACKEND rather than alpakaIsEnabled rocm|cuda|serial|tbb|..., because it is much simpler to make the back-end choice at compile time (and so end up with a different binary for each back-end) than at runtime (to support the choice based on the command line argument).

We have actually done the latter for pixeltrack-standalone, but not in CMSSW where we are relying on the plug-in system.

smuzaffar commented 1 year ago

@makortel @fwyzard , https://github.com/cms-sw/cmssw-config/pull/95 implements the new unit test rules for cuda/rocm/alpaka backends. See https://github.com/cms-sw/cmssw-config/pull/95#issuecomment-1513484911 and let me know if this is sufficient?

fwyzard commented 1 year ago

@smuzaffar from the description it looks good.

Since the change impacts both "cuda" and "alpaka-cuda" tests, maybe it would actually be better to check cudaIsEnabled (and rocmIsEnabled for ROCm) instead of introducing an alpaka-specific check ?

makortel commented 1 year ago

@smuzaffar From the descriptions it looks good to me too. My only question is about tests that run cmsRun (e.g. directly or via a shell script). From https://github.com/cms-sw/cmssw-config/pull/95 I understand the behavior is driven by the dependencies of the test. Is the <iftest name="cuda"> (from https://github.com/cms-sw/cmssw/issues/41340#issuecomment-1508374403) the way to control the execution of those tests?

smuzaffar commented 1 year ago

@makortel , <iftest name="cuda"> is not needed any more. Execution of the test is driven by the test dependency itself. All tests with direct cuda deps (i.e. they have either <use name="cuda"> or alpaka-cuda backend enabled) will be skipped if cudaIsEnabled does not run successfully. Same for rocm based tests.

For dedicated GPU IBs, we can set USER_UNIT_TESTS=cuda which will force run only unit tests with cuda or alpaka-cuda dependency (all other tests will be skipped). This will also allow us to run GPU unit tests for PRs :-)

makortel commented 1 year ago

@smuzaffar Thanks, but I'm still confused :) Let's take a concrete example https://github.com/cms-sw/cmssw/blob/5b979f23402cd1811d30eef9163d008c2b37fcd4/HeterogeneousCore/AlpakaTest/test/BuildFile.xml#L6-L10

Here the goal is to run the test script in one way when cuda tests should be run (i.e. either the machine has a GPU or always in the GPU IBs), and in other way when there are no GPUs. The script internally calls cudaIsEnabled for the cpu option to enable the CUDA code path outside of GPU IBs if the machine has a GPU.

Should this be expressed along

<test name="testHeterogeneousCoreAlpakaTestModulesCUDA" command="testAlpakaModules.sh cuda">
  <use name="cuda"/>
</test>
<test name="testHeterogeneousCoreAlpakaTestModulesCPU" command="testAlpakaModules.sh cpu"/>

then? (and probably making the cpu option to not do anything if the machine has a GPU)

smuzaffar commented 1 year ago

yes @makortel that fregment of BuildFile should be change as you suggested.

testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via command-line.

fwyzard commented 1 year ago

testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via command-line.

... and adding a ROCm version.

smuzaffar commented 1 year ago

By the way, new build rules, to make use of USER_UNIT_TESTS=cuda to run cuda specific tests , are in 13.0.X and 13.1.X IBs.

makortel commented 7 months ago

@fwyzard This issue has been completed, right?

fwyzard commented 7 months ago

I don't remember what it was about... I'll have to re-read the thread and remind me of the details.

fwyzard commented 7 months ago

Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.

For example:

$ ./deviceVertexFinderByDensity_tROCmAsync

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
   0x7f06884fb731 <waitpid+33>: ja     0x7f06884fb788 <waitpid+120>
   0x7f06884fb733 <waitpid+35>: ret    
   0x7f06884fb734 <waitpid+36>: nop    DWORD PTR [rax+0x0]
   0x7f06884fb738 <waitpid+40>: push   r12
   0x7f06884fb73a <waitpid+42>: mov    r12d,edx
   0x7f06884fb73d <waitpid+45>: push   rbp
   0x7f06884fb73e <waitpid+46>: mov    rbp,rsi
#0  0x00007f06884fb72b in waitpid () from /lib64/libc.so.6
#1  0x00007f068845ccf7 in do_system () from /lib64/libc.so.6
#2  0x00007f068fd1881d in TUnixSystem::StackTrace() () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#3  0x00007f068fd181d4 in TUnixSystem::DispatchSignals(ESignals) () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#4  <signal handler called>
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================

The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
preferably using the command (.forum bug) in the ROOT prompt.
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs or (preferably) using the command (.gh bug) in
the ROOT prompt. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================
makortel commented 7 months ago

Playing around with scram b runtests_<test_name> I think having scram b runtests_deviceVertexFinderByDensity_t to run (or skip) the tests of all the backends of the test could be useful as well, rather than having to explicitly call scram b runtests_deviceVertexFinderByDensity_tROCmAsync.

fwyzard commented 7 months ago

Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.

This is actually as simple as

  if (cms::alpakatools::devices<Platform>().empty()) {
    std::cout << "No devices found, the test will be skipped.\n";
    exit(EXIT_SUCCESS);
  }

@makortel, do you think we need to wrap it in a function like cms::alpakatools::requireDevices<Platform>(), or is it simple enough that people can use those four lines directly ?

fwyzard commented 7 months ago

I've updated an example in #44036 .

makortel commented 7 months ago

This is actually as simple as

  if (cms::alpakatools::devices<Platform>().empty()) {
    std::cout << "No devices found, the test will be skipped.\n";
    exit(EXIT_SUCCESS);
  }

This is similar to what we had at one point (and still have with direct CUDA). My concern for this specific approach is that an infrastructure problem, e.g. in GPU IBs, would go unnoticed because the tests report success (by default scram uses exit code of cudaIsEnabled to decide where to run or skip cuda-dependent tests, but in GPU IB tests the USER_UNIT_TESTS=cuda is set to require all the GPU tests to be run).

If the test would return EXIT_FAILURE in this case (i.e. the test still fails, but with a meaningful message), I'd be fine with it.

I don't have strong feelings whether to abstract or copy-paste. Four lines isn't much, but it's still something. On the other hand, e.g. in Catch2-based tests one could do

  REQUIRE(not cms::alpakatools::devices<Platform>().empty());

which I wouldn't consider to be worth of abstracting.

fwyzard commented 7 months ago

Sure, I'm OK with returning EXIT_FAILURE.

In fact I coded it that way, then I switched to EXIT_SUCCESS because that's what we used in CUDA.

fwyzard commented 7 months ago

(will fix tomorrow)

fwyzard commented 7 months ago

For Catch2, currently we have a message at the beginning of the output coming from a cout:

image (original)

I'd like to move it to use a Catch2 macro like FAIL or INFO. Which one do you like better:

image (option 1)

or

image (option2)

or the original one ?

fwyzard commented 7 months ago

An other option, using WARN:

image (option 3)

fwyzard commented 7 months ago

44036 / #44042 should change all alpaka-based tests to fail with a warning message when there are no devices.

makortel commented 7 months ago

I think

image

would be sufficient, and the message is most concise.

After that, I'd prefer

image

over

image

fwyzard commented 7 months ago

Mhm... I don't see the quoted images :-(

Note that the message should be the same in all cases, it's just a matter of choice.

makortel commented 7 months ago

Ok, third attempt to quote pictures...

makortel commented 7 months ago

Going back to text then. My (not very strong) order of preferences would be

fwyzard commented 7 months ago

OK, so directly FAIL with the message, instead of issuing a INFO/WARN message and failing on the REQUIRE.

fwyzard commented 7 months ago

I'll update the PRs accordingly.