Open fwyzard opened 1 year ago
assign heterogeneous
New categories assigned: heterogeneous
@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.
A new Issue was created by @fwyzard Andrea Bocci.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
https://github.com/cms-sw/cmssw/pull/41341 should address the second bullet, at least in part. To be checked if there are any other leftovers.
On the first point, I wonder if we could (at least eventually) use scram itself to avoid running, in the first place, the tests that would fail because of missing hardware? @smuzaffar
Interesting idea.
For a CUDA-only test we could do something like this:
<bin name="cudaTest" file="cudaTest.cu">
<use name="cuda"/>
<flags TEST_RUNNER_CMD="cudaIsEnabled && ./cudaTest || echo 'Failed to initialise the CUDA runtime, the test will be skipped.'"/>
</bin>
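To illustrate how that `TEST_RUNNER_CMD` one-liner behaves, here is a minimal shell sketch; `cuda_is_enabled` and `run_cuda_test` are stand-ins (not real CMSSW tools), with the check hard-coded to fail as it would on a machine without a GPU:

```shell
#!/bin/sh
# Sketch of the TEST_RUNNER_CMD pattern; all names here are stand-ins.
cuda_is_enabled() { return 1; }               # pretend the CUDA runtime cannot initialise
run_cuda_test()   { echo "running cudaTest"; }

run_or_skip() {
  cuda_is_enabled && run_cuda_test \
    || echo "Failed to initialise the CUDA runtime, the test will be skipped."
}
run_or_skip
# Caveat of `a && b || c`: c also runs when the test b itself fails,
# so a genuine test failure would still be reported as "skipped".
```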
For an Alpaka-based test there are two problems: the name of the binary depends on the backend, and so does the check for the backend's availability.
The first problem could be solved if SCRAM could provide an environment variable that expands to the name of the binary (or to its full path) that can be used inside TEST_RUNNER_CMD, something like
<flags TEST_RUNNER_CMD="cudaIsEnabled && ./$TEST_BINARY_NAME || echo 'Failed to initialise the CUDA runtime, the test $TEST_BINARY_NAME will be skipped.'"/>
The second problem could be solved if SCRAM could provide an environment variable with the name of the Alpaka backend being tested, like SerialSync or ROCmAsync, something like
<flags TEST_RUNNER_CMD="alpakaIsEnabled$TEST_ALPAKA_BACKEND && ./$TEST_BINARY_NAME || echo 'Failed to initialise the $TEST_ALPAKA_BACKEND backend, the test $TEST_BINARY_NAME will be skipped.'"/>
Of course, it would be even nicer if SCRAM could automate this with a single flag, something like
<flags TEST_REQUIRE_ALPAKA_BACKEND/>
and report these tests as Skip in the console, instead of Pass/Fail.
And to clarify: alpakaIsEnabled$TEST_ALPAKA_BACKEND does not exist yet, but it's relatively trivial to write; I'll see if I can make a PR for it.
@fwyzard, I was also thinking along the lines of using cudaIsEnabled and rocmIsEnabled, or better alpakaIsEnabled$TEST_ALPAKA_BACKEND or alpakaIsEnabled rocm|cuda|serial|tbb|.... We can change the build rules so that a unit test is run only if it explicitly depends on
- rocm, and rocmIsEnabled or alpakaIsEnabledROCmAsync runs successfully;
- cuda, and cudaIsEnabled or alpakaIsEnabledCudaAsync runs successfully;
- alpaka, and alpakaIsEnabled$TEST_ALPAKA_BACKEND (for the selected backend) runs successfully.
alpakaIsEnabled$TEST_ALPAKA_BACKEND actually runs only if the hardware is available, e.g. via unit tests like
# For GPU IBs, or if cuda tests are explicitly enabled via env, i.e. SCRAM_ENABLE_TESTS=cuda
<iftest name="cuda">
<test name="alpakaIsEnabledCudaAsync" command="alpakaIsEnabledCudaAsync"/>
</iftest>
# For ROCm IBs, or if rocm tests are explicitly enabled via env, e.g. SCRAM_ENABLE_TESTS=rocm
<iftest name="rocm">
<test name="alpakaIsEnabledROCmAsync" command="alpakaIsEnabledROCmAsync"/>
</iftest>
SCRAM should run alpakaIsEnabled$TEST_ALPAKA_BACKEND for each supported backend once (before the start of the tests) and reuse the results.
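The probe-once-and-reuse idea can be sketched in shell as follows; `backend_enabled` stands in for the real alpakaIsEnabled&lt;Backend&gt; binaries, and the availability values are made up for illustration:

```shell
#!/bin/sh
# Sketch: probe each backend once, cache the result, reuse it for every test.
backend_enabled() {
  case "$1" in
    SerialSync) return 0 ;;   # CPU backend: always available
    CudaAsync)  return 1 ;;   # pretend no NVIDIA GPU on this machine
    ROCmAsync)  return 1 ;;   # pretend no AMD GPU on this machine
  esac
}

# probe every supported backend once, before any test starts
CACHE=""
for b in SerialSync CudaAsync ROCmAsync; do
  if backend_enabled "$b"; then CACHE="$CACHE $b"; fi
done

# later, each test only consults the cache
backend_cached() { case " $CACHE " in *" $1 "*) return 0 ;; *) return 1 ;; esac; }
backend_cached SerialSync && echo "run SerialSync tests"
backend_cached CudaAsync  || echo "skip CudaAsync tests"
```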
@aandvalenzuela is interested in improving the GPU/Alpaka tests. If the above sounds good, she can already look into implementing it.
Looks good to me.
I would suggest alpakaIsEnabled$TEST_ALPAKA_BACKEND
rather than alpakaIsEnabled rocm|cuda|serial|tbb|...
, because it is much simpler to make the back-end choice at compile time (and so end up with a different binary for each back-end) than at runtime (to support the choice based on the command line argument).
We have actually done the latter for pixeltrack-standalone, but not in CMSSW where we are relying on the plug-in system.
@makortel @fwyzard, https://github.com/cms-sw/cmssw-config/pull/95 implements the new unit test rules for the cuda/rocm/alpaka backends. See https://github.com/cms-sw/cmssw-config/pull/95#issuecomment-1513484911 and let me know whether this is sufficient.
@smuzaffar from the description it looks good.
Since the change impacts both "cuda" and "alpaka-cuda" tests, maybe it would actually be better to check cudaIsEnabled (and rocmIsEnabled for ROCm) instead of introducing an alpaka-specific check?
@smuzaffar From the descriptions it looks good to me too. My only question is about tests that run cmsRun (e.g. directly or via a shell script). From https://github.com/cms-sw/cmssw-config/pull/95 I understand the behavior is driven by the dependencies of the test. Is the <iftest name="cuda"> (from https://github.com/cms-sw/cmssw/issues/41340#issuecomment-1508374403) the way to control the execution of those tests?
@makortel, <iftest name="cuda"> is not needed any more. Execution of the test is driven by the test dependency itself. All tests with a direct cuda dependency (i.e. they have either <use name="cuda"/> or the alpaka-cuda backend enabled) will be skipped if cudaIsEnabled does not run successfully. Same for rocm-based tests.
For dedicated GPU IBs, we can set USER_UNIT_TESTS=cuda, which will force running only the unit tests with a cuda or alpaka-cuda dependency (all other tests will be skipped). This will also allow us to run GPU unit tests for PRs :-)
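A rough sketch of this selection logic, in shell; the matching rule and names below are illustrative, not scram's actual implementation:

```shell
#!/bin/sh
# Sketch: USER_UNIT_TESTS filters tests by their declared dependencies.
should_run() {
  deps="$1"    # the test's declared dependencies, space-separated
  if [ -n "${USER_UNIT_TESTS:-}" ]; then
    case " $deps " in
      *" $USER_UNIT_TESTS "*) return 0 ;;   # has the requested dependency
      *) return 1 ;;                        # everything else is skipped
    esac
  fi
  return 0   # no filter set: run (subject to cudaIsEnabled & co.)
}

USER_UNIT_TESTS=cuda
should_run "cuda alpaka" && echo "run the cuda-dependent test"
should_run ""            || echo "skip the test without a cuda dependency"
```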
@smuzaffar Thanks, but I'm still confused :) Let's take a concrete example https://github.com/cms-sw/cmssw/blob/5b979f23402cd1811d30eef9163d008c2b37fcd4/HeterogeneousCore/AlpakaTest/test/BuildFile.xml#L6-L10
Here the goal is to run the test script in one way when cuda tests should be run (i.e. either the machine has a GPU, or always in the GPU IBs), and in another way when there are no GPUs. The script internally calls cudaIsEnabled for the cpu option, to enable the CUDA code path outside of GPU IBs if the machine has a GPU.
Should this be expressed along
<test name="testHeterogeneousCoreAlpakaTestModulesCUDA" command="testAlpakaModules.sh cuda">
<use name="cuda"/>
</test>
<test name="testHeterogeneousCoreAlpakaTestModulesCPU" command="testAlpakaModules.sh cpu"/>
then? (and probably making the cpu option do nothing if the machine has a GPU)
Yes @makortel, that fragment of the BuildFile should be changed as you suggested.
- With USER_UNIT_TESTS=cuda (e.g. for GPU IBs), scram will only run testHeterogeneousCoreAlpakaTestModulesCUDA and skip testHeterogeneousCoreAlpakaTestModulesCPU (as the latter does not depend on cuda).
- Without USER_UNIT_TESTS, scram will run testHeterogeneousCoreAlpakaTestModulesCUDA if cudaIsEnabled is successful, e.g. on our ppc64le nodes (where we have a GPU). scram will always run testHeterogeneousCoreAlpakaTestModulesCPU.
- The testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via the command line.
testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via the command line.
... and adding a ROCm version.
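A minimal sketch of what such a simplified script could look like; the dispatch function and the echo lines are hypothetical stand-ins for the real cmsRun invocations in testAlpakaModules.sh:

```shell
#!/bin/sh
# Sketch: no cudaIsEnabled check inside the script, just run the backend
# passed on the command line.
run_backend() {
  case "$1" in
    cpu)  echo "running SerialSync tests" ;;
    cuda) echo "running CudaAsync tests" ;;
    rocm) echo "running ROCmAsync tests" ;;
    *)    echo "usage: testAlpakaModules.sh cpu|cuda|rocm" >&2; return 1 ;;
  esac
}
run_backend "${1:-cpu}"
```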
By the way, the new build rules that make use of USER_UNIT_TESTS=cuda to run cuda-specific tests are in the 13.0.X and 13.1.X IBs.
@fwyzard This issue has been completed, right?
I don't remember what it was about... I'll have to re-read the thread to remind myself of the details.
Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.
For example:
$ ./deviceVertexFinderByDensity_tROCmAsync
*** Break *** segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
0x7f06884fb731 <waitpid+33>: ja 0x7f06884fb788 <waitpid+120>
0x7f06884fb733 <waitpid+35>: ret
0x7f06884fb734 <waitpid+36>: nop DWORD PTR [rax+0x0]
0x7f06884fb738 <waitpid+40>: push r12
0x7f06884fb73a <waitpid+42>: mov r12d,edx
0x7f06884fb73d <waitpid+45>: push rbp
0x7f06884fb73e <waitpid+46>: mov rbp,rsi
#0 0x00007f06884fb72b in waitpid () from /lib64/libc.so.6
#1 0x00007f068845ccf7 in do_system () from /lib64/libc.so.6
#2 0x00007f068fd1881d in TUnixSystem::StackTrace() () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#3 0x00007f068fd181d4 in TUnixSystem::DispatchSignals(ESignals) () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#4 <signal handler called>
#5 std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6 std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7 alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8 main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================
The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
preferably using the command (.forum bug) in the ROOT prompt.
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs or (preferably) using the command (.gh bug) in
the ROOT prompt. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5 std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6 std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7 alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8 main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================
Playing around with scram b runtests_<test_name>, I think having scram b runtests_deviceVertexFinderByDensity_t run (or skip) the tests for all the backends of the test could be useful as well, rather than having to explicitly call scram b runtests_deviceVertexFinderByDensity_tROCmAsync.
Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.
This is actually as simple as
if (cms::alpakatools::devices<Platform>().empty()) {
  std::cout << "No devices found, the test will be skipped.\n";
  exit(EXIT_SUCCESS);
}
@makortel, do you think we need to wrap it in a function like cms::alpakatools::requireDevices<Platform>(), or is it simple enough that people can use those four lines directly?
I've updated an example in #44036 .
This is actually as simple as
if (cms::alpakatools::devices<Platform>().empty()) { std::cout << "No devices found, the test will be skipped.\n"; exit(EXIT_SUCCESS); }
This is similar to what we had at one point (and still have with direct CUDA). My concern with this specific approach is that an infrastructure problem, e.g. in the GPU IBs, would go unnoticed because the tests report success (by default scram uses the exit code of cudaIsEnabled to decide whether to run or skip cuda-dependent tests, but in the GPU IB tests USER_UNIT_TESTS=cuda is set to require all the GPU tests to be run).
If the test would return EXIT_FAILURE in this case (i.e. the test still fails, but with a meaningful message), I'd be fine with it.
I don't have strong feelings whether to abstract or copy-paste. Four lines isn't much, but it's still something. On the other hand, e.g. in Catch2-based tests one could do
REQUIRE(not cms::alpakatools::devices<Platform>().empty());
which I wouldn't consider to be worth abstracting.
Sure, I'm OK with returning EXIT_FAILURE.
In fact I coded it that way, then I switched to EXIT_SUCCESS because that's what we used in CUDA.
(will fix tomorrow)
For Catch2, currently we have a message at the beginning of the output coming from a cout:
(original)
I'd like to move it to use a Catch2 macro like FAIL or INFO.
Which one do you like better:
(option 1)
or
(option 2)
or the original one?
Another option, using WARN:
(option 3)
I think
would be sufficient, and the message is most concise.
After that, I'd prefer
over
Mhm... I don't see the quoted images :-(
Note that the message should be the same in all cases, it's just a matter of choice.
Ok, third attempt to quote pictures...
Going back to text then. My (not very strong) order of preferences would be
OK, so directly FAIL with the message, instead of issuing an INFO/WARN message and failing on the REQUIRE.
I'll update the PRs accordingly.
I'm collecting some "to do" items to improve the alpaka-related developments:
- cms::cudatest::requireDevices(); possibly also some explicit support for Catch2 tests.
- DataFormats/Portable/interface/Product.h, HeterogeneousCore/AlpakaCore/interface/ScopedContext.h, etc.