assign core
FYI @pcanal
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Thanks @makortel! I was running them locally and I can reproduce the errors in both ibmminsky-1 and 2, but I do not see any potential culprit in the list of merged PRs
I have been looking further into this issue and it seems it is caused by https://github.com/cms-sw/cmssw/pull/41590. If I use the latest IBs and revert the changes made in this PR, relval workflows run fine. I see @thesps is the original author of the PR. Could you please have a look at it?
In any case, I am still not sure why it is failing only on the ppc64le architecture. FYI, @makortel @smuzaffar @aloeliger
Is there a quick recipe to recreate this?
@aloeliger, you need to run one of the failing workflows on our powerpc nodes. Please email me your CERN login id and I will allow you access to these nodes.
@aloeliger, what I have done for testing is to use SCRAM to create a developer area with the latest IB (`CMSSW_13_2_X_2023-05-15-2300` in this case), then manually revert the changes of this PR by checking out only the touched modules with `git cms-addpkg`, recompile, and run the failing relvals.
I also tested it by taking the latest successful IB (`CMSSW_13_2_X_2023-05-10-2300` in this case) and using `git cms-merge-topic` to apply only the changes of the culprit PR.
But yes, I was logged in to one of our ppc64le nodes.
From a quick look at https://github.com/cms-sw/cmssw/pull/41590 it is far from clear why it would play a role in this problem. The PR changes `l1t::PFJet` by changing one data member from `std::array<uint64_t, 2>` to `std::array<std::array<uint64_t, 2>, 2>`.
But the stack traces in the issue description are about `trigger::TriggerObject` (and `reco::Muon`), which have no relation to `l1t::PFJet`. So I'm not sure if we can get much further without ROOT expertise (for which let me tag also @vgvassilev)
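For readers less familiar with the PR, the sketch below shows the kind of layout change described above; the member name `packedJet_` is hypothetical and this is not the actual `l1t::PFJet` definition.

```cpp
// Illustrative only: `packedJet_` is a hypothetical member name, not taken from
// the real l1t::PFJet class; this only shows the shape of the change discussed.
#include <array>
#include <cstdint>

class PFJetSketch {
public:
  // Before cms-sw/cmssw#41590: a single pair of packed 64-bit words.
  // std::array<uint64_t, 2> packedJet_;

  // After the PR: two such pairs, i.e. a nested std::array.
  std::array<std::array<uint64_t, 2>, 2> packedJet_;
};

int main() {
  PFJetSketch j{};
  j.packedJet_[0][1] = 42;  // indexing now needs two subscripts
  return 0;
}
```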
@pcanal, I am looking at that stack trace and wondering whether we forgot to lock our execution engine somewhere?
I also confirm that reverting https://github.com/cms-sw/cmssw/pull/41590 fixes the issue for ppc64le. I really have no idea why changing a data member from `std::array<uint64_t, 2>` to `std::array<std::array<uint64_t, 2>, 2>` causes this issue. I thought the GPU on the power node might be playing a role in this crash (as the workflows are only failing for ppc64le), but disabling the GPU and making sure that we load the NVIDIA stub libs still causes this failure.
Does this seem to be an L1 trigger issue at this point?
> @aloeliger, you need to run one of the failing workflows on our powerpc nodes. Please email me your CERN login id and I will allow you access to these nodes.
I'm holding off slightly on this, trying to get the original developer to debug it if this is indeed an L1 issue.
Oh, ppc64le explains it. The llvm JIT support for the powerpc architecture is limited and we might be hitting a bug in the JIT.
> Does this seem to be an L1 trigger issue at this point?
At this point the problem does not seem to be in L1 code (even if it somehow triggered the problem).
https://github.com/cms-sw/cmssw/pull/41706 fixes the failing workflows for ppc64le. I have no idea why :-( Maybe `ap_int.h` from https://github.com/cms-sw/cmssw/blob/master/DataFormats/L1TParticleFlow/interface/gt_datatypes.h#LL8C11-L8C19 is causing the llvm JIT to fail?
`ap_int.h` etc. get exposed to cling in other DataFormats code, so I don't think it is a general problem with the ap headers. Maybe we just end up shuffling some stuff around in memory so that the bug (whatever it is) manifests itself or not? :)
Hi, there has been another occurrence of those failures in `cling::IncrementalExecutor::runStaticInitializersOnce` on the CMSSW_13_2_X_2023-06-23-2300 IBs.
#14 0x000010094be10188 in ?? ()
#15 0x000010081b722c48 in cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&) () from /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02790/el8_ppc64le_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-23-2300/external/el8_ppc64le_gcc11/lib/libCling.so
#16 0x000010081b688394 in cling::Interpreter::executeTransaction(cling::Transaction&) () from /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02790/el8_ppc64le_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-23-2300/external/el8_ppc64le_gcc11/lib/libCling.so
Can we somehow get a minimal reproducing example?
Hi @vgvassilev, I have prepared a CMSSW area using a debug build of ROOT on one of our ppc64le machines. Here are the steps to reproduce the issue:

- From lxplus, as your CERN login user, you should be able to log in with `ssh cmsbuild@ibmminsky-1`.
- Start the container: `/cvmfs/cms.cern.ch/common/cmssw-el8 --bind /scratch:/scratch --nv`.
- The prepared CMSSW area is `/scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300`; set `export SCRAM_ARCH=el8_ppc64le_gcc11`.
- Go to the `src` directory and run `cmsenv`.
- Check `which root`. It should point to `/scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300/external/el8_ppc64le_gcc11/bin/root` instead of the cvmfs installation.
- Run workflow `141.001`: `runTheMatrix.py -i all -l 141.001 -t 4 --ibeos`.

I could reproduce the errors following these steps, let me know if they work for you too. Thanks!
@aandvalenzuela, thanks a lot. I am back from a workshop and I am still digging myself out.
Hello @vgvassilev, is the debug build helping with this issue? Please, let me know if I can provide something else :)
Hi @aandvalenzuela, it seems that I am not part of the cmsbuild group. I cannot seem to log in.
Hello @vgvassilev, I checked and you should be able to log in now. Can you retry?
I can log in now, but I get:
********** ERROR: Missing Release top ************
The release area "/cvmfs/cms-ib.cern.ch/sw/ppc64le/week1/el8_ppc64le_gcc11/cms/cmssw-patch/CMSSW_13_2_X_2023-06-26-2300"
for "CMSSW" version "CMSSW_13_2_X_2023-06-26-2300" is not available/usable.
In case this release has been deprecated, you can move your code to
one of the following release(s) of release series "CMSSW_13_2".
CMSSW_13_2_0_pre1
CMSSW_13_2_0_pre2
CMSSW_13_2_X_2023-07-09-2300
CMSSW_13_2_X_2023-07-10-2300
CMSSW_13_2_X_2023-07-11-2300
CMSSW_13_2_X_2023-07-12-2300
CMSSW_13_2_X_2023-07-13-2300
CMSSW_13_2_X_2023-07-14-2300
CMSSW_13_2_X_2023-07-16-2300
@vgvassilev, once logged in to the ibmminsky node, please do the following to set up the CMSSW environment:
cd /scratch/cmsbuild
/cvmfs/cms.cern.ch/common/cmssw-el8 --bind /scratch:/scratch --nv
Singularity> scram p /scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300
Singularity> cd CMSSW_13_2_X_2023-06-26-2300/
Singularity> cmsenv
Hi @smuzaffar, thanks, that works. Now I get:
Singularity> runTheMatrix -i all -l 141.001 -t 4 --ibeos
bash: runTheMatrix: command not found
Did you run `cmsenv` and were there any errors? Maybe try `` eval `scram run -sh` ``, which is what `cmsenv` does.
I get:
-bash: /afs/cern.ch/user/c/cmsbuild/.bash_profile: Permission denied
-bash-4.2$ cd /scratch/cmsbuild
-bash-4.2$ /cvmfs/cms.cern.ch/common/cmssw-el8 --bind /scratch:/scratch --nv
bash: /afs/cern.ch/user/c/cmsbuild/.bashrc: Permission denied
Singularity> scram p /scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300
WARNING: There already exists /scratch/cmsbuild/CMSSW_13_2_X_2023-06-26-2300 area for SCRAM_ARCH el8_ppc64le_gcc11.
Singularity> cd CMSSW_13_2_X_2023-06-26-2300/
Singularity> cmsenv
Singularity> runTheMatrix.py -i all -l 141.001 -t 4 --ibeos
bash: runTheMatrix.py: command not found
Singularity> eval `scram run -sh`
Singularity> runTheMatrix.py -i all -l 141.001 -t 4 --ibeos
bash: runTheMatrix.py: command not found
Ah, it looks like you are not using `~cmsbuild/public/lxplus` to log in. From lxplus (using your account) run `~cmsbuild/public/lxplus` to log in as `cmsbuild` to lxplus (which will generate the proper krb5 token for cmsbuild), and then `ssh cmsbuild@ibmminsky-1` to log in to the ppc64le node.
Ok, that makes bash happier, but:
ibmminsky-1:~> cd /scratch/cmsbuild
ibmminsky-1:cmsbuild> /cvmfs/cms.cern.ch/common/cmssw-el8 --bind /scratch:/scratch --nv
Singularity> scram p /scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300
WARNING: There already exists /scratch/cmsbuild/CMSSW_13_2_X_2023-06-26-2300 area for SCRAM_ARCH el8_ppc64le_gcc11.
Singularity> cd CMSSW_13_2_X_2023-06-26-2300/
Singularity> cmsenv
Singularity> runTheMatrix -i all -l 141.001 -t 4 --ibeos
bash: runTheMatrix: command not found
Singularity> eval `scram run -sh`
Singularity> runTheMatrix -i all -l 141.001 -t 4 --ibeos
bash: runTheMatrix: command not found
Singularity> ls
biglib bin cfipython config doc include lib logs objs python src static test tmp
Singularity> pwd
/scratch/cmsbuild/CMSSW_13_2_X_2023-06-26-2300
Singularity>
@aandvalenzuela, it looks like you did not build the full cmssw with debug root. I see that `/scratch/root-debug/CMSSW_13_2_X_2023-06-26-2300/bin/el8_ppc64le_gcc11` is empty.
Thanks! I am building it now. My bad, I did not notice it since the IB was available when I reproduced the failures.
Until May 22nd we had the same errors [a], which were fixed by https://github.com/cms-sw/cmssw/pull/41706 [b]. The last successful ppc64le IB was the June 19th 23h00 IB, and the next ppc64le IB, built on June 22nd 23h00, had the same error [c]. https://github.com/cms-sw/cmssw/compare/dba33c9327ba9e8a916cfdaece1057dd8e805b73...c34a3182b11e08fde1652fcdeab4ef1a76028908 are the changes which were merged in cmssw between June 19th and June 22nd (we also updated a few externals, mainly cuda 11.8 and gcc 11.4). I will check if `ap_int.h` is again causing this problem (https://github.com/cms-sw/cmssw/issues/41658#issuecomment-1551427424).
[a]

| Release/arch | workflow/step | exit code |
|---|---|---|
| CMSSW_13_2_X_2023-05-21-2300/el8_ppc64le_gcc11 | 4.22/step2 | 62,720 |

[b]

| Release/arch | workflow/step | exit code |
|---|---|---|
| CMSSW_13_2_X_2023-05-22-2300/el8_ppc64le_gcc11 | 4.22/step2 | 0 |

[c]

| Release/arch | workflow/step | exit code |
|---|---|---|
| CMSSW_13_2_X_2023-06-22-2300/el8_ppc64le_gcc11 | 4.22/step2 | 62,720 |
I certainly see an inclusion of `ap_int.h` in `L1Trigger/Phase2L1ParticleFlow/plugins/L1NNTauProducer.cc`.
@vgvassilev, I think this inclusion is fine as it is in a cmssw plugin. Previously ROOT cling failed because this header was part of a ROOT dictionary. Maybe there is some indirect include which is bringing this header into one of the cmssw ROOT dictionaries.
Maybe https://github.com/cms-sw/cmssw/compare/dba33c9327ba9e8a916cfdaece1057dd8e805b73...c34a3182b11e08fde1652fcdeab4ef1a76028908#diff-1194e2aafd6ecf87da700631ba27cb1f104228e8a16370ce256a3971c49d25e1R8, which brings in https://github.com/cms-sw/cmssw/compare/dba33c9327ba9e8a916cfdaece1057dd8e805b73...c34a3182b11e08fde1652fcdeab4ef1a76028908#diff-1194e2aafd6ecf87da700631ba27cb1f104228e8a16370ce256a3971c49d25e1R8
That does not open anything for me.
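To illustrate the "indirect include" scenario mentioned above: in CMSSW a package's ROOT dictionary is generated by rootcling from the headers pulled in via its `src/classes.h` (together with `classes_def.xml`), and cling re-parses the same payload at runtime, so any header reachable from those includes ends up exposed to cling. The fragment below is hypothetical (not an actual CMSSW file) and only shows the mechanism.

```cpp
// Hypothetical DataFormats/SomePackage/src/classes.h -- not a real CMSSW file.
// rootcling parses everything reachable from the headers listed here when the
// dictionary is generated, and cling re-parses the same payload at runtime.
// An include chain like
//   classes.h -> PFJet.h -> gt_datatypes.h -> ap_int.h
// would therefore be enough to expose ap_int.h to cling even though classes.h
// never names it directly.
#include "DataFormats/Common/interface/Wrapper.h"
#include "DataFormats/L1TParticleFlow/interface/PFJet.h"
```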
Hello @vgvassilev,
In order to provide a better way to debug ROOT issues, we have set up a new IB with ROOT already in debug mode that we can trigger on demand (`ROOTDBG_X`). That makes it simpler to reproduce the issues. From `ibmminsky-1`, as `cmsbuild`, you can run:

- `/cvmfs/cms.cern.ch/common/cmssw-el8`
- Go to `/tmp`
- `export SCRAM_ARCH=el8_ppc64le_gcc11`
- `source /cvmfs/cms.cern.ch/cmsset_default.sh`
- `scram -a $SCRAM_ARCH project CMSSW_13_3_ROOTDBG_X_2023-07-20-1100`
- Go to the `CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/src` directory and run `cmsenv`
- `runTheMatrix.py -i all -l 141.001 -t 4 --ibeos`

I have just tested it on the same machine (`/tmp/avalenzu/CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/src/141.001_RunMuon2023B/step3_RunMuon2023B.log`), so I hope there are no issues this time.
Thanks! Andrea.
@vgvassilev, using `CMSSW_13_3_ROOTDBG_X_2023-07-20-1100` (which has root/llvm built in debug mode) I get the following:
#14 0x00003ff6c1360188 in ?? ()
#15 0x00003ff75f24dab0 in cling::IncrementalExecutor::executeInit (this=0x3fff74201100, function=...) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:281
#16 0x00003ff75f249fa0 in cling::IncrementalExecutor::runStaticInitializersOnce (this=0x3fff74201100, T=...) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:304
#17 0x00003ff75f077530 in cling::Interpreter::executeTransaction (this=0x3fff75650a80, T=...) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/Interpreter.cpp:1714
#18 0x00003ff75f29a938 in cling::IncrementalParser::commitTransaction (this=0x3fff752f7400, PRT=..., ClearDiagClient=true) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:675
#19 0x00003ff75f29b5a8 in cling::IncrementalParser::Compile (this=0x3fff752f7400, input=..., Opts=...) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:846
#20 0x00003ff75f075188 in cling::Interpreter::DeclareInternal (this=0x3fff75650a80, input=..., CO=..., T=0x0) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/Interpreter.cpp:1353
#21 0x00003ff75f0733e4 in cling::Interpreter::parseForModule (this=0x3fff75650a80, input=...) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/interpreter/cling/lib/Interpreter/Interpreter.cpp:943
#22 0x00003ff75ed194ac in ExecAutoParse (what=0x3ff713b67048 "\n#line 1 \"DataFormatsHLTReco_xr dictionary payload\"\n\n#ifndef CMS_DICT_IMPL\n #define CMS_DICT_IMPL 1\n#endif\n#ifndef _REENTRANT\n #define _REENTRANT 1\n#endif\n#ifndef GNUSOURCE\n #define GNUSOURCE 1\n#en"..., header=false, interpreter=0x3fff75650a80) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/core/metacling/src/TCling.cxx:6282
#23 0x00003ff75ed19e90 in TCling::AutoParseImplRecurse (this=0x3fff7533a800, cls=0x3ff70fc75e60 "trigger::TriggerObject", topLevel=true) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/core/metacling/src/TCling.cxx:6387
#24 0x00003ff75ed1a5f0 in TCling::AutoParse (this=0x3fff7533a800, cls=0x3ff70fc75e60 "trigger::TriggerObject") at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/core/metacling/src/TCling.cxx:6472
#25 0x00003fff78d0d18c in TClass::LoadClassInfo (this=0x3ff71160f780) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/core/meta/src/TClass.cxx:5822
#26 0x00003fff78d06b00 in TClass::GetMethodWithPrototype (this=0x3ff71160f780, method=0x3fffda937898 "eta", proto=0x3fffda937180 "", objectIsConst=true, mode=ROOT::kConversionMatch) at /scratch/cmsbuild/jenkins_b/workspace/build-any-ib/w/BUILD/el8_ppc64le_gcc11/lcg/root/6.26.11-87279561cc6487d2183bf0f1301cdedf/root-6.26.11/core/meta/src/TClass.cxx:4446
#27 0x00003fff7a27d0f8 in edm::TypeWithDict::functionMemberByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) const () from /cvmfs/cms-ib.cern.ch/sw/ppc64le/week0/el8_ppc64le_gcc11/cms/cmssw/CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/lib/el8_ppc64le_gcc11/libFWCoreReflection.so
#28 0x00003ff7414d917c in reco::findMethod (t=..., name=..., args=..., fixuppedArgs=..., iIterator=0x3fffda938544 "eta) < 2.0", oError=@0x3fffda937538: 1) at /scratch/cmsbuild/test/CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/src/CommonTools/Utils/src/findMethod.cc:168
#29 0x00003ff741496014 in reco::parser::MethodSetter::push (this=0x3ff6d85fa6b8, name=..., args=..., begin=0x3fffda938544 "eta) < 2.0", deep=<optimized out>) at /scratch/cmsbuild/test/CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/src/CommonTools/Utils/src/MethodSetter.cc:61
#30 0x00003ff741497b58 in reco::parser::MethodSetter::operator() (this=0x3ff6d85fa6b8, begin=0x3fffda938544 "eta) < 2.0", end=<optimized out>) at /scratch/cmsbuild/test/CMSSW_13_3_ROOTDBG_X_2023-07-20-1100/src/CommonTools/Utils/src/MethodSetter.cc:51
#31 0x00003ff7414d1240 in boost::spirit::classic::attributed_action_policy<boost::spirit::classic::nil_t>::call<reco::parser::MethodSetter, char const*> (last=<optimized out>, first=<synthetic pointer>: <optimized out>, actor=...) at /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02794/el8_ppc64le_gcc11/external/boost/1.80.0-aad9357db038a7b4cef32f5bd3ac318a/include/boost/spirit/home/classic/core/scanner/scanner.hpp:141
#32 boost::spirit::classic::action_policy::do_action<reco::parser::MethodSetter, boost::spirit::classic::nil_t, char const*> (last=<optimized out>, first=<synthetic pointer>: <optimized out>, val=<synthetic pointer>..., actor=..., this=0x3fffda938000) at /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02794/el8_ppc64le_gcc11/external/boost/1.80.0-aad9357db038a7b4cef32f5bd3ac318a/include/boost/spirit/home/classic/core/scanner/scanner.hpp:162
#33 boost::spirit::classic::action<boost::spirit::classic::contiguous<boost::spirit::classic::sequence<boost::spirit::classic::alpha_parser, boost::spirit::classic::kleene_star<boost::spirit::classic::chset<char> > > >, reco::parser::MethodSetter>::parse<boost::spirit::classic::scanner<char const*, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> > > (scan=..., this=0x3ff6d85fa6a8) at /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02794/el8_ppc64le_gcc11/external/boost/1.80.0-aad9357db038a7b4cef32f5bd3ac318a/include/boost/spirit/home/classic/core/composite/actions.hpp:117
#34 boost::spirit::classic::sequence<boost::spirit::classic::action<boost::spirit::classic::contiguous<boost::spirit::classic::sequence<boost::spirit::classic::alpha_parser, boost::spirit::classic::kleene_star<boost::spirit::classic::chset<char> > > >, reco::parser::MethodSetter>, boost::spirit::classic::optional<boost::spirit::classic::sequence<boost::spirit::classic::chlit<char>, boost::spirit::classic::chlit<char> > > >::parse<boost::spirit::classic::scanner<char const*, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> > > (this=0x3ff6d85fa6a8, scan=...) at /cvmfs/cms-ib.cern.ch/sw/ppc64le/nweek-02794/el8_ppc64le_gcc11/external/boost/1.80.0-aad9357db038a7b4cef32f5bd3ac318a/include/boost/spirit/home/classic/core/composite/sequence.hpp:60
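Frames #26 to #30 above show that the crash path starts from a cut-string parse in CommonTools/Utils, which looks up the `eta` method on `trigger::TriggerObject` and thereby forces TClass to autoparse the DataFormatsHLTReco dictionary payload in cling. A minimal sketch of that trigger (assuming the standard `StringCutObjectSelector` API; this is not the exact module from the failing workflow) is:

```cpp
// Minimal sketch, not the actual failing module: evaluating a cut string makes
// reco::findMethod look up "eta" on trigger::TriggerObject, which goes through
// TClass::AutoParse -> cling::Interpreter::executeTransaction as in the trace.
#include "CommonTools/Utils/interface/StringCutObjectSelector.h"
#include "DataFormats/HLTReco/interface/TriggerObject.h"

bool passesEtaCut(const trigger::TriggerObject& obj) {
  // The first evaluation compiles the expression and resolves eta() via the
  // ROOT/cling reflection machinery shown in the stack trace above.
  static const StringCutObjectSelector<trigger::TriggerObject> select("abs(eta) < 2.0");
  return select(obj);
}
```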
@vgvassilev, the latest IB to reproduce this issue is `CMSSW_13_3_ROOTDBG_X_2023-07-26-1100` now.
@vgvassilev, did you manage to look into this?
> @vgvassilev, did you manage to look into this?
Yes, I logged in yesterday but it did not crash for me. I will retry today.
@vgvassilev, here are simple steps to reproduce this after logging in as cmsbuild to the ibmminsky-1 node:
> cd /scratch/cmsbuild/issue41658/CMSSW_13_3_ROOTDBG_X_2023-07-26-1100/4.22/
> /cvmfs/cms.cern.ch/common/cmssw-el8 -B /scratch
Singularity> cmsenv
Singularity> export CMS_PATH="/cvmfs/cms-ib.cern.ch"
Singularity> export SITECONFIG_PATH="/cvmfs/cms-ib.cern.ch/SITECONF/local"
Singularity> cmsRun step2_RAW2DIGI_L1Reco_RECO_DQM.py
Yes, I can reproduce it now! Do we have the build of ROOT with `-DLLVM_BUILD_TYPE=Debug`?
@vgvassilev, all `ROOTDBG` IBs have root built with `-DLLVM_BUILD_TYPE=Debug`, so `CMSSW_13_3_ROOTDBG_X_2023-07-26-1100` should already have the debug build of root. The latest cmssw IB with debug root/llvm is `CMSSW_13_3_ROOTDBG_X_2023-09-14-1100`.
This seems to be a problem with the RuntimeDyld part of the LLVM JIT in the ppc64 backend. We are trying to phase out this logic altogether with the upgrade to llvm16 and possibly backporting the new JIT ppc64 backend (as discussed in https://github.com/root-project/root/pull/13273#issuecomment-1664374323).
Is this issue critical to fix, given that I am not sure what the underlying issue is?
@bzEq, apologies for summoning you on a random issue, but this software stack is the main motivation of pursuing the JitLink backend for ppc64 (via the ROOT project and Cling). Once we land https://github.com/cms-sw/cmssw/issues/41658 we can backport the new JitLink to it and deploy it on a major scientific workflow. That can probably happen now assuming we know which patches we need to backport to llvm16 to enable the ppc jitlink. Could you help us out?
Independently of that, are you interested in resolving issues with regard to ppc64 and RuntimeDyld of the kind described here?
cc: @lhames
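For context on the RuntimeDyld vs. JITLink distinction discussed above, here is a rough sketch (not ROOT/cling code; the exact ORC constructor and creator signatures vary between LLVM releases, this is written against roughly LLVM 15/16) of how an ORC LLJIT instance can be configured to use the JITLink-based `ObjectLinkingLayer` instead of the RuntimeDyld-based default, which is the kind of switch being pursued for ppc64:

```cpp
// Sketch only -- not ROOT/cling code. Builds an ORC LLJIT that links objects
// with the JITLink-based ObjectLinkingLayer rather than RuntimeDyld.
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ObjectLinkingLayer.h"
#include "llvm/Support/Error.h"

#include <memory>

static llvm::ExitOnError ExitOnErr;

std::unique_ptr<llvm::orc::LLJIT> makeJitLinkBasedJIT() {
  return ExitOnErr(
      llvm::orc::LLJITBuilder()
          .setObjectLinkingLayerCreator(
              [](llvm::orc::ExecutionSession &ES, const llvm::Triple &)
                  -> llvm::Expected<std::unique_ptr<llvm::orc::ObjectLayer>> {
                // Use JITLink (ObjectLinkingLayer) instead of RuntimeDyld.
                return std::make_unique<llvm::orc::ObjectLinkingLayer>(ES);
              })
          .create());
}
```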
> Is this issue critical to fix, given that I am not sure what the underlying issue is?
@vgvassilev, `ppc64le` is one of the validated architectures, so yes, it is critical for CMS to keep it in good health.
> Could you help us out?
Sure. Do we have to make JITLink the default on ppc64 in trunk, or in the downstream llvm-16, when porting the ppc64 backend?
I'm looking for the relevant commits; some conflicts have to be resolved, and I'll post a PR once I have done it. Update: basically, we have to backport:
1dae4dd0d80f [JITLink][PowerPC] Fix incorrect assertion of addend for R_PPC64_REL24
94239712eb17 Fix typos in comments of ExecutionEngine (NFC)
b6e2eac2930e [JITLink][PowerPC] Add relocations included in rtdyld but missing from jitlink
d6791fb77402 [JITLink][PowerPC] Fix relocations in stubs for ppc64 big-endian target
9c38a178d3a6 [JITLink][PowerPC] Add basic TLS support for ppc64
5cb2a78ac2fe [Orc][PowerPC] Enable ELFNixPlatform support for ppc64le
ca6d86f6bf12 [JITLink][PowerPC] Support R_PPC64_PCREL34
11a02de7829a [JITLink][PowerPC] Change method to check if a symbol is external to current object
7bf9c5bbb7d1 [JITLink] ppc64.h - fix MSVC "not all control paths return a value" warning. NFC.
995f199f0a76 [JITLink][PowerPC] Correct handling of R_PPC64_REL24_NOTOC
74f2a76904d7 [JITLink] Rename TableManager::appendEntry, add comment.
79786c4d23f1 [JITLink][PowerPC] Fixed unused variable warning. NFC.
61358d4fbeb3 [JITLink][PowerPC] Add TOC and relocations for ppc64
52b88457baf8 [JITLink] Use SubtargetFeatures to store features in LinkGraph
846bde483d63 Silence switch statement contains 'default' but no 'case' labels warning; NFC
8313507a7c3f [JITLink][ELF][ppc64] Add skeleton ppc64 support and ELF/ppc64 JITLink backend.
BTW, does https://github.com/root-project/llvm-project accept PRs?
Many workflows failed in CMSSW_13_2_X_2023-05-11-2300 el8_ppc64le_gcc11 with
(from https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_ppc64le_gcc11/CMSSW_13_2_X_2023-05-11-2300/pyRelValMatrixLogs/run/136.7801_RunHLTPhy2017B_AOD/step2_RunHLTPhy2017B_AOD.log#/)