Closed Dr15Jones closed 4 years ago
A new Issue was created by @Dr15Jones Chris Jones.
@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Adding an assert
to the code just before the problem, I was able to look at the variables in the debugger. The problem was at this line
https://github.com/cms-sw/cmssw/blob/fe5b448929b4fec466979097e4e4faad8cf8666d/L1Trigger/L1TMuonEndCap/src/bdt/Node.cc#L327
where the value of sv
comes from the member variable splitVariable
and has the value of 9 and the rest of the object
print *this
$4 = {name = "root left left left left", leftDaughter = 0x60d005371dc0, rightDaughter = 0x60d005371e90, parent = 0x60d005371a80, splitValue = 1, splitVariable = 9, errorReduction = -1, totalError = -1, avgError = -1,
fitValue = 0.00040209802682511508, numEvents = -1094795586, events = std::vector of length 0, capacity 0}
while e->data has size 8
(gdb) print *e
$3 = {trueValue = 0, predictedValue = 0.13586474777853488, DTPt = 0, CSCPt = 0, tmvaPt = 0, tmvaPt1 = 0, Mode = 0, Quality = 0, static sortingIndex = 1, id = 0, data = std::vector of length 8, capacity 8 = {1, 1.4750000238418579, -9,
-11, 1, -1, 0, 0}}
So there is a problem with the algorithm.
assign l1
New categories assigned: l1
@benkrikler,@rekovic you have been requested to review this Pull request/Issue and eventually sign? Thanks
The size of e->data
is hard coded by the calling algorithm to by 8. See
https://github.com/cms-sw/cmssw/blob/fe5b448929b4fec466979097e4e4faad8cf8666d/L1Trigger/L1TMuonEndCap/src/PtAssignmentEngine2016.cc#L635-L664
So this seems to imply that the boosted decision tree being used is incompatible with the actual code calling it.
We have 96 failing workflows in the ASAN IB and based on a random sampling of them it appears they all are from this problem.
I've been unable to find old ASAN reports, so based solely on recent pull requests, the most likely culprints are #29260 or #29080
ping
Is anyone looking at this? I think it is also causing crashes in the ROOT6 IBs.
I'm sorry for the delay in getting back. Could you check if the failing workflows are all using Run2_2016 era? If so, I think it's related to https://github.com/cms-sw/cmssw/issues/29237 . I believe the main problem is due to loading the wrong global tag, and thus the wrong BDT trees. I hope there is no new issue in Node.cc
.
@jiafulow you can actually look yourself by going to https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_11_1/2020-04-24-2300?selectedArchs=slc7_amd64_gcc820&selectedFlavors=ASAN_X&selectedStatus=failed
which was the most recent run of address sanitizer (ASAN). It looks like each of the failures labelled 256
are caused by this problem. It appears there are 93 workflows failing under ASAN because of this problem.
@jiafulow Any updates? The ASAN is still reporting the issue, e.g. here for workflow 1310.0 https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_1_ASAN_X_2020-05-01-2300/pyRelValMatrixLogs/run/1310.0_ADDMonoJet_d3MD3_13+ADDMonoJet_d3MD3_13INPUT+DIGIUP15+RECOUP15+HARVESTUP15/step2_ADDMonoJet_d3MD3_13+ADDMonoJet_d3MD3_13INPUT+DIGIUP15+RECOUP15+HARVESTUP15.log#/
assign alca
New categories assigned: alca
@christopheralanwest,@tlampen,@pohsun,@tocheng you have been requested to review this Pull request/Issue and eventually sign? Thanks
@christopheralanwest,@tlampen,@pohsun,@tocheng
I believe the main problem is due to loading the wrong global tag, and thus the wrong BDT trees.
Could you please help to double-check if the conditions payload(s) related to L1TMuonEndCapTrackProducer
in e.g. workflow 1310.0 are correct or not? Thanks.
@makortel using the provided link (updated to 2020-05-01-2300) , the following workflows that failed with error code 256 (incl 1310.0) are all using '--era Run2_2016':
(I didn't check beyond 1314.0. Note that 534.0 and 536.0 failures are not due to EMTF.)
As I wrote above, I believe it's related to the issue #29237 which was due to GlobalTag confusion. Unfortunately I don't really know how to fix it. It would require experts who know how to fix the GTs and how to load the correct GTs for the Run2_2016 era in 'runTheMatrix'.
Thanks @jiafulow, let's try to push #29237 to be fixed then.
Let me add that I find it disturbing that wrong conditions payload causes the code to misbehave this way. It would be much better that the code either always works according to the payload, or would be able check upfront if the payload is inconsistent with the code.
Thanks for the suggestion, this is something we need to do better. In this particular case, the payload is a set of BDT trees. The number of trees and the number of variables, as well as the definitions of the variables, are different for different eras (2016 vs 2017/18). When there is a mismatch between the payload and the C++ codes, it crashes.
When there is a mismatch between the payload and the C++ codes, it crashes.
Now it appears to not to crash but to produce undefined results.
@jiafulow , We still have this error in ASAN IBs[a], I debugged it a bit an it happens when splitVariable >= 8 ( https://github.com/cms-sw/cmssw/blob/master/L1Trigger/L1TMuonEndCap/src/bdt/Node.cc#L327 ) as @Dr15Jones mentioned the event data has fixed size of 8 ( https://github.com/cms-sw/cmssw/blob/master/L1Trigger/L1TMuonEndCap/src/PtAssignmentEngine2016.cc#L632-L647 )
@smuzaffar , I explained the issue in #29237. The quick summary is that back in Feb 2020, someone from L1T has made changes to the global tag used in 10_6_X for era=Run2_2016. This revealed some issue in the software (specifically, the BDT loaded from global tag did not match the software after the change). We fixed that and put the fix in both 10_6_X and 11_1_X. But it turned out that the global tag used in 11_1_X for era=Run2_2016 has not been updated accordingly. This means that, after the fix, the loaded BDT did not match the software again, but now in 11_1_X.
We are still waiting for #29237 to be resolved. In the meanwhile, perhaps I can try some incomplete fix? I'm not sure how to debug ASAN errors. If I modify the problematic lines at Node.cc#L327-L330 to do:
if (sv < e->data.size() && e->data[sv] < sp)
nextNode = left;
if (sv < e->data.size() && e->data[sv] >= sp)
nextNode = right;
Would it make the ASAN errors go away?
Thanks @jiafulow for the explaination. Sorry I did not read the #29237
You can reproduce the error in ASAN IBs by doing this
scram p CMSSW_11_2_ASAN_X_2020-05-22-2300
cd CMSSW_11_2_ASAN_X_2020-05-22-2300/
cmsenv
runTheMatrix.py -i all --ibeos -t 4 -l 1330.0
Yes your proposed fix should avoid the ASAN error but I would suggest the following change
- int sv = splitVariable;
+ unsigned int sv = splitVariable;
double sp = splitValue;
Node* left = leftDaughter;
Node* right = rightDaughter;
Node* nextNode = nullptr;
- if (left == nullptr || right == nullptr)
+ if (left == nullptr || right == nullptr || sv >= e->data.size())
@smuzaffar , thanks for the instructions. I tried your suggestion and it seems to work. Before the fix, I got:
1 0 0 0 0 tests passed, 0 1 0 0 0 failed
After the fix, I got:
1 1 1 1 1 tests passed, 0 0 0 0 0 failed
Would you implement the fix? Or should I submit a pull request?
Go ahead @jiafulow and sumbit the PR. Thanks,
Ok, I submitted a PR with your suggestion.
ASAN IBs are good now, we do not see this isssue any more
The address sanitizer has found the following problem
I got the stack trace by rebuilding the code with debug information on and then running workflow 1302.0 with just one thread.