Closed by @missirol 4 months ago
cms-bot internal usage
A new Issue was created by @missirol.
@Dr15Jones, @sextonkennedy, @makortel, @smuzaffar, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, reconstruction
New categories assigned: hlt,reconstruction
@Martin-Grunewald, @mmusich, @jfernan2, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks
type egamma, tracking
I'm guessing that similar to the intel-vs-AMD case in cms-sw/cmssw#40089 and related cms-sw/cmsdist#8280 there will be a connection to compilation flags especially in the fast-math related subset.
It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.
@missirol @sanuvarghese please clarify if the tests you made were done on the same hardware (what kind of hardware was it?)
> please clarify if the tests you made were done on the same hardware (what kind of hardware was it?)

Yes, I tested the reproducer in CMSHLT-3124 (comment) on lxplus806 (running with 14_0_0_pre3 and 14_0_0_pre3_MULTIARCHS on that same machine), and with this I see a difference in the trigger results of HLT_Ele30_WPTight_Gsf_v* between the two releases.
[lxplus806] > lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 10
On-line CPU(s) list: 0-9
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 10
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Cascadelake)
Stepping: 6
CPU MHz: 2095.076
BogoMIPS: 4190.15
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-9
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke avx512_vnni md_clear arch_capabilities
> It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

From memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3.
This enables instructions like SSE4.2, AVX, AVX2 and FMA.
And the only plausible culprit is FMA. If it is FMA, one cannot state that one result is correct and the other wrong: they are two different approximations.
An easy, if messy, test is to copy the "standard" libraries for GSF... into the x86-64-v3 directory and see if the difference fades away.
A bit of a Frankenstein:

```
cmsrel CMSSW_14_0_0_pre3_MULTIARCHS
cd CMSSW_14_0_0_pre3_MULTIARCHS/
cmsenv
git cms-addpkg TrackingTools/GsfTracking
scram b clean
scram b -j 8
edmPluginRefresh
edmPluginDump -f | grep innocent
cd ../lib/el8_amd64_gcc12/
# copy sse libs in avx2 dir....
cp * scram_x86-64-v3/.
```
and then the comparison shows no difference
> It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

> By memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3.
@smuzaffar Probably more for future reference, where exactly are these specified? (I tried to look a little bit, but didn't find)
I'd argue FMA leading to ~3x difference in momentum is an indication of the algorithm(s) being (very) unstable. In a way the AMD vs Intel difference https://github.com/cms-sw/cmssw/issues/40089 was also a sign of such instabilities, but there the problem was removed by disabling the particular optimization. I guess in this case disabling the FMA would defeat the purpose though. In principle specific options such as -ffp-contract=off
(e.g. https://stackoverflow.com/questions/42132260/generic-way-of-handling-fused-multiply-add-floating-point-inaccuracies and links therein) could be tested, but I'm not sure that would really be worth the effort.
-ffp-contract=off will de facto remove FMA, defeating the purpose of the upgrade to AVX2 (one would have to manually introduce calls to "fma" everywhere it is deemed useful).
GSF is indeed unstable in some corners of phase space, in particular for fake electrons, and it is not at all proven that the less accurate answer is the one with FMA...
Btw, I tested it on ARM, which generates fma instructions (checked with objdump), and at least for that event it reproduces the trigger results of SSE, not AVX2. We can check the details of the GSF reconstruction in all three cases.
TO BE FULLY VERIFIED!!!!
[innocent@patatrack01 src]$ git diff

```diff
diff --git a/DataFormats/Math/interface/SIMDVec.h b/DataFormats/Math/interface/SIMDVec.h
index e4c52cc301d..d2abc01b306 100644
--- a/DataFormats/Math/interface/SIMDVec.h
+++ b/DataFormats/Math/interface/SIMDVec.h
@@ -3,12 +3,8 @@
 #if (defined(__CLING__) || defined(__MIC__) || defined(__NVCC__)) || (__BIGGEST_ALIGNMENT__ < 16)
 #elif defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
-#if defined(__x86_64__) && defined(__SSE__)
-#define USE_SSEVECT
-#else
 #define USE_EXTVECT
 #endif
-#endif
 // to be moved elsewhere
 namespace mathSSE {
```
remove the difference between x86-64-v3 and sse3
1) contraction is different
2) there is a bug in the AVX2 implementation of the mathSSE
I do not remember why we kept the SSE/AVX2 implementation when the EXT one was introduced and validated.
EDIT: Apparently I was (un)lucky: I ran on one file ("0") with one event, which happened to be the only one, out of 23, that flipped! So moving to EXTVECT makes Gsf different, but not SO different.
> It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

> By memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3.

@smuzaffar Probably more for future reference, where exactly are these specified? (I tried to look a little bit, but didn't find)
@makortel, these flags are defined in cmsdist. The default vectorization flag is sse3, and for selected packages it is then replaced by the correct vectorization flags.
I ran on all 23 events on ARM as well, and it is different from all the x86_64 versions I tested above.
My first conclusions were wrong: using f128 in SMatrix makes no difference, it is just 40 times slower.
So, incredibly, after only a few hours of work I managed to make GSF work with SMatrix using __float128: https://github.com/cms-sw/cmssw/compare/master...VinInn:cmssw:GSF128?expand=1 (I also had to patch ROOT to allow type conversion in SMatrix and SVector).
For the 28 events above, the results comparing sse, avx2 and f128 are:
[innocent@patatrack01 hlt]$ hltDiff -o see.root -n avx.root
Found 23 matching events, out of which 19 have different HLT results
Events Accepted Gained Lost Other Trigger
23 16 +7 -12 - HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o see.root -n f128.root # starting from multi arch build
Found 23 matching events, out of which 18 have different HLT results
Events Accepted Gained Lost Other Trigger
23 16 +7 -11 - HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o avx.root -n f128.root # both in multi arch
Found 23 matching events, out of which 1 have different HLT results
Events Accepted Gained Lost Other Trigger
23 11 +1 - - HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o see.root -n f128sse.root # starting from the sse build
Found 23 matching events, out of which 0 have different HLT results
(some more unit tests can be made for some specific matrices)
+hlt
HLT is now running online with CMSSW_14_0_6_MULTIARCHS together with the HLT menu GRun 2024 V1.1. The first stable-beams run with CMSSW_14_0_6_MULTIARCHS, using the "v1.1.X" HLT menu (see CMSHLT-3164), was run-380306 (Run2024D). No issues were spotted.

+1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
In 2024, HLT would like to run online using the x86-64-v3 micro-architecture level in CMSSW (see https://github.com/cms-sw/cmssw/issues/43652 and PDMVRELVALS-230 for more info). This is expected to bring an increase of 2.5% in the Run-3 HLT throughput, which is significant (see measurements by @fwyzard here).

The validation of this technical change is in progress. It is based on CMSSW_14_0_0_pre3 (reference) vs CMSSW_14_0_0_pre3_MULTIARCHS (target; for the target, one has to make sure that the machine in use supports x86-64-v3).

The POGs (and some PAGs) filed their validation reports, and they gave their greenlight (link).

TSG/STEAM (@sanuvarghese) compared HLT rates on 2023 data for CMSSW_14_0_0_pre3 vs CMSSW_14_0_0_pre3_MULTIARCHS (link), and found differences below 1% for most triggers, but several cases where differences in trigger decisions (considering also intermediate filters) exceeded 1%. This triggered further checks, which are being discussed in CMSHLT-3124 (worth reading, currently not long).

It was noticed that triggers using GSF tracks are amongst the ones showing the larger changes in trigger decisions (order of a few %). Some checks on a handful of events showed clear and reproducible differences in GSF tracks between CMSSW_14_0_0_pre3 and CMSSW_14_0_0_pre3_MULTIARCHS. One reproducer showing differences for GSF tracks can be found in CMSHLT-3124 (comment).

I'm opening this issue to try and get more feedback from sw and reco experts. Two generic questions below.
Attn: @cms-sw/egamma-pog-l2 @cms-sw/tracking-pog-l2 FYI: @cms-sw/hlt-l2 @goys @silviodonato