missirol commented 6 months ago

In 2024, HLT would like to run online using the x86-64-v3 micro-architecture level in CMSSW (see https://github.com/cms-sw/cmssw/issues/43652, PDMVRELVALS-230 for more info). This is expected to bring an increase of 2.5% in the Run-3 HLT throughput, which is significant (see measurements by @fwyzard here).

The validation of this technical change is in progress. It is based on CMSSW_14_0_0_pre3.

Reference = CMSSW_14_0_0_pre3.
Target = CMSSW_14_0_0_pre3_MULTIARCHS (for target, one has to make sure that the machine in use supports x86-64-v3).

The POGs (and some PAGs) filed their validation reports, and they gave their greenlight (link).

TSG/STEAM (@sanuvarghese) compared HLT rates on 2023 data for CMSSW_14_0_0_pre3 vs CMSSW_14_0_0_pre3_MULTIARCHS (link), and found differences below 1% for most triggers, but several cases where differences in trigger decisions (considering also intermediate filters) exceeded 1%. This triggered further checks, which are being discussed in CMSHLT-3124 (worth reading, currently not long).

It was noticed that triggers using GSF tracks are amongst the ones showing the larger changes in trigger decisions (order of a few %). Some checks on a handful of events showed clear and reproducible differences in GSF tracks between CMSSW_14_0_0_pre3 vs CMSSW_14_0_0_pre3_MULTIARCHS. One reproducer showing differences for GSF tracks can be found in CMSHLT-3124 (comment).

I'm opening this issue to try and get more feedback from sw and reco experts. Two generic questions below.

Are these discrepancies something which is reasonable to expect, or has been seen before, with this kind of update ?
Are there any suggestions on software changes that could reduce these differences ?

Attn: @cms-sw/egamma-pog-l2 @cms-sw/tracking-pog-l2 FYI: @cms-sw/hlt-l2 @goys @silviodonato

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @missirol.

@Dr15Jones, @sextonkennedy, @makortel, @smuzaffar, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 6 months ago

assign hlt, reconstruction

cmsbuild commented 6 months ago

New categories assigned: hlt,reconstruction

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 6 months ago

type egamma, tracking

slava77 commented 6 months ago

I'm guessing that similar to the intel-vs-AMD case in cms-sw/cmssw#40089 and related cms-sw/cmsdist#8280 there will be a connection to compilation flags especially in the fast-math related subset.

It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

@missirol @sanuvarghese please clarify if the tests you made were done on the same hardware (what kind of hardware was it?)

missirol commented 6 months ago

please clarify if the tests you made were done on the same hardware (what kind of hardware was it?)

Yes, I tested the reproducer in CMSHLT-3124 (comment) on lxplus806 (running with 14_0_0_pre3 and 14_0_0_pre3_MULTIARCHS on that same machine), and with this I see a difference in the trigger results of HLT_Ele30_WPTight_Gsf_v* between the two releases.

[lxplus806] > lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              10
On-line CPU(s) list: 0-9
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           10
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel Xeon Processor (Cascadelake)
Stepping:            6
CPU MHz:             2095.076
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-9
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke avx512_vnni md_clear arch_capabilities

fwyzard commented 6 months ago

It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

By memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3 .

This enables instructions like SSE4.2, AVX, AVX2 and FMA.

VinInn commented 6 months ago

and the only reasonable reason is FMA. If is FMA one cannot state that one is correct and the other wrong: they are two different approximations.

VinInn commented 6 months ago

An easy messy test is to copy the "standard" libraries for GSF.... into the x86-64-v3 directory and see if the difference fades away.

VinInn commented 6 months ago

a bit of Frankenstein : cmsrel CMSSW_14_0_0_pre3_MULTIARCHS cd CMSSW_14_0_0_pre3_MULTIARCHS/ cmsenv git cms-addpkg TrackingTools/GsfTracking scram b clean scram b -j 8 edmPluginRefresh edmPluginDump -f | grep innocent cd ../lib/el8_amd64_gcc12/ "#copy sse libs in avx2 dir...." cp * scram_x86-64-v3/.

and then the comparison shows no difference

makortel commented 6 months ago

It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

By memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3 .

@smuzaffar Probably more for future reference, where exactly are these specified? (I tried to look a little bit, but didn't find)

I'd argue FMA leading to ~3x difference in momentum is an indication of the algorithm(s) being (very) unstable. In a way the AMD vs Intel difference https://github.com/cms-sw/cmssw/issues/40089 was also a sign of such instabilities, but there the problem was removed by disabling the particular optimization. I guess in this case disabling the FMA would defeat the purpose though. In principle specific options such as -ffp-contract=off (e.g. https://stackoverflow.com/questions/42132260/generic-way-of-handling-fused-multiply-add-floating-point-inaccuracies and links therein) could be tested, but I'm not sure if that would really be worth of the effort.

VinInn commented 6 months ago

-ffp-contract=off will de-facto remove FMA defeating the purpose of the upgrade to AVX2. (it would require to manually introduce calls to "fma" everywhere is deemed useful).

GSF is unstable indeed in some corner of the phase space in particular for fake electrons and it is not proved at all that the more inaccurate answer is the one with FMA...

VinInn commented 6 months ago

btw I tested it on ARM that generates fma instructions (checked with objdump) and at least for that event it reproduces the trigger results of SSE not AVX2. We can check in all three cases the details of the GSF reconstruction

VinInn commented 6 months ago

TO BE FULLY VERIFIED!!!!

[innocent@patatrack01 src]$ git diff
diff --git a/DataFormats/Math/interface/SIMDVec.h b/DataFormats/Math/interface/SIMDVec.h
index e4c52cc301d..d2abc01b306 100644
--- a/DataFormats/Math/interface/SIMDVec.h
+++ b/DataFormats/Math/interface/SIMDVec.h
@@ -3,12 +3,8 @@

 #if (defined(__CLING__) || defined(__MIC__) || defined(__NVCC__)) || (__BIGGEST_ALIGNMENT__ < 16)
 #elif defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
-#if defined(__x86_64__) && defined(__SSE__)
-#define USE_SSEVECT
-#else
 #define USE_EXTVECT
 #endif
-#endif

 // to be moved elsewhere
 namespace mathSSE {

remove the difference between x86-64-v3 and sse3

1) contraction is different 2) there is a bug in the AVX2 implementation of the mathSSE

Do not remember why we kept the SSE/AVX2 implementation when the EXT one was introduced and validated

EDIT: Apparently I was (un)Lucky: I run on one file ("0") with one event which happen to be the only one, out of 23, that flipped! So moving to EXTVECT makes Gsf different but not SO different.

smuzaffar commented 6 months ago

It would be nice if @cms-sw/core-l2 could provide some accounting between SSE3 and other GCC defaults compared to what it implies for x86-64-v3.

By memory, the only difference in GCC flags is replacing -msse3 with -march=x86-64-v3 .

@smuzaffar Probably more for future reference, where exactly are these specified? (I tried to look a little bit, but didn't find)

@makortel , these flags are defined in cmsdist . Default vectorization flag is sse3 and then for selected packages it is replaced by the correct vectorization flags

VinInn commented 6 months ago

I run on all 23 events on ARM as well and it is different from all x86_64 versions I tested above.

VinInn commented 6 months ago

EDIT:

my first conclusions were wrong: using f128 in SMATRIX makes no difference: just 40 times slower.

So, incredibly only after few hours of work I manage to make GSF working with SMatrix using __f128 https://github.com/cms-sw/cmssw/compare/master...VinInn:cmssw:GSF128?expand=1 (had also to patch root to allow type conversion in SMatrix an SVector)

for the 28 events above the results comparing see, avx2 and f128 are:

[innocent@patatrack01 hlt]$ hltDiff -o see.root -n avx.root
Found 23 matching events, out of which 19 have different HLT results
      Events    Accepted      Gained        Lost       Other  Trigger
          23          16          +7         -12           -  HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o see.root -n f128.root  # starting from multi arch build
Found 23 matching events, out of which 18 have different HLT results
          23          16          +7         -11           -  HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o avx.root -n f128.root  # both in multi arch
Found 23 matching events, out of which 1 have different HLT results
      Events    Accepted      Gained        Lost       Other  Trigger
          23          11          +1           -           -  HLT_Ele30_WPTight_Gsf_v8
[innocent@patatrack01 hlt]$ hltDiff -o see.root -n f128sse.root   # starting from the sse build
Found 23 matching events, out of which 0 have different HLT results

(some more unit tests can be made for some specific matrices)

mmusich commented 4 months ago

+hlt

deployed online via CMSSW_14_0_6_MULTIARCHS together with HLT menu GRun 2024 V1.1. The first stable-beams run CMSSW_14_0_6_MULTIARCHS, using the "v1.1.X" HLT menu (see CMSHLT-3164), was run-380306 (Run2024D). No issues were spotted.
An executive summary of the deployment history after validation can be found in https://its.cern.ch/jira/browse/CMSHLT-3170.

jfernan2 commented 4 months ago

+1

cmsbuild commented 4 months ago

This issue is fully signed and ready to be closed.

makortel commented 4 months ago

@cmsbuild, please close

cms-sw / cmssw

HLT Validation of `x86-64-v3` #44550

EDIT: