assign l1, hlt
New categories assigned: l1,hlt
@Martin-Grunewald,@mmusich,@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks
cms-bot internal usage
A new Issue was created by @mmusich.
@sextonkennedy, @smuzaffar, @rappoccio, @makortel, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Posted in #44397 and #44433, copied here by request
I'm catching up after other work yesterday afternoon.
Melissa has seen that removing the CICADA emulator changes AXO's answers.
The CICADA team (myself, actually) caught this one earlier this week while looking at CICADA firmware/emulator matching. We could run multiple CICADA emulators at the same time and get different answers out of a model than we would running only one emulator. I couldn't figure it out and reverted to running only one at a time, since that is how the emulator works anyway, and I suspected it was some strange internal CICADA versioning issue that could be sorted out at some point later.
What I suspect now is that this issue is fundamentally a symbol collision. Strictly speaking, there isn't anything wrong with CICADA or AXO; the issue is that they share C++ code/symbols that are not specific to either model, in the dynamic-linking scheme used by the emulator technique.
My theory on what is happening is this:
In 14_1 we updated the CICADA model to 1.1.1. One of the layers of CICADA 1.1.1 uses weights w2, seen here: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/cicada.cpp#L36 https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/weights/w2.h#L12. Note that before, in model 2.1, w2 was not in use: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/myproject.cpp#L57
AXO, by extension, uses w2 (and b2) as well: https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/GTADModel_v3.cpp#L65 https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/weights.h#L5 (notably, they are also sized such that AXO could comfortably use CICADA's weights).
These weights are defined globally for both models. As the CMSSW job runs, CICADA gets loaded and created, and these weights are loaded into the symbol table by the dlopen here: https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/17790de8f2f2892dfd8ff20fead8eedd3cf59b49/src/hls4ml/emulator.cc#L21. Later, the AXO model gets created and its load is also attempted. I suspect that behind the scenes, dlopen sees that it would be loading symbols that are already defined in the symbol table and quietly drops them in favor of the existing definitions.
In essence, when we upgraded CICADA to 1.1.1, by no real fault of our own, we ended up accidentally switching AXO to CICADA's weights.
I suspect a quick and dirty solution would be to do to CICADA what AXOL1TL has accidentally been doing (and which, ironically, we've all been asking AXO to stop doing): create and destroy itself in one swoop, with no real persistence for model objects beyond producing a result.
In the longer run, the usage of this emulator technique needs to be fixed or changed.
Also adding my comment from the Test PR here:
Right, so this AXO fix is what I expected.
But what I also was anticipating and seemingly got confirmed is that other hls4ml emulators might also be affected. See https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_14_1_X_2024-03-15-2300+e4d5b6/61716/triggerResults/25034.999_TTbar_14TeV+2026D98PU_PMXS1S2PR/HLT.log
Processed events: 9 out of 10 (90%)
Found 10 matching events, out of which 1 have different HLT results

Events   Accepted   Gained   Lost   Other   Trigger
10       3          -        -1     -       pDoublePuppiTau52_52
this is a PHASE2 workflow that @aloeliger added recently for the p2GT emulator (thanks for this!!). We see that the Tau seed gets affected, and this is because it is based on an NN in HLS4ML…
so this is indeed something very hls4ml-specific. FYI @thesps @jmduarte @vloncar
Also fyi @eyigitba @slaurila
This Phase-2 thing also brings up another issue I have long wondered about: why do we run the Phase-1 L1 emulation in the Phase-2 workflows? From the L1 configuration it seems the SimL1 emulator always runs both phases.
These weights are defined globally for both models. As the CMSSW job runs, CICADA gets loaded and created, and these weights are loaded into the symbol table by the dlopen here: https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/17790de8f2f2892dfd8ff20fead8eedd3cf59b49/src/hls4ml/emulator.cc#L21. Later, the AXO model gets created and its load is also attempted. I suspect that behind the scenes, dlopen sees that it would be loading symbols that are already defined in the symbol table and quietly drops them in favor of the existing definitions.
This is exactly what should happen if dlopen() were called with RTLD_GLOBAL, but here it is called with RTLD_LOCAL, which was expected to not make the symbols "globally available":

"This is the converse of RTLD_GLOBAL, and the default if neither flag is specified. Symbols defined in this shared object are not made available to resolve references in subsequently loaded shared objects."

(from https://man7.org/linux/man-pages/man3/dlopen.3.html)
Well, a quick search indicates that even RTLD_LOCAL | RTLD_DEEPBIND is not always enough, given how GCC implements things: https://stackoverflow.com/questions/70660488/why-are-rtld-deepbind-and-rtld-local-not-preventing-collision-of-static-class-me . I also recall @fwyzard did some detailed study of the behavior of the various dlopen() options, but I wasn't able to find it now.
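For orientation, a minimal sketch of what loading two model plugins with these flags looks like (the library names are hypothetical, and this is not the actual hls4mlEmulatorExtras code):

```cpp
// Hypothetical sketch of loading two model shared objects, to illustrate the
// RTLD_LOCAL / RTLD_GLOBAL / RTLD_DEEPBIND discussion above. Link with -ldl.
#include <dlfcn.h>
#include <cstdio>

int main() {
  // RTLD_LOCAL (the default) is supposed to keep each plugin's symbols private;
  // RTLD_GLOBAL would export them for later symbol lookups.
  void* cicada = dlopen("libCICADA_model.so", RTLD_NOW | RTLD_LOCAL);
  if (!cicada) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

  // RTLD_DEEPBIND asks the loader to prefer this plugin's own definitions over
  // already-loaded ones, but as noted above this is not always sufficient with
  // how GCC/glibc handle the colliding symbols.
  void* axo = dlopen("libAXOL1TL_model.so", RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (!axo) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

  dlclose(axo);
  dlclose(cicada);
  return 0;
}
```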
Later, the AXO model gets created and its load is also attempted. I suspect that behind the scenes, dlopen sees that it would be loading symbols that are already defined in the symbol table and quietly drops them in favor of the existing definitions.
I don't think dlopen is to blame here: if I understand correctly the description of the issue(s), the underlying reason is that different models define the same global symbols with different types or values.
Having two different definitions of the same global symbol is a violation of the C++ One Definition Rule.
If those symbols are not supposed to be visible outside of each model, possible solutions are:
- declaring them static
- putting them in an anonymous namespace { ... }
If those symbols must be visible, the easiest approach to avoid conflicts is to move all the symbols into a different namespace, for each model and for each version of the model.
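A minimal sketch of these options, with hypothetical names and values (not the actual generated weight headers):

```cpp
// Option 1: internal linkage, the symbol never leaves its own shared object.
static float w2_local[4] = {0.25f, -0.5f, 0.125f, 0.0f};  // hypothetical values

// Option 2: anonymous namespace, same effect as static for these globals.
namespace {
  float b2_local[4] = {0.0f, 0.0f, 0.0f, 0.0f};  // hypothetical values
}

// Option 3: keep external visibility, but scope by model and version so two
// models (or two versions of one model) can never define the same symbol.
namespace CICADA_v1p1p1 {
  float w2[4] = {0.25f, -0.5f, 0.125f, 0.0f};  // hypothetical values
}
namespace AXOL1TL_v3 {
  float w2[4] = {1.0f, 0.0f, -1.0f, 0.5f};  // different model, no collision
}
```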
I suspect a quick and dirty solution would be to do to CICADA what AXOL1TL has accidentally been doing (and we've all been asking AXO to stop ironically), which is create and destroy itself in one swoop, with no real persistence for model objects beyond creating a result.
I don't think even this is guaranteed to work properly. What if two different models are being loaded and used concurrently in two threads?
I second @fwyzard's comment on using namespaces. In any case, they are the standard way to disambiguate otherwise identical symbols, and thus the easiest way to guarantee a properly working setup.
In addition, global variables such as
https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/weights/w2.h#L12
must be made const (or constexpr) when included in CMSSW.
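For example (a sketch; the real element type is a typedef'd HLS fixed-point type, and the names and values here are made up):

```cpp
// Hypothetical sketch: weight table declared constexpr so it is read-only
// when the model header is compiled inside CMSSW.
namespace CICADA_v1p1p1 {
  using weight2_t = float;  // stand-in for the generated fixed-point typedef
  constexpr weight2_t w2[4] = {0.25f, -0.5f, 0.125f, 0.0f};  // hypothetical values
}
```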
Actually, why do they need to be global variables ?
Can all the weights and other variables be declared as (possibly private) members of a class or struct ?
A model update that involves only a change in weights can reuse the same C++ type, and just use a different instance.
A model update that involves a change in the model structure would result in a different C++ type.
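A hedged sketch of that suggestion (names, sizes, and element types are placeholders, not the real model code):

```cpp
// Weights as members of a per-model type instead of globals (sketch only).
struct CicadaV2p1Weights {
  float w3[27];  // placeholder sizes and element type
  float b3[9];
};

// A weight-only update (e.g. 2.1.0 -> 2.1.1) reuses the same C++ type with a
// different instance:
inline const CicadaV2p1Weights weights_v2p1p0 = {/* 2.1.0 values */};
inline const CicadaV2p1Weights weights_v2p1p1 = {/* 2.1.1 values */};

// A change in the model structure would become a different C++ type:
struct CicadaV3Weights {
  float w3[54];  // different layer sizes, hence a different type
  float b3[18];
};
```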
These are questions for the HLS4ML developers; I can't speak to the viability of any of them. This C++ code was originally intended for synthesis on FPGAs, and it was only adapted for the emulators because it conveniently happens to be C++. I really don't know whether there are any of a dozen reasons why the HLS code may not work for an approach like this.
I can attempt to recreate the issue and insert namespaces around delicate pieces of the machinery to see if that fixes it, but no guarantees on that being a fast process.
Okay. I just want to make sure I have documented here some investigations I've done into this.
I looked at an instance of just running the CICADA emulator for version 2.1.1; it ran on a single thread, so no other model should have been running. In case it matters, it was running on a 2023 ZeroBias file: /store/data/Run2023D/ZeroBias/RAW/v1/000/369/869/00000/ebb4bfa3-c235-4534-95f5-5a83f52de1d2.root
I ran for 10 events, and got this as a set of baseline scores out of CICADA:
************************
* Row * anomalySc *
************************
* 0 * 0.1875 *
* 1 * 0 *
* 2 * 0 *
* 3 * 0.125 *
* 4 * 0 *
* 5 * 0.375 *
* 6 * 0.65625 *
* 7 * 0 *
* 8 * 0 *
* 9 * 0.6875 *
************************
I then set up a duplicate emulator to run, this time running an earlier model version, v2.1.0. To be explicit: v2.1.0 runs in the official emulator path and v2.1.1 was running in a duplicate emulator path, both in the same configuration. In any case, the results from the original 2.1.1 model then changed:
************************
* Row * anomalySc *
************************
* 0 * 0.09375 *
* 1 * 0.0625 *
* 2 * 0 *
* 3 * 0 *
* 4 * 0 *
* 5 * 0.09375 *
* 6 * 0.1875 *
* 7 * 0.03125 *
* 8 * 0.0625 *
* 9 * 0.09375 *
************************
I then tested recompiling the v2.1.0 model, this time with a namespace around the weight files (i.e. putting namespace CICADA_v2p1 around the weights (example weight file: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/weights/w7.h), and making the requisite changes in the actual model configuration where the weights are used in the model call). This compiles, and I reran the v2.1.0 + v2.1.1 bugged configuration. This time, I got the original answers back out of v2.1.1:
************************
* Row * anomalySc *
************************
* 0 * 0.1875 *
* 1 * 0 *
* 2 * 0 *
* 3 * 0.125 *
* 4 * 0 *
* 5 * 0.375 *
* 6 * 0.65625 *
* 7 * 0 *
* 8 * 0 *
* 9 * 0.6875 *
************************
It does seem like weight collisions, even in this sort of trivial example, are causing the model interference problem. CICADA and AXOL1TL could try to isolate these weights from each other as a first solution, but it will be a bit tedious, and I can't promise that other elements won't conflict somewhere else either.
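For reference, the namespacing change described above amounts to roughly the following (not the actual w7.h contents; the type and values are placeholders):

```cpp
// Before (sketch): the generated weight table sits in the global namespace,
// so an identically named symbol in another model's shared object can collide:
//   weight7_t w7[N] = { ... };

// After (sketch): the same table wrapped in a model/version namespace.
namespace CICADA_v2p1 {
  using weight7_t = float;  // stand-in for the generated fixed-point typedef
  weight7_t w7[4] = {0.1f, -0.2f, 0.3f, -0.4f};  // hypothetical values
}
```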
More documentation of investigation.
I have spoken with the HLS4ML developers and they are also surprised about all of this, so I'm in the middle of doing more tests.
Two ideas have come from the HLS4ML developers. The first is that there was already an HLS4ML model in CMSSW before we came up with this emulator technique, an NN Taus model, here: https://github.com/cms-sw/cmssw/tree/master/L1Trigger/Phase2L1ParticleFlow/interface/taus. They want to understand if having this around and already in CMSSW is itself responsible for interference behavior. The second is they want to see the effect of trying to namespace typedefs because that seems a likely spot for this issue to originate.
I had a few things I wanted to check as well, so I reintroduced the CICADA-CICADA model conflict, and put some debugging statements into the 2.1.1 model (currently being interfered with by a separate path running the 2.1.0 model).
With the interference present, this is its summary of its inputs, what it thinks one of its smaller weight layers looks like, and the output it thinks it's producing:
Begin processing the 1st record. Run 369869, Event 55045026, LumiSection 180 on stream 0 at 19-Mar-2024 04:03:09.360 CDT
Model 2.1.1 call!
I think my inputs are:
0, 0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0,
1, 1, 2, 2, 1, 3, 2, 0, 1, 0, 0, 0, 0, 0,
0, 0, 4, 4, 1, 2, 1, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 2, 4, 2, 1, 4, 4, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 3, 1, 0,
0, 0, 0, 1, 1, 11, 2, 2, 3, 0, 1, 5, 0, 1,
0, 1, 0, 1, 1, 6, 1, 2, 0, 0, 7, 0, 0, 2,
3, 0, 0, 1, 0, 4, 0, 4, 0, 0, 0, 2, 6, 0,
3, 0, 0, 0, 0, 0, 6, 0, 2, 1, 3, 0, 3, 5,
1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 2, 1, 2, 5, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 0, 0, 1, 0,
1, 6, 0, 0, 1, 2, 15, 4, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 1, 3, 0, 0, 0, 1, 0, 1, 2, 0,
0, 0, 1, 2, 7, 2, 4, 0, 0, 1, 0, 1, 2, 1,
2, 8, 0, 1, 2, 0, 1, 3, 2, 0, 5, 0, 2, 1,
0, 0, 4, 0, 1, 9, 2, 3, 1, 2, 3, 0, 0, 1,
2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0,
I think my w3 layer is:
0.0102539, 0.0644531, 0.0078125, 0.0649414, 0.00537109, 0.0146484, -0.00976563, 0.0302734, 0.0170898, -0.19043, 0.209473, -0.0107422, 0.0361328, -0.0126953, 0.0200195, -0.0395508, 0.0224609, -0.0722656, 0.0327148, -0.11377, -0.00683594, -0.0742188, 0.0644531, 0.0146484, 0.0229492, 0.00292969, 0.0195313,
I think my output is:
0.09375
If I remove the interference, I get this:
Model 2.1.1 call!
I think my inputs are:
0, 0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0,
1, 1, 2, 2, 1, 3, 2, 0, 1, 0, 0, 0, 0, 0,
0, 0, 4, 4, 1, 2, 1, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 2, 4, 2, 1, 4, 4, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 3, 1, 0,
0, 0, 0, 1, 1, 11, 2, 2, 3, 0, 1, 5, 0, 1,
0, 1, 0, 1, 1, 6, 1, 2, 0, 0, 7, 0, 0, 2,
3, 0, 0, 1, 0, 4, 0, 4, 0, 0, 0, 2, 6, 0,
3, 0, 0, 0, 0, 0, 6, 0, 2, 1, 3, 0, 3, 5,
1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 2, 1, 2, 5, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 0, 0, 1, 0,
1, 6, 0, 0, 1, 2, 15, 4, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 1, 3, 0, 0, 0, 1, 0, 1, 2, 0,
0, 0, 1, 2, 7, 2, 4, 0, 0, 1, 0, 1, 2, 1,
2, 8, 0, 1, 2, 0, 1, 3, 2, 0, 5, 0, 2, 1,
0, 0, 4, 0, 1, 9, 2, 3, 1, 2, 3, 0, 0, 1,
2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0,
I think my w3 layer is:
0.0209961, 0.129395, 0.015625, 0.129883, 0.0112305, 0.0292969, -0.019043, 0.0610352, 0.0341797, -0.380859, -0.581055, -0.0209961, 0.0727539, -0.0249023, 0.0405273, -0.0791016, 0.0454102, -0.144531, 0.0654297, -0.227051, -0.0131836, -0.147949, 0.128906, 0.0297852, 0.0458984, 0.00634766, 0.0390625,
I think my output is:
0.1875
The inputs themselves are not being interfered with (although both 2.1.0 and 2.1.1 use a 10-bit fixed int format for input); however, the weights of the layer are obviously different. Curiously, the first few weights in the interfering-model case seem to be almost exactly half of the weights in the no-interference case.
I think it's worth noting that none of the no-interference-case weights exactly match what is present in the CICADA 2.1.1 weight specification. But those specified weights are not expressible in the reduced precision of the FPGA types, while the quoted weights seem to be.
Also worth noting, given the approximate power-of-two difference: the interfering model, 2.1.0, defines these weights to be 10 bits long, all of them right of the decimal point, while the model being interfered with, 2.1.1, defines these weights to be 16 bits long, with 11 of those right of the decimal point.
2.1.0: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/defines.h#L26 2.1.1: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1p1/defines.h#L31
Which leads me to believe that the namespaced typedefs may be the most "accurate" solution to this issue.
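To illustrate what namespacing the typedefs would mean (a sketch; the real defines.h uses the HLS ap_fixed types, so the plain-integer stand-ins and names below are only for illustration):

```cpp
// Sketch: per-version namespaces around the precision typedefs, so 2.1.0 and
// 2.1.1 cannot silently share each other's types. The bit widths mirror the
// defines.h links above; the type representation itself is a stand-in.
#include <cstdint>

namespace CICADA_v2p1 {
  // 10 bits total, all of them fractional
  struct weight_t { int16_t raw; static constexpr int total_bits = 10, frac_bits = 10; };
}

namespace CICADA_v2p1p1 {
  // 16 bits total, 11 of them fractional
  struct weight_t { int16_t raw; static constexpr int total_bits = 16, frac_bits = 11; };
}
```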
Inserting those namespaces around the typedef'ed defines (https://github.com/cms-hls4ml/CICADA/blob/main/CICADA_v2p1p1/defines.h) and structs (https://github.com/cms-hls4ml/CICADA/blob/main/CICADA_v2p1p1/parameters.h) (and other requisite model usage areas), recompiling, and rerunning with the "interference" still present:
Model 2.1.1 call!
I think my inputs are:
0, 0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0,
1, 1, 2, 2, 1, 3, 2, 0, 1, 0, 0, 0, 0, 0,
0, 0, 4, 4, 1, 2, 1, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 2, 4, 2, 1, 4, 4, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 3, 1, 0,
0, 0, 0, 1, 1, 11, 2, 2, 3, 0, 1, 5, 0, 1,
0, 1, 0, 1, 1, 6, 1, 2, 0, 0, 7, 0, 0, 2,
3, 0, 0, 1, 0, 4, 0, 4, 0, 0, 0, 2, 6, 0,
3, 0, 0, 0, 0, 0, 6, 0, 2, 1, 3, 0, 3, 5,
1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 2, 1, 2, 5, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 0, 0, 1, 0,
1, 6, 0, 0, 1, 2, 15, 4, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 1, 3, 0, 0, 0, 1, 0, 1, 2, 0,
0, 0, 1, 2, 7, 2, 4, 0, 0, 1, 0, 1, 2, 1,
2, 8, 0, 1, 2, 0, 1, 3, 2, 0, 5, 0, 2, 1,
0, 0, 4, 0, 1, 9, 2, 3, 1, 2, 3, 0, 0, 1,
2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0,
I think the length of something in w3 layer is:
16
I think my w3 layer is:
0.0102539, 0.0644531, 0.0078125, 0.0649414, 0.00537109, 0.0146484, -0.00976563, 0.0302734, 0.0170898, -0.19043, 0.209473, -0.0107422, 0.0361328, -0.0126953, 0.0200195, -0.0395508, 0.0224609, -0.0722656, 0.0327148, -0.11377, -0.00683594, -0.0742188, 0.0644531, 0.0146484, 0.0229492, 0.00292969, 0.0195313,
I think my output is:
0.09375
Same interference as before.
Note I also added a statement about how big it thinks the type it is storing the weight in is (https://docs.amd.com/r/en-US/ug1399-vitis-hls/Other-Class-Methods-Operators-and-Data-Members). It correctly assesses this type as the 16-bit weight type from 2.1.1, and not as the 10-bit type from 2.1.0. That part is a little confusing to me.
Inserting the namespace around the weights again:
Begin processing the 1st record. Run 369869, Event 55045026, LumiSection 180 on stream 0 at 19-Mar-2024 04:54:21.900 CDT
Model 2.1.1 call!
I think my inputs are:
0, 0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0,
1, 1, 2, 2, 1, 3, 2, 0, 1, 0, 0, 0, 0, 0,
0, 0, 4, 4, 1, 2, 1, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 2, 4, 2, 1, 4, 4, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 3, 1, 0,
0, 0, 0, 1, 1, 11, 2, 2, 3, 0, 1, 5, 0, 1,
0, 1, 0, 1, 1, 6, 1, 2, 0, 0, 7, 0, 0, 2,
3, 0, 0, 1, 0, 4, 0, 4, 0, 0, 0, 2, 6, 0,
3, 0, 0, 0, 0, 0, 6, 0, 2, 1, 3, 0, 3, 5,
1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 2, 1, 2, 5, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 0, 0, 1, 0,
1, 6, 0, 0, 1, 2, 15, 4, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 1, 3, 0, 0, 0, 1, 0, 1, 2, 0,
0, 0, 1, 2, 7, 2, 4, 0, 0, 1, 0, 1, 2, 1,
2, 8, 0, 1, 2, 0, 1, 3, 2, 0, 5, 0, 2, 1,
0, 0, 4, 0, 1, 9, 2, 3, 1, 2, 3, 0, 0, 1,
2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0,
I think the length of something in w3 layer is:
16
I think my w3 layer is:
0.0209961, 0.129395, 0.015625, 0.129883, 0.0112305, 0.0292969, -0.019043, 0.0610352, 0.0341797, -0.380859, -0.581055, -0.0209961, 0.0727539, -0.0249023, 0.0405273, -0.0791016, 0.0454102, -0.144531, 0.0654297, -0.227051, -0.0131836, -0.147949, 0.128906, 0.0297852, 0.0458984, 0.00634766, 0.0390625,
I think my output is:
0.1875
This again removes the interference that was seen.
the HLS4ML developers and they are also surprised about all of this
Suggested readings:
The first is that there was already an HLS4ML model in CMSSW before we came up with this emulator technique, an NN Taus model, here: https://github.com/cms-sw/cmssw/tree/master/L1Trigger/Phase2L1ParticleFlow/interface/taus. They want to understand if having this around and already in CMSSW is itself responsible for interference behavior.
That model is surely affected by the same problem, since it uses plenty of unscoped variables in the global namespace.
Maybe https://github.com/cms-sw/cmssw/pull/43639 should have been reviewed more accurately.
The second is they want to see the effect of trying to namespace typedefs because that seems a likely spot for this issue to originate.
What do you mean by "namespace typedefs" ?
Yet more documentation of investigation into the issue:
I pulled the namespaces back out of the weights to re-introduce the interference, and this time cracked open the internals of the 2.1.0 model. 2.1.0 and 2.1.1 are just bug-fix versions of each other, but I was somewhat surprised to discover that their weights are actually the same for the w3 layer (https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/weights/w3.h#L12), so I decided to test whose weights are actually being used when I run 2.1.1.
I forcibly set the first three weights of the w3 layer to 0, and reran the output of the 2.1.1 model with the interference present:
Begin processing the 1st record. Run 369869, Event 55045026, LumiSection 180 on stream 0 at 19-Mar-2024 05:04:25.618 CDT
Model 2.1.1 call!
I think my inputs are:
0, 0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0,
1, 1, 2, 2, 1, 3, 2, 0, 1, 0, 0, 0, 0, 0,
0, 0, 4, 4, 1, 2, 1, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 2, 4, 2, 1, 4, 4, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 3, 1, 0,
0, 0, 0, 1, 1, 11, 2, 2, 3, 0, 1, 5, 0, 1,
0, 1, 0, 1, 1, 6, 1, 2, 0, 0, 7, 0, 0, 2,
3, 0, 0, 1, 0, 4, 0, 4, 0, 0, 0, 2, 6, 0,
3, 0, 0, 0, 0, 0, 6, 0, 2, 1, 3, 0, 3, 5,
1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 2, 1, 2, 5, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 4, 2, 1, 0, 0, 0, 1, 0,
1, 6, 0, 0, 1, 2, 15, 4, 1, 1, 1, 0, 0, 0,
0, 2, 1, 1, 1, 3, 0, 0, 0, 1, 0, 1, 2, 0,
0, 0, 1, 2, 7, 2, 4, 0, 0, 1, 0, 1, 2, 1,
2, 8, 0, 1, 2, 0, 1, 3, 2, 0, 5, 0, 2, 1,
0, 0, 4, 0, 1, 9, 2, 3, 1, 2, 3, 0, 0, 1,
2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0,
I think the length of something in w3 layer is:
16
I think my w3 layer is:
0, 0, 0, 0.0649414, 0.00537109, 0.0146484, -0.00976563, 0.0302734, 0.0170898, -0.19043, 0.209473, -0.0107422, 0.0361328, -0.0126953, 0.0200195, -0.0395508, 0.0224609, -0.0722656, 0.0327148, -0.11377, -0.00683594, -0.0742188, 0.0644531, 0.0146484, 0.0229492, 0.00292969, 0.0195313,
I think my output is:
0
Note that the first 3 weights of 2.1.1 have now been set to 0, even though that edit was only made in the 2.1.0 model.
I think what is happening here, in this scenario, is the following: 2.1.0 gets loaded first, its types are defined, and its weights populated. Later, when 2.1.1 gets loaded, the types are redefined, but the weights are not repopulated; they are repurposed, and I think reinterpreted/cast into the type the new model uses. In our case, what was 10 bits worth of fractional value gets an extra fractional bit shoved in front and integer bits appended to the left of that. In short, one model's weights are used by another while being cast into new types. The fact that this hasn't crashed something yet is likely just a coincidence.
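A toy arithmetic check of that reading, using the bit widths from the defines.h links and the first w3 weight quoted above (this only illustrates the fractional-bit mismatch, not the actual ap_fixed machinery):

```cpp
// Toy illustration (not the real ap_fixed implementation): the same raw bit
// pattern means different values when the number of fractional bits differs.
#include <cstdio>

int main() {
  // Suppose 2.1.0 has stored a w3 weight as the raw integer 21 in its
  // 10-fractional-bit format: it represents 21 / 2^10.
  const int raw = 21;
  const double as_v2p1p0 = raw / 1024.0;  // 0.0205078

  // If 2.1.1 picks up the same raw storage but interprets it with its own
  // 16-bit / 11-fractional-bit format, the value is roughly halved:
  const double as_v2p1p1 = raw / 2048.0;  // 0.0102539, the interfered value above

  // The correct 2.1.1 weight, quantized directly with 11 fractional bits,
  // would instead be 43 / 2^11:
  const double correct = 43 / 2048.0;     // 0.0209961, the no-interference value

  std::printf("%g %g %g\n", as_v2p1p0, as_v2p1p1, correct);
  return 0;
}
```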
I have prepared updates to the CICADA external (https://github.com/cms-hls4ml/CICADA/pull/3) and made PRs to the externals (https://github.com/cms-sw/cmsdist/pull/9087, https://github.com/cms-sw/cmsdist/pull/9088) to add namespaces to the weights and types, to prevent interference between CICADA and the other hls4ml triggers.
Some discussion with hls4ml developers is ongoing about this solution.
hi @aloeliger, from a very quick look at the changes, I noticed that the weights are now in different namespaces, but the myproject(...) function is not.
Is it safe to keep that in the global namespace, with the possibility of conflicts ?
hi @aloeliger, from a very quick look at the changes, I noticed that the weights are now in different namespaces, but the myproject(...) function is not.
Is it safe to keep that in the global namespace, with the possibility of conflicts ?
@fwyzard Yeah, that should be namespaced away too for the models where it is generically defined. It should probably be namespaced away even where it isn't, just for the sake of it. Thanks, good catch.
Okay. The recent updates to the externals PR (https://github.com/cms-hls4ml/CICADA/pull/4 and https://github.com/cms-hls4ml/CICADA/releases/tag/v1.3.1) namespace away the main function for CICADA as well, which should prevent this issue from happening in the future with CICADA.
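Conceptually, the entry-point change is analogous to the weight change (a sketch; the real myproject(...) signatures are generated by hls4ml and differ per model, so the argument types and sizes below are placeholders):

```cpp
// Sketch: the generated top-level function wrapped in per-model, per-version
// namespaces, so each shared object exports its own distinct "myproject".
namespace CICADA_v2p1p1 {
  void myproject(const float input[252], float score[1]) {  // 18x14 calo inputs
    score[0] = input[0];  // placeholder for the generated network body
  }
}

namespace AXOL1TL_v3 {
  void myproject(const float input[64], float score[1]) {  // placeholder size
    score[0] = input[0];  // placeholder for the generated network body
  }
}
```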
FYI there is also a similar PR adding protections to the AXOL1TL models:
- PR to master: AXOL1TL v3.0.2 cmsdist#9091
- Backport: [14_0_X] AXOL1TL v3.0.2 cmsdist#9092
(copying from #44510 (comment) ) To summarise the test results with the CICADA/AXO cmsdist PRs: 7/10 of the different trigger results are due to the Phase-1 workflows where AXO HLT trigger numbers change (from triggering on every event due to the collision to giving reasonable/expected values) and 3/10 are due to Phase-2 workflows where the NNTau algorithm was affected by the HLS4ML collision.
p.s. The CICADA update alone fixes both Phase-1 and Phase-2 trigger results: cms-sw/cmsdist#9087 (comment), while the AXO update only affected the Phase-1 workflows as expected: cms-sw/cmsdist#9091 (comment)
The Phase-2 effect seen from CICADA is likely because the Phase-2 modifications do not explicitly replace/remove the CICADA calo summary card emulator. This is solvable, but we should understand how entangled Phase 1 and Phase 2 are.
As of the merging of https://github.com/cms-sw/cmsdist/pull/9087 and https://github.com/cms-sw/cmsdist/pull/9088 (and https://github.com/cms-sw/cmsdist/pull/9091 and https://github.com/cms-sw/cmsdist/pull/9092), I believe the immediate crisis of this issue has been handled. However, the L1T community discussed yesterday the need for continued vigilance, going forward, on the problems that caused this issue.
+l1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
See https://github.com/cms-sw/cmssw/pull/44397#issuecomment-2001963521 for details.