LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0
22 stars 21 forks source link

compiler cleanliness patches and incorporation of HLS APT #1470

Closed tomeichlersmith closed 2 months ago

tomeichlersmith commented 2 months ago

replace this with a while loop instead to be more clear about the possibility of rounding errors affecting performance

For some reason, this only was shown on trunk after merging in unrelated (but in TrigScint?) code #1461 .

Check List

tomeichlersmith commented 2 months ago

Canceled the PR Validation since I am just making sure this builds with warnings-as-errors.

tvami commented 2 months ago

For some reason, this only was shown on trunk

Is it because the branch Rory used was not rebased to the version of trunk that had my update for the CI?

tomeichlersmith commented 2 months ago

That's why I think this wasn't caught while test-building that branch, but I still don't know why this wasn't caught in previous rounds of more strict building since this code has been there for awhile. :shrug:

Doesn't really matter, this is an easy fix.

tomeichlersmith commented 2 months ago

crap, this just revealed more issues

https://github.com/LDMX-Software/ldmx-sw/actions/runs/10960598567/job/30435575518?pr=1470

the good news is that it builds and runs when not treating warnings-as-errors and so its not super urgent

tvami commented 2 months ago

Some of the new stuff is in HLS_arbitrary_Precision_Types, this line https://github.com/LDMX-Software/ldmx-sw/blob/trunk/CMakeLists.txt#L37C36-L37C65 was supressing those in the past but I think with this addition https://github.com/LDMX-Software/ldmx-sw/blob/5598eaa3a4129dda761cf8375b1e948cdc76a6d3/TrigScint/CMakeLists.txt#L32-L37 it's resurfucing again

tomeichlersmith commented 2 months ago

I fixed up our internal stuff, all that's left is the HLS stuff.

/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/etc/ap_private.h:2109:32: warning: The left operand of '<<' is a garbage value [clang-analyzer-core.UndefinedBinaryOperatorResult]
            ? ((((int64_t)VAL) << (excess_bits)) >> (excess_bits))
                               ^
/home/tom/code/ldmx/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:293:14: note: Calling 'operator+<12, true, 12, true>'
  float pe = outTrk.Pad1.Seed.Amp + outTrk.Pad1.Sec.Amp;
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/ap_int_base.h:1320:1: note: Calling constructor for 'ap_int_base<13, true>'
OP_BIN_AP(+, plus)
^
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/ap_int_base.h:1310:37: note: expanded from macro 'OP_BIN_AP'
        _AP_W2, _AP_S2>::Rty##_base lhs(op);                                  \
                                    ^~~~~~~
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/ap_int_base.h:179:10: note: Calling default constructor for 'ssdm_int_sim<13, true>'
  INLINE ap_int_base(const ap_int_base<_AP_W2, _AP_S2>& op) {
         ^~~~~~~~~~~
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/ap_common.h:248:3: note: Calling default constructor for 'ap_private<13, true, true>'
  ssdm_int_sim() {}
  ^~~~~~~~~~~~
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/etc/ap_private.h:1595:5: note: Calling 'ap_private::clearUnusedBits'
    clearUnusedBits();
    ^~~~~~~~~~~~~~~~~
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/etc/ap_private.h:2108:9: note: '?' condition is true
        _AP_S
        ^
/home/tom/code/ldmx/ldmx-sw/TrigScint/../Trigger/HLS_arbitrary_Precision_Types/include/etc/ap_private.h:2109:32: note: The left operand of '<<' is a garbage value
            ? ((((int64_t)VAL) << (excess_bits)) >> (excess_bits))
                          ~~~  ^

which, as you point out @tvami , can probably be silenced by informing CMake that the HLS headers are not our problem.

tvami commented 2 months ago

So in the end, this is not just something we can hide under the rug, there is some issue in the input in

TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx

the track level seems fine but either at the outTrk.Pad1 or outTrk.Pad1.Seed or outTrk.Pad1.Seed.Amp (same with the Sec) there is something uninitialized. I didnt find it tho :(

tomeichlersmith commented 2 months ago

This gave me the idea to just use {} within the struct declarations telling C++ to use value initialization which avoids nonsense (random memory) values. e.g.

https://github.com/LDMX-Software/ldmx-sw/blob/6c08a49cb7a0bd8cff03f3e838f91f0536261c99/TrigScint/include/TrigScint/Firmware/objdef.h#L26-L30

I /think/ this got past the error on my local computer so I'm pushing it here to double check.

Edit: nevermind. This didn't work locally.

tvami commented 2 months ago

I was thinking if we could just set the defaults it would fix it, but given that this is C-style, I dont think we can do that. I think having

ap_int<12> mID{0}, bID{-1};

would actually solve the problem, but we cant do that, right?

tomeichlersmith commented 2 months ago

I don't know if this code needs to be C-style or if that was just most convenient when trying to write the emulation.

tvami commented 2 months ago

OK, I tried, it doesnt work anyway... :(

tvami commented 2 months ago

I tried if getting rid of calcTCent would resolve it, but then I get this (meaning that the issue roots in constructing the lookup table )

/home/vamitamas/patch-float-for-loop-counter/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:63:3: note: Loop condition is true.  Entering loop body
  for (int i = 0; i < NCENT; i++) {
  ^
/home/vamitamas/patch-float-for-loop-counter/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:64:5: note: Loop condition is true.  Entering loop body
    for (int j = 0; j < COMBO; j++) {
    ^
/home/vamitamas/patch-float-for-loop-counter/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:65:26: note: Calling 'operator-<12, true>'
      LOOKUP[i][j][0] = (i - A[1] + A[0]);
                         ^~~~~~~~
tvami commented 2 months ago

And honestly I dont understand what's happening here

  ap_int<12> A[3] = {0, 0, 0};
  ap_int<12> LOOKUP[NCENT][COMBO][2];

  LOOKUP[i][j][0] = (i - A[1] + A[0]);

isnt A[1] the same as A[0] and both are zeros?

rodwyer100 commented 2 months ago

A[i] is an alignment matrix. The idea is that the three layers (Pad1, Pad2, and Pad3) may be translated w.r.t eachother and said translation may be determined by the vector A as we only care how they are misaligned in one axis (the axis of granularity of the TS). The factor (i - A[1] + A[0]) only becomes nontrivial if A[1] and A[0] are different indicating an initially misaligned state. This will be important when we first start checking LDMX because if there is alignment issues we wont see it without scanning A.

tvami commented 2 months ago

I see, thanks @rodwyer100 ! Would it be ok to initialize the LOOKUP table to all 0 before you fill in the real values?

tvami commented 2 months ago

ehh

/home/vamitamas/patch-float-for-loop-counter/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:62:3: note: Loop condition is true.  Entering loop body
  for (int i = 0; i < NCENT; ++i) {
  ^
    for (int j = 0; j < COMBO; ++j) {
    ^
      for (int k = 0; k < 2; ++k) {
      ^
/home/vamitamas/patch-float-for-loop-counter/ldmx-sw/TrigScint/src/TrigScint/TrigScintFirmwareTracker.cxx:65:31: note: Calling constructor for 'ap_int<12>'
            LOOKUP[i][j][k] = ap_int<12>(0);
                              ^~~~~~~~~~~~~

it doesnt like anything with ap_int<12> constructor.

tvami commented 2 months ago

hey @tomeichlersmith @rodwyer100 @bryngemark At this point I think we should bring the HLS submodule to be part of the ldmx-sw, and fix the VAL thing (I did it locally and it resolves it). The HLS repo was last touched 5 years ago, and it's not very big [funny thing that Shazhad from CMS did try to patch it a year ago -- the PR is still unmerged]. I also think it would have a natural place in Framework or under Tools, given that it's used both by Trigger and TrigScint

rodwyer100 commented 2 months ago

It should be okay to initialize it with certain values. I didn't understand your most current comment. Did that not work?

rodwyer100 commented 2 months ago

Sorry now your second to most current. Edit: Ah I see. It doesn't like the ap_int<12> instantiation. I don't see why it wouldn't; Ill look at it but it should be allowed.

rodwyer100 commented 2 months ago

So something has been scrambled. I don't know which change did it, but I will need to inspect things before this gets commited. The track number distribution is very high; it is producing a high fake rate.

rodwyer100 commented 2 months ago

image I will need to look event by event to see why this occurs.

tomeichlersmith commented 2 months ago

Personally, I would rather keep a fork of HLS_arbitrary_precision and just redirect our submodule to our fork.

Similar to G4DarkBreM, I just think there is a possibility that this code would be used in other applications (LDMX or not) and so keeping it separate would be helpful.

tvami commented 2 months ago

rather keep a fork of HLS_arbitrary_precision

but it's such a small repo... having it as a submodule has the advantage that if it changes we can just update it whenever we think it's needed. But this did not and does not change even when it should (see the PR from Shahzad a year ago). And then for your argument about using in other applications: for outside ldmx I doubt people would take an ldmx-sw fork instead of having their own fork. And then for inside ldmx, isnt that an argument to have it in our software directly?

tomeichlersmith commented 2 months ago

Yea.. you've convinced me

I'll put them in Tools and update the CMake

tomeichlersmith commented 2 months ago

@rodwyer100 I am going to merge this even though it certainly contains the issues that you have observed just because there is a lot of code being introduced. I think your other PR https://github.com/LDMX-Software/ldmx-sw/pull/1473 can be a location for patches while you are introducing the hit stagger.

rodwyer100 commented 2 months ago

I figured out what happened. Its a small change thats needed which won't change what was done. I will include it in my hit branch alongside a rebased branch